CN106021626A - Data search method based on data mining - Google Patents

Data search method based on data mining

Info

Publication number
CN106021626A
CN106021626A (application CN201610600204.8A)
Authority
CN
China
Prior art keywords
matrix
text
data
approximation
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610600204.8A
Other languages
Chinese (zh)
Inventor
赖真霖
文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sixiang Lianchuang Technology Co Ltd
Original Assignee
Chengdu Sixiang Lianchuang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sixiang Lianchuang Technology Co Ltd filed Critical Chengdu Sixiang Lianchuang Technology Co Ltd
Priority to CN201610600204.8A priority Critical patent/CN106021626A/en
Publication of CN106021626A publication Critical patent/CN106021626A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data search method based on data mining. The method includes the steps of preprocessing the user's search terms, computing the inverse word frequency of the text vectors and an approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result. The method represents and understands the user's intention through the features of the search terms, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.

Description

Data search method based on data mining
Technical field
The present invention relates to data search, and in particular to a data search method based on data mining.
Background technology
With the development of the Internet, the volume of data a search engine must process keeps growing, and the performance requirements placed on search engines become ever higher. A user's search terms can be mined efficiently for latent intent: by classifying the search terms, the system can understand which domain the user wants to search, recommend applications in that domain that satisfy the user's needs, and thereby optimize the search results. To improve retrieval efficiency, information retrieval systems now generally adopt cloud-based parallel computing platforms and indexing techniques. The structural design, storage layout, retrieval operations, and dynamic update algorithms of the index have a great impact on retrieval efficiency. However, prior-art search engines do not process search terms well; in particular, they do not effectively predict user intent. Moreover, when adding to the index, newly added records must be merged with identical entries of the existing index; in HDFS this operation requires recreating blocks, and the replicas on every DataNode must also be updated. Deleting content from the index can only be done a whole entry at a time, which produces a large amount of I/O traffic and wastes considerable time and resources.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data search method based on data mining, comprising:
preprocessing the user's search terms, computing the inverse word frequency of the text vectors and the approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result.
Preferably, computing the inverse word frequency of the text vectors and the approximation degree matrix further comprises:
combining word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and the number of occurrences of each term in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics; the first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic approximation result and, if the approximation degree exceeds a preset semantic threshold, applies the semantic weight, otherwise applies no weighting, then combines all weights to obtain the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the approximation degrees with texts whose numbers are greater than its own are computed, which is sufficient to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled;
an optimized block-partitioned input scheme is then adopted and the high-dimensional matrix is divided into small sub-matrices for computation; the Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally, the result being one sub-matrix of matrix C, which is output to the specified offset position; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved, the computation on the symmetric tridiagonal matrix being ultimately converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data search method based on data mining, which uses the feature representation of the search terms to understand user intent, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of the data search method based on data mining according to an embodiment of the present invention.
Detailed description of the invention
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The present invention is described in conjunction with such embodiments, but the invention is not limited to any particular embodiment. The scope of the invention is limited only by the appended claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may also be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a data search method based on data mining. Fig. 1 is a flowchart of the data search method based on data mining according to an embodiment of the present invention.
According to the search terms input by the user, the present invention uses a pre-trained classifier to classify the expansion text of the search terms, sends the classification result to the application system, identifies the attributes of the search terms and the user intent, obtains templates and data, and finally presents them in the search engine in the form of a Web application.
The application system comprises:
Proxy server: prevents internal system ports from being exposed externally, supports high concurrency and load balancing, and also helps improve access speed.
Web server: uses the MVC framework, calls each service, assembles the data, and finally integrates it with style sheets.
Caching server: information with a low update frequency is stored in the caching server, so the database does not have to be queried repeatedly, relieving pressure on the database.
Attribute identification service: identifies subcategories and user intent, and selects an application template suited to the user's needs.
Domain resource databases: a separate resource base is set up for each different domain.
Database: adopts a master/slave structure; the master database is responsible for writes and the slave databases for reads, achieving read-write separation. In addition, the master databases need to be synchronized with each other, and the slave databases must be kept synchronized with the data of the master.
Retrieval server: when retrieval involves certain conditions or sorting is required, indexing techniques are used to quickly obtain results according to the search conditions.
For the classification of search terms, the present invention uses a search engine to expand a search term into text: the first k summary entries returned by the search engine for the term form the expansion, where k is the number of entries after expansion, and the classification result of the expanded text is taken as the classification result of the original search text.
In the search-term preprocessing stage, the search term is expanded and stored in vector form. Low-information results such as advertisements, pictures, and Web applications are discarded, web page tags are filtered out, and the first k summaries are extracted as the expansion result of the text. The expanded text is then preprocessed; preprocessing consists of word segmentation and removal of stop words. Each summary, i.e. one document, contains a set of feature items and is represented as a vector. In the training stage, feature selection and feature weighting are performed on the expansion information of the texts in the training set, and a classifier is built. Specifically, each search term in the training set is expanded in the preprocessing stage into its expansion text; a feature selection method is used to select representative feature items, reducing the dimensionality of the feature items by a large order of magnitude; the improved inverse word frequency method of the present invention, adapted to text classification, performs the feature weighting; finally, a text classification method is used to build the classifier for the test phase. In the classification phase, the search terms to be classified are classified. Specifically, the search term to be classified is first expanded into expansion text; the features extracted in the training stage are then used for dimensionality reduction; the feature weighting method represents the text as a feature vector; finally, the classifier optimized in the training stage classifies the feature vector of the expansion information, and the result is the category of the search term.
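As an illustration of this expansion-and-classification flow, the following sketch builds a normalized TF-IDF-style vector from the top-k result summaries and assigns the nearest category centroid. The helper fetch_top_k_summaries, the stop-word list, and the centroid-based classifier are assumptions used for illustration, not the invention's exact classifier.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of"}          # assumed stop-word list

def fetch_top_k_summaries(term, k=10):
    """Hypothetical helper: return the first k result summaries for `term`
    (in the patent these come from querying a search engine)."""
    raise NotImplementedError

def preprocess(text):
    """Segment into words and drop stop words (segmentation simplified to whitespace splitting)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def expand_and_vectorize(term, idf, k=10):
    """Expand the term with top-k summaries and build a normalized TF-IDF weighted vector."""
    tokens = []
    for summary in fetch_top_k_summaries(term, k):
        tokens.extend(preprocess(summary))
    tf = Counter(tokens)
    vec = {w: tf[w] * idf.get(w, 0.0) for w in tf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}

def classify(term, centroids, idf, k=10):
    """Assign the expanded term to the category whose centroid is closest
    (cosine similarity on sparse dict vectors)."""
    vec = expand_and_vectorize(term, idf, k)
    def cos(a, b):
        return sum(a.get(w, 0.0) * b.get(w, 0.0) for w in a)
    return max(centroids, key=lambda c: cos(vec, centroids[c]))
```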
The classifier model is adjusted by a feedback training algorithm. When the classifier judges which category a text belongs to, it also outputs a confidence value; if this value is below a threshold, the classification is judged unreliable and the text is recorded into the feedback-training data set. User action logs are analyzed offline on the big data platform, and the search terms whose corresponding Web applications receive fewer clicks than a predetermined click threshold are also recorded into the feedback-training data set. These data are then screened and labeled manually: some high-traffic search terms are placed directly in a high-frequency vocabulary, so that the next search can first be matched against the words in that vocabulary, and the remainder serves as the final data for learning feedback. The vectors of the retrained classifier are compared one by one with the vectors of the original classifier for approximation degree; if a similar vector exists, it is deleted from the original classifier and replaced with the new vector; if no similar vector is found, the new vector is added to the classifier.
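A minimal sketch of the vector-replacement step of this feedback training, assuming L2-normalized sparse vectors, cosine as the approximation degree, and an illustrative similarity threshold:

```python
def cosine(a, b):
    """Cosine similarity of two sparse (dict) vectors; assumes both are L2-normalized."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def merge_feedback(old_vectors, new_vectors, sim_threshold=0.8):
    """For each vector of the retrained classifier, replace the most similar
    old vector if it is close enough, otherwise add the new vector."""
    merged = list(old_vectors)
    for nv in new_vectors:
        best_i, best_sim = None, 0.0
        for i, ov in enumerate(merged):
            s = cosine(nv, ov)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i is not None and best_sim >= sim_threshold:
            merged[best_i] = nv          # delete the similar old vector, substitute the new one
        else:
            merged.append(nv)            # no similar vector found, add it
    return merged
```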
Using the classifier optimized in the training stage to classify the feature vector of the expansion information is further divided into building the inverse word frequency of the text vectors and computing the approximation degree matrix. The present invention combines word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and their occurrence counts in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics. The first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job. The second Map stage computes the semantic approximation result; if the approximation degree exceeds the preset semantic threshold, the semantic weight is applied, otherwise no weighting is applied, and all weights are combined to obtain the final semantic weight. The second Reduce stage normalizes the vectors.
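The two chained MapReduce jobs described above can be pictured with the plain-Python skeleton below. This is a minimal simulation, not Hadoop code: the knowledge-base lookup, the semantic similarity function, and the preset threshold are placeholder assumptions.

```python
import math
from collections import defaultdict

SEM_THRESHOLD = 0.5              # assumed preset semantic threshold

def job1_map(doc_id, text, knowledge_base):
    """First Map: segment the text, count term occurrences, and attach the
    position/semantic-span weights looked up in the domain knowledge base."""
    counts = defaultdict(int)
    for word in text.split():                       # word segmentation simplified
        counts[word] += 1
    for word, tf in counts.items():
        w = knowledge_base.get(word, {"pos": 1.0, "span": 1.0})
        yield word, (doc_id, tf, w["pos"], w["span"])

def job1_reduce(word, records, n_docs):
    """First Reduce: compute a (smoothed) inverse word frequency for the word and
    emit it with the per-document statistics as input for the second job."""
    records = list(records)
    idf = math.log(n_docs / (1 + len(records)))
    for doc_id, tf, pos_w, span_w in records:
        yield doc_id, (word, tf * idf, pos_w, span_w)

def job2_map(doc_id, entries, semantic_sim):
    """Second Map: apply the semantic weight only when the semantic
    approximation degree exceeds the preset threshold."""
    for word, tfidf, pos_w, span_w in entries:
        sem = pos_w * span_w if semantic_sim(word) > SEM_THRESHOLD else 1.0
        yield doc_id, (word, tfidf * sem)

def job2_reduce(doc_id, weighted_terms):
    """Second Reduce: normalize the document vector."""
    vec = dict(weighted_terms)
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return doc_id, {w: v / norm for w, v in vec.items()}
```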
For each text, only the approximation degrees with texts whose numbers are greater than its own need to be computed to obtain the whole matrix. Under the MapReduce environment, the parallel design is as follows:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled.
An optimized block-partitioned input scheme is then adopted, and the high-dimensional matrix is divided into small sub-matrices for computation. The Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally; the result is one sub-matrix of matrix C, which is output to the specified offset position. The Reduce stage computes the matrix C = AB. The symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations. The first K smallest eigenvectors of the symmetric tridiagonal matrix are solved; the computation on the symmetric tridiagonal matrix is ultimately converted into Laplacian matrix-vector multiplications. After the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
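For the block-partitioned multiplication C = AB, a NumPy sketch of what the Map and Reduce stages compute is given below; the block size and the routing of partial results by offset are assumptions, and the subsequent tridiagonalization/eigenvector step is not shown.

```python
import numpy as np

def map_blocks(A, B, block=256):
    """Map stage: cut A and B into sub-matrices and emit partial products,
    keyed by the (row, column) offset of the result block in C."""
    n, k = A.shape
    _, m = B.shape
    for i in range(0, n, block):
        for j in range(0, m, block):
            for t in range(0, k, block):
                part = A[i:i+block, t:t+block] @ B[t:t+block, j:j+block]
                yield (i, j), part

def reduce_blocks(partials, shape):
    """Reduce stage: sum the partial products arriving at the same offset,
    assembling C = A @ B."""
    C = np.zeros(shape)
    for (i, j), part in partials:
        C[i:i+part.shape[0], j:j+part.shape[1]] += part
    return C

# usage sketch
A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = reduce_blocks(map_blocks(A, B), (512, 512))
assert np.allclose(C, A @ B)
```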
In terms of index structure improvements, the index structure of the present invention comprises a base index, an increment index, a differential index, and a dictionary. Based on this structure, a file storage strategy sorted by word frequency is adopted, together with compressed encoding of the data in each index entry. For this index structure, corresponding procedures for index construction, update, deletion, and retrieval under the cloud platform are designed.
The index structure consists of the increment index, the differential index, and the dictionary running on the NameNode of HDFS, and the base index running on the DataNodes. The base index consists of multiple documents; each document consists of multiple index entries; each index entry contains one search term and multiple file lists. A file list consists of a file ID together with the number and positions of occurrences of the term in that file. The files are stored in HDFS and, after being split into blocks, are distributed over different DataNodes in the cluster. The differential index is stored on the NameNode of HDFS. An invalid-document list records the IDs of documents to be deleted from all document libraries; when responding to a retrieval request, the system filters out of the retrieval results the files corresponding to all IDs in this list. A differential index is generated when there are files to be deleted from the document library; when the differential index file reaches a certain threshold, the system updates the base index according to it. The increment index runs on each DataNode of HDFS and stores newly added index entries. When a new file is added to the library, an independent index is generated for that file; the independent index records, in one document, the occurrence information of all search terms appearing in that document. When the number of independent indexes reaches a certain threshold, a merge process is triggered that merges all independent indexes into a whole, i.e. one increment index. When the number of increment indexes reaches a certain threshold, a merge operation is also triggered that merges multiple increment indexes into the base index. In the index structure, the words in the dictionary are divided by frequency into high-frequency and low-frequency words, and words of different frequencies are stored in different ways in the memory of the HDFS NameNode, organized as hash tables. For a high-frequency word, its index is stored in an independent file; for low-frequency words, the entries of multiple terms are stored in the same file. A hash function maps the index information of each high-frequency word to an independent file, and maps the index information of multiple low-frequency words into one file. During retrieval, distributed retrieval is carried out in different ways according to the frequency attribute of the word.
The structure of an index entry is as follows:
<wordid, docnum, <docid, freq, <p1, p2, ..., pn>>, ..., <docid, freq, <p1, p2, ..., pn>>>
wordid denotes the search term, and docnum denotes that the term wordid has appeared in docnum documents. It is followed by docnum Doclists; each Doclist records the number and positions of occurrences of wordid in one file. docid is the ID of a document in which the term appears, freq is the number of times the term appears in that file, and it is followed by a list of length freq giving the position of each occurrence of the term.
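The entry layout can be mirrored by the following data structures; the class and field names are illustrative, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Doclist:
    docid: int                 # ID of a document containing the term
    freq: int                  # number of occurrences in that document
    positions: List[int] = field(default_factory=list)   # p1 .. p_freq

@dataclass
class IndexEntry:
    wordid: str                # the search term
    doclists: List[Doclist] = field(default_factory=list)

    @property
    def docnum(self) -> int:
        """Number of documents the term appears in."""
        return len(self.doclists)
```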
Two monotonically increasing lists exist in the above structure:
The increasing document ID list: (doc1, doc2, doc3, ..., docn).
The increasing position list: (p1, p2, p3, ..., pn).
The first item of each sequence is retained, and every later item is replaced by its difference from the previous item. After processing, the sequences are:
Document ID difference sequence: (doc1, doc2 − doc1, doc3 − doc2, ..., docn − docn-1).
Position difference sequence within a single file: (p1, p2 − p1, p3 − p2, ..., pn − pn-1).
After this transformation the value space of the sequences is smaller, so a more efficient compression method can be used.
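A minimal sketch of this difference transform and its inverse; the subsequent compression codec (e.g. variable-byte coding) is not shown.

```python
def delta_encode(seq):
    """Keep the first item; replace every later item with its gap to the previous one."""
    if not seq:
        return []
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

def delta_decode(deltas):
    """Inverse transform: a running sum restores the original increasing sequence."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# usage: document IDs (doc1..docn) and positions (p1..pn) are both handled this way
assert delta_decode(delta_encode([3, 7, 12, 30])) == [3, 7, 12, 30]
```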
After the search engine has collected a certain number of web pages into the local text library, the initial index needs to be built. Before building the index, the captured texts must be preprocessed; the most important step is word segmentation. The basic process of building the index is as follows:
Step 1. Segment the captured texts, then store the segmented files on the DataNodes of HDFS.
Step 2. Create a MapReduce job processor responsible for controlling the execution of the index-building job.
Step 3. The job processor creates map tasks as needed; the map tasks run on the DataNodes, process the documents in their respective blocks, and produce intermediate files.
Step 4. After the map tasks have finished, multiple reduce tasks start executing; they merge the intermediate files produced by the map tasks and generate the index files.
Step 5. Sort the search terms by their frequency attribute and word frequency in the dictionary, and use the following hash function to save the indexes of different terms into different files.
The hash function is expressed as SHA1(f(r)): the word-frequency rank of a low-frequency word is taken as the input of the function, and the output is the name of the file that stores the index of that word. Here r is the word-frequency rank of the term in the dictionary, and f(r) maps multiple terms to one positive integer according to their word-frequency rank; it is expressed as:
f(r) = r − r mod (10·α),  if N < r ≤ 10N
f(r) = r − r mod (20·α),  if 10N < r ≤ 50N
f(r) = r − r mod (40·α),  if 50N < r
N is a threshold: terms whose rank in the dictionary is greater than N are low-frequency words, and terms whose rank is less than or equal to N are high-frequency words. α is a predefined positive real parameter.
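A sketch of the file-name mapping reconstructed from the definition above; the values of N and α, the integer rounding of the bucket widths, and the use of a hex digest as the file name are assumptions.

```python
import hashlib

N = 1000          # assumed high/low-frequency rank boundary
ALPHA = 5         # assumed positive real parameter α

def f(r):
    """Map a low-frequency term's word-frequency rank r (assumed r > N)
    to a positive integer, grouping neighbouring ranks into the same bucket."""
    if N < r <= 10 * N:
        return r - r % int(10 * ALPHA)
    if 10 * N < r <= 50 * N:
        return r - r % int(20 * ALPHA)
    return r - r % int(40 * ALPHA)        # 50N < r

def index_filename(term, rank, high_frequency):
    """High-frequency words hash on the term itself (one file per word);
    low-frequency words hash on f(rank), so several terms share one file."""
    key = term if high_frequency else str(f(rank))
    return hashlib.sha1(key.encode("utf-8")).hexdigest() + ".idx"
```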
As the search engine continuously crawls Internet information, the index update process is as follows:
When the newly added texts in the text library reach a certain number, the system is triggered to analyze the texts and generate a corresponding number of independent increment indexes, which are stored in memory.
When the independent increment indexes in memory grow to a certain number, the independent-increment-index merge procedure is triggered; it merges the multiple entries of the same search term into one and saves them into the corresponding increment index file.
When the increment index files reach a certain number, the master-file merge procedure is triggered.
The operation that generates a single increment index is triggered by the web crawler: the texts newly added to the crawler system are analyzed, and the entries of all search terms in those texts are generated. This operation runs as MapReduce jobs on different DataNodes in HDFS, satisfying data locality as far as possible, and the generated single increment indexes are distributed across different DataNodes. The increment-index merge operation is triggered by the operation that generates single increment indexes: when the data in the memory of a DataNode reaches a threshold, this operation merges the single increment indexes into the increment index file in HDFS. The mapping from search term to increment index file name is done by the hash function described above. The merge of increment indexes into the base index is triggered by the single-increment-index and increment-index merge operations.
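A compact sketch of the three update triggers; the threshold values and the helper functions build_independent_index and merge_entries are placeholders, not part of the patent.

```python
INDEPENDENT_LIMIT = 64        # assumed: independent indexes held in memory before a merge
INCREMENT_LIMIT = 16          # assumed: increment index files before a base-index merge

class IndexUpdater:
    """Drives the crawl-time update path: new text -> independent index ->
    increment index file -> base index."""
    def __init__(self, base, hdfs):
        self.base, self.hdfs = base, hdfs
        self.independent = []                 # in-memory single-file indexes
        self.increment_files = []

    def on_new_text(self, doc_id, text):
        # build_independent_index is an assumed helper that indexes one document
        self.independent.append(build_independent_index(doc_id, text))
        if len(self.independent) >= INDEPENDENT_LIMIT:
            self._merge_independent()

    def _merge_independent(self):
        # merge_entries is an assumed helper that folds same-term entries into one
        inc = merge_entries(self.independent)
        self.increment_files.append(self.hdfs.write(inc))
        self.independent.clear()
        if len(self.increment_files) >= INCREMENT_LIMIT:
            self._merge_into_base()

    def _merge_into_base(self):
        for f in self.increment_files:
            self.base.merge(self.hdfs.read(f))
        self.increment_files.clear()
```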
The present invention carries out information deletion based on the differential index. The threshold P1 that triggers the deletion operation when increment indexes are merged is defined as:
P1 = N1/N2
where N1 is the number of deleted documents contained in the increment index and N2 is the number of elements in the invalid-document list of the differential index.
The threshold P2 that triggers the deletion operation when the increment index is merged into the base index is defined as:
P2 = N1/N3
where N1 is likewise the number of elements in the invalid-document list of the differential index, and N3 is the number of all documents in the document set.
When an increment-index merge operation occurs, the system first judges whether threshold P1 has been reached. If it has, all differential indexes are first read from the differential index file, all the information contained in these differential indexes is deleted from the increment index, and the increment index is then merged into the base index. If the threshold has not been reached, the increment index is merged directly into the base index. After merging into the base index, the system then judges whether threshold P2 has been reached; if it has, the information in all differential indexes is deleted from the base index.
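The P1/P2 trigger logic might be sketched as follows; the threshold values and the index object interfaces are placeholders, not part of the invention.

```python
P1_THRESHOLD = 0.2        # assumed trigger value for P1
P2_THRESHOLD = 0.2        # assumed trigger value for P2

def merge_increment_into_base(increment, base, differential):
    """Apply the differential (deletion) index around an increment-index merge,
    following the P1/P2 checks described above."""
    n1 = increment.count_deleted(differential.invalid_docs)
    n2 = len(differential.invalid_docs)
    if n2 and n1 / n2 >= P1_THRESHOLD:                    # P1 = N1 / N2
        increment.remove(differential.invalid_docs)       # purge before merging
    base.merge(increment)

    n1 = len(differential.invalid_docs)
    n3 = base.total_documents()
    if n3 and n1 / n3 >= P2_THRESHOLD:                    # P2 = N1 / N3
        base.remove(differential.invalid_docs)            # purge the base index
        differential.clear()                              # this period's deletions are done
```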
In this way, when the system responds to a retrieval request, it first obtains the retrieval result from the base index and the increment index, and then filters out the deleted document information according to the invalid-document list. After the differential index of a certain time period has been used to update all the information in the increment index and the base index, that period's differential index is deleted from the file, and all the document IDs it deleted are removed from the invalid-document list.
Index retrieval is the key operation by which the search engine system responds to a retrieval request. After the search engine receives a retrieval request, it analyzes and processes the input search term information and then retrieves the index with the processed search terms. The operating steps are described as follows, with a consolidated sketch after the step list:
Step 1. Look up the search term in the dictionary. First obtain, via the hash function, the position in memory of the term's basic information, then read the basic information and check its high/low-frequency field in the dictionary. If the term is a high-frequency word, go to step 2; if it is a low-frequency word, go to step 3.
Step 2. Call the hash function with the search term as input to obtain the name of the file that stores the index of the term.
Step 3. Call the hash function with the word-frequency rank of the search term as input to obtain the name of the file that stores the index of the term.
Step 4. Retrieve the base index according to the obtained HDFS file name. First create a task manager and execute a MapReduce job, running a certain number of Map tasks and Reduce tasks on each DataNode of the system. The Map tasks process the local file block data, extract data items from it, and save them as intermediate data, which is then passed to the Reduce tasks for merging. The output of the Reduce tasks is the initial retrieval result.
Step 5. Using the same method as step 4, perform MapReduce retrieval on the increment index files, and merge the results with the initial result obtained in the previous step.
Step 6. Retrieve the independent increment indexes from memory and merge the retrieval result with the result obtained in the previous step.
Step 7. Retrieve the differential index on the NameNode and, according to the invalid-document list, filter the result obtained in the previous step.
Step 8. Sort all obtained document lists by a predefined algorithm.
Step 9. Return the sorted result.
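Putting the nine steps together, the retrieval dispatch could look like the sketch below, reusing the index_filename sketch given earlier; all index objects and their methods are placeholders for the distributed components described above, and results are assumed to be sets of document IDs.

```python
def retrieve(term, dictionary, base, increments, memory_increments, differential, rank_fn):
    """Steps 1-9: dictionary lookup, frequency-dependent file-name hashing,
    MapReduce retrieval of base and increment indexes, filtering, sorting."""
    info = dictionary.lookup(term)                                       # step 1
    if info.high_frequency:
        fname = index_filename(term, info.rank, high_frequency=True)     # step 2
    else:
        fname = index_filename(term, info.rank, high_frequency=False)    # step 3
    results = base.mapreduce_search(fname, term)                         # step 4
    for inc in increments:                                               # step 5
        results |= inc.mapreduce_search(fname, term)
    results |= memory_increments.search(term)                            # step 6
    results -= set(differential.invalid_docs)                            # step 7
    return sorted(results, key=rank_fn, reverse=True)                    # steps 8-9
```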
In summary, the present invention proposes a data search method based on data mining, which uses the feature representation of the search terms to understand user intent, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented by a general-purpose computing system; they may be concentrated in a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be implemented with program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are only intended to exemplify or explain the principles of the present invention and are not to be construed as limiting it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall be included within the protection scope of the present invention. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims (2)

1. A data search method based on data mining, characterized in that it comprises:
preprocessing the user's search terms, computing the inverse word frequency of the text vectors and the approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result.
2. The method according to claim 1, characterized in that computing the inverse word frequency of the text vectors and the approximation degree matrix further comprises:
combining word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and the number of occurrences of each term in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics; the first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic approximation result and, if the approximation degree exceeds a preset semantic threshold, applies the semantic weight, otherwise applies no weighting, then combines all weights to obtain the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the approximation degrees with texts whose numbers are greater than its own are computed, which is sufficient to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled;
an optimized block-partitioned input scheme is then adopted and the high-dimensional matrix is divided into small sub-matrices for computation; the Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally, the result being one sub-matrix of matrix C, which is output to the specified offset position; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved, the computation on the symmetric tridiagonal matrix being ultimately converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
CN201610600204.8A 2016-07-27 2016-07-27 Data search method based on data mining Pending CN106021626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610600204.8A CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610600204.8A CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Publications (1)

Publication Number Publication Date
CN106021626A true CN106021626A (en) 2016-10-12

Family

ID=57113982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610600204.8A Pending CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Country Status (1)

Country Link
CN (1) CN106021626A (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴鹏 (Wu Peng): "Research on Key Technologies for Automatic Construction of a Domain Knowledge Base Based on Hadoop", China Master's Theses Full-text Database *
高扬 (Gao Yang): "Research and Application of Short Text Classification in Search Engines", China Master's Theses Full-text Database *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528540A (en) * 2016-12-16 2017-03-22 广州索答信息科技有限公司 Word segmentation method and word segmentation system for seed questions
CN107368464A (en) * 2017-07-28 2017-11-21 深圳数众科技有限公司 A kind of method and device for obtaining bid product information
CN108415903A (en) * 2018-03-12 2018-08-17 武汉斗鱼网络科技有限公司 Judge evaluation method, storage medium and the equipment of search intention identification validity
EP4191434A1 (en) * 2019-04-19 2023-06-07 Fujitsu Limited Identification method, generation method, dimensional compression method, display method, and information processing device
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111611293B (en) * 2020-04-24 2023-09-29 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111639099A (en) * 2020-06-09 2020-09-08 武汉虹旭信息技术有限责任公司 Full-text indexing method and system
CN112686313A (en) * 2020-12-31 2021-04-20 江西理工大学 Improved parallel deep forest classification method based on information theory
CN112686313B (en) * 2020-12-31 2022-05-13 韶关学院 Improved parallel deep forest classification method based on information theory

Similar Documents

Publication Publication Date Title
CN106021626A (en) Data search method based on data mining
CN106156357A (en) Text data beam search method
US20070294223A1 (en) Text Categorization Using External Knowledge
US9305100B2 (en) Object oriented data and metadata based search
EP2041669B1 (en) Text categorization using external knowledge
US8131684B2 (en) Adaptive archive data management
US7783626B2 (en) Pipelined architecture for global analysis and index building
CN106202552A (en) Data search method based on cloud computing
CN107729336A (en) Data processing method, equipment and system
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
US20110264997A1 (en) Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text
CN107844493B (en) File association method and system
Wu et al. Efficient near-duplicate detection for q&a forum
CN106503274A (en) A kind of Data Integration and searching method and server
CN103744954A (en) Word relevancy network model establishing method and establishing device thereof
Uma et al. Noise elimination from web pages for efficacious information retrieval
US9552415B2 (en) Category classification processing device and method
CN108170799A (en) A kind of Frequent episodes method for digging of mass data
Chopra et al. A survey on improving the efficiency of different web structure mining algorithms
US10671810B2 (en) Citation explanations
Huang et al. Design a batched information retrieval system based on a concept-lattice-like structure
JP4189387B2 (en) Knowledge search system, knowledge search method and program
CN106934007B (en) Associated information pushing method and device
Rashmi et al. Deep web crawler: exploring and re-ranking of web forms
Caldeira et al. Experimental evaluation among reblocking techniques applied to the entity resolution

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012