CN106021626A - Data search method based on data mining - Google Patents

Data search method based on data mining

Info

Publication number
CN106021626A
CN106021626A (application CN201610600204.8A)
Authority
CN
China
Prior art keywords
matrix
text
data
approximation
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610600204.8A
Other languages
Chinese (zh)
Inventor
赖真霖
文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sixiang Lianchuang Technology Co Ltd
Original Assignee
Chengdu Sixiang Lianchuang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sixiang Lianchuang Technology Co Ltd filed Critical Chengdu Sixiang Lianchuang Technology Co Ltd
Priority to CN201610600204.8A priority Critical patent/CN106021626A/en
Publication of CN106021626A publication Critical patent/CN106021626A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data search method based on data mining. The method includes the steps of preprocessing the user's search terms, computing the inverse word frequency of the text vectors and an approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result. The method represents and understands the user's intention through the features of the search terms, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.

Description

Data search method based on data mining
Technical field
The present invention relates to data search, and in particular to a data search method based on data mining.
Background technology
With the development of the Internet, the volume of data a search engine must process keeps growing, and the performance requirements placed on search engines become ever higher. A user's search terms can be mined efficiently for latent intent: by classifying the search terms, the system can understand which domain the user wants to search, recommend applications in that domain that satisfy the user's needs, and thereby optimize the search results. To improve retrieval efficiency, information retrieval systems now generally adopt cloud-based parallel computing platforms and indexing techniques. The structural design, storage layout, retrieval operations, and dynamic update algorithms of the index have a great impact on retrieval efficiency. However, prior-art search engines do not process search terms well; in particular, they do not effectively predict user intent. Moreover, when adding to the index, newly added records must be merged with identical entries of the existing index; in HDFS this operation requires recreating blocks, and the replicas on every DataNode must also be updated. Deleting content from the index can only be done a whole entry at a time, which produces a large amount of I/O traffic and wastes considerable time and resources.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data search method based on data mining, comprising:
preprocessing the user's search terms, computing the inverse word frequency of the text vectors and the approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result.
Preferably, computing the inverse word frequency of the text vectors and the approximation degree matrix further comprises:
combining word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and the number of occurrences of each term in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics; the first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic approximation result and, if the approximation degree exceeds a preset semantic threshold, applies the semantic weight, otherwise applies no weighting, then combines all weights to obtain the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the approximation degrees with texts whose numbers are greater than its own are computed, which is sufficient to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled;
an optimized block-partitioned input scheme is then adopted and the high-dimensional matrix is divided into small sub-matrices for computation; the Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally, the result being one sub-matrix of matrix C, which is output to the specified offset position; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved, the computation on the symmetric tridiagonal matrix being ultimately converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a data search method based on data mining, which uses the feature representation of the search terms to understand user intent, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of the data search method based on data mining according to an embodiment of the present invention.
Detailed description of the invention
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawings that illustrate the principles of the invention. The present invention is described in conjunction with such embodiments, but the invention is not limited to any particular embodiment. The scope of the invention is limited only by the appended claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may also be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a data search method based on data mining. Fig. 1 is a flowchart of the data search method based on data mining according to an embodiment of the present invention.
According to the search terms input by the user, the present invention uses a pre-trained classifier to classify the expansion text of the search terms, sends the classification result to the application system, identifies the attributes of the search terms and the user intent, obtains templates and data, and finally presents them in the search engine in the form of a Web application.
The application system comprises:
Proxy server: prevents internal system ports from being exposed externally, supports high concurrency and load balancing, and also helps improve access speed.
Web server: uses the MVC framework, calls each service, assembles the data, and finally integrates it with style sheets.
Caching server: information with a low update frequency is stored in the caching server, so the database does not have to be queried repeatedly, relieving pressure on the database.
Attribute identification service: identifies subcategories and user intent, and selects an application template suited to the user's needs.
Domain resource databases: a separate resource base is set up for each different domain.
Database: adopts a master/slave structure; the master database is responsible for writes and the slave databases for reads, achieving read-write separation. In addition, the master databases need to be synchronized with each other, and the slave databases must be kept synchronized with the data of the master.
Retrieval server: when retrieval involves certain conditions or sorting is required, indexing techniques are used to quickly obtain results according to the search conditions.
For the classification of search terms, the present invention uses a search engine to expand a search term into text: the first k summary entries returned by the search engine for the term form the expansion, where k is the number of entries after expansion, and the classification result of the expanded text is taken as the classification result of the original search text.
In the search-term preprocessing stage, the search term is expanded and stored in vector form. Low-information results such as advertisements, pictures, and Web applications are discarded, web page tags are filtered out, and the first k summaries are extracted as the expansion result of the text. The expanded text is then preprocessed; preprocessing consists of word segmentation and removal of stop words. Each summary, i.e. one document, contains a set of feature items and is represented as a vector. In the training stage, feature selection and feature weighting are performed on the expansion information of the texts in the training set, and a classifier is built. Specifically, each search term in the training set is expanded in the preprocessing stage into its expansion text; a feature selection method is used to select representative feature items, reducing the dimensionality of the feature items by a large order of magnitude; the improved inverse word frequency method of the present invention, adapted to text classification, performs the feature weighting; finally, a text classification method is used to build the classifier for the test phase. In the classification phase, the search terms to be classified are classified. Specifically, the search term to be classified is first expanded into expansion text; the features extracted in the training stage are then used for dimensionality reduction; the feature weighting method represents the text as a feature vector; finally, the classifier optimized in the training stage classifies the feature vector of the expansion information, and the result is the category of the search term.
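As an illustration of this expansion-and-classification flow, the following sketch builds a normalized TF-IDF-style vector from the top-k result summaries and assigns the nearest category centroid. The helper fetch_top_k_summaries, the stop-word list, and the centroid-based classifier are assumptions used for illustration, not the invention's exact classifier.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of"}          # assumed stop-word list

def fetch_top_k_summaries(term, k=10):
    """Hypothetical helper: return the first k result summaries for `term`
    (in the patent these come from querying a search engine)."""
    raise NotImplementedError

def preprocess(text):
    """Segment into words and drop stop words (segmentation simplified to whitespace splitting)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def expand_and_vectorize(term, idf, k=10):
    """Expand the term with top-k summaries and build a normalized TF-IDF weighted vector."""
    tokens = []
    for summary in fetch_top_k_summaries(term, k):
        tokens.extend(preprocess(summary))
    tf = Counter(tokens)
    vec = {w: tf[w] * idf.get(w, 0.0) for w in tf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}

def classify(term, centroids, idf, k=10):
    """Assign the expanded term to the category whose centroid is closest
    (cosine similarity on sparse dict vectors)."""
    vec = expand_and_vectorize(term, idf, k)
    def cos(a, b):
        return sum(a.get(w, 0.0) * b.get(w, 0.0) for w in a)
    return max(centroids, key=lambda c: cos(vec, centroids[c]))
```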
The classifier model is adjusted by a feedback training algorithm. When the classifier judges which category a text belongs to, it also outputs a confidence value; if this value is below a threshold, the classification is judged unreliable and the text is recorded into the feedback-training data set. User action logs are analyzed offline on the big data platform, and the search terms whose corresponding Web applications receive fewer clicks than a predetermined click threshold are also recorded into the feedback-training data set. These data are then screened and labeled manually: some high-traffic search terms are placed directly in a high-frequency vocabulary, so that the next search can first be matched against the words in that vocabulary, and the remainder serves as the final data for learning feedback. The vectors of the retrained classifier are compared one by one with the vectors of the original classifier for approximation degree; if a similar vector exists, it is deleted from the original classifier and replaced with the new vector; if no similar vector is found, the new vector is added to the classifier.
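A minimal sketch of the vector-replacement step of this feedback training, assuming L2-normalized sparse vectors, cosine as the approximation degree, and an illustrative similarity threshold:

```python
def cosine(a, b):
    """Cosine similarity of two sparse (dict) vectors; assumes both are L2-normalized."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def merge_feedback(old_vectors, new_vectors, sim_threshold=0.8):
    """For each vector of the retrained classifier, replace the most similar
    old vector if it is close enough, otherwise add the new vector."""
    merged = list(old_vectors)
    for nv in new_vectors:
        best_i, best_sim = None, 0.0
        for i, ov in enumerate(merged):
            s = cosine(nv, ov)
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i is not None and best_sim >= sim_threshold:
            merged[best_i] = nv          # delete the similar old vector, substitute the new one
        else:
            merged.append(nv)            # no similar vector found, add it
    return merged
```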
Using the classifier optimized in the training stage to classify the feature vector of the expansion information is further divided into building the inverse word frequency of the text vectors and computing the approximation degree matrix. The present invention combines word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and their occurrence counts in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics. The first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job. The second Map stage computes the semantic approximation result; if the approximation degree exceeds the preset semantic threshold, the semantic weight is applied, otherwise no weighting is applied, and all weights are combined to obtain the final semantic weight. The second Reduce stage normalizes the vectors.
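The two chained MapReduce jobs described above can be pictured with the plain-Python skeleton below. This is a minimal simulation, not Hadoop code: the knowledge-base lookup, the semantic similarity function, and the preset threshold are placeholder assumptions.

```python
import math
from collections import defaultdict

SEM_THRESHOLD = 0.5              # assumed preset semantic threshold

def job1_map(doc_id, text, knowledge_base):
    """First Map: segment the text, count term occurrences, and attach the
    position/semantic-span weights looked up in the domain knowledge base."""
    counts = defaultdict(int)
    for word in text.split():                       # word segmentation simplified
        counts[word] += 1
    for word, tf in counts.items():
        w = knowledge_base.get(word, {"pos": 1.0, "span": 1.0})
        yield word, (doc_id, tf, w["pos"], w["span"])

def job1_reduce(word, records, n_docs):
    """First Reduce: compute a (smoothed) inverse word frequency for the word and
    emit it with the per-document statistics as input for the second job."""
    records = list(records)
    idf = math.log(n_docs / (1 + len(records)))
    for doc_id, tf, pos_w, span_w in records:
        yield doc_id, (word, tf * idf, pos_w, span_w)

def job2_map(doc_id, entries, semantic_sim):
    """Second Map: apply the semantic weight only when the semantic
    approximation degree exceeds the preset threshold."""
    for word, tfidf, pos_w, span_w in entries:
        sem = pos_w * span_w if semantic_sim(word) > SEM_THRESHOLD else 1.0
        yield doc_id, (word, tfidf * sem)

def job2_reduce(doc_id, weighted_terms):
    """Second Reduce: normalize the document vector."""
    vec = dict(weighted_terms)
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return doc_id, {w: v / norm for w, v in vec.items()}
```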
For each text, only the approximation degrees with texts whose numbers are greater than its own need to be computed to obtain the whole matrix. Under the MapReduce environment, the parallel design is as follows:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled.
An optimized block-partitioned input scheme is then adopted, and the high-dimensional matrix is divided into small sub-matrices for computation. The Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally; the result is one sub-matrix of matrix C, which is output to the specified offset position. The Reduce stage computes the matrix C = AB. The symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations. The first K smallest eigenvectors of the symmetric tridiagonal matrix are solved; the computation on the symmetric tridiagonal matrix is ultimately converted into Laplacian matrix-vector multiplications. After the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
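For the block-partitioned multiplication C = AB, a NumPy sketch of what the Map and Reduce stages compute is given below; the block size and the routing of partial results by offset are assumptions, and the subsequent tridiagonalization/eigenvector step is not shown.

```python
import numpy as np

def map_blocks(A, B, block=256):
    """Map stage: cut A and B into sub-matrices and emit partial products,
    keyed by the (row, column) offset of the result block in C."""
    n, k = A.shape
    _, m = B.shape
    for i in range(0, n, block):
        for j in range(0, m, block):
            for t in range(0, k, block):
                part = A[i:i+block, t:t+block] @ B[t:t+block, j:j+block]
                yield (i, j), part

def reduce_blocks(partials, shape):
    """Reduce stage: sum the partial products arriving at the same offset,
    assembling C = A @ B."""
    C = np.zeros(shape)
    for (i, j), part in partials:
        C[i:i+part.shape[0], j:j+part.shape[1]] += part
    return C

# usage sketch
A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = reduce_blocks(map_blocks(A, B), (512, 512))
assert np.allclose(C, A @ B)
```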
In terms of index structure improvements, the index structure of the present invention comprises a base index, an increment index, a differential index, and a dictionary. Based on this structure, a file storage strategy sorted by word frequency is adopted, together with compressed encoding of the data in each index entry. For this index structure, corresponding procedures for index construction, update, deletion, and retrieval under the cloud platform are designed.
The index structure consists of the increment index, the differential index, and the dictionary running on the NameNode of HDFS, and the base index running on the DataNodes. The base index consists of multiple documents; each document consists of multiple index entries; each index entry contains one search term and multiple file lists. A file list consists of a file ID together with the number and positions of occurrences of the term in that file. The files are stored in HDFS and, after being split into blocks, are distributed over different DataNodes in the cluster. The differential index is stored on the NameNode of HDFS. An invalid-document list records the IDs of documents to be deleted from all document libraries; when responding to a retrieval request, the system filters out of the retrieval results the files corresponding to all IDs in this list. A differential index is generated when there are files to be deleted from the document library; when the differential index file reaches a certain threshold, the system updates the base index according to it. The increment index runs on each DataNode of HDFS and stores newly added index entries. When a new file is added to the library, an independent index is generated for that file; the independent index records, in one document, the occurrence information of all search terms appearing in that document. When the number of independent indexes reaches a certain threshold, a merge process is triggered that merges all independent indexes into a whole, i.e. one increment index. When the number of increment indexes reaches a certain threshold, a merge operation is also triggered that merges multiple increment indexes into the base index. In the index structure, the words in the dictionary are divided by frequency into high-frequency and low-frequency words, and words of different frequencies are stored in different ways in the memory of the HDFS NameNode, organized as hash tables. For a high-frequency word, its index is stored in an independent file; for low-frequency words, the entries of multiple terms are stored in the same file. A hash function maps the index information of each high-frequency word to an independent file, and maps the index information of multiple low-frequency words into one file. During retrieval, distributed retrieval is carried out in different ways according to the frequency attribute of the word.
The structure of an index entry is as follows:
<wordid, docnum, <docid, freq, <p1, p2, ..., pn>>, ..., <docid, freq, <p1, p2, ..., pn>>>
wordid denotes the search term, and docnum denotes that the term wordid has appeared in docnum documents. It is followed by docnum Doclists; each Doclist records the number and positions of occurrences of wordid in one file. docid is the ID of a document in which the term appears, freq is the number of times the term appears in that file, and it is followed by a list of length freq giving the position of each occurrence of the term.
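The entry layout can be mirrored by the following data structures; the class and field names are illustrative, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Doclist:
    docid: int                 # ID of a document containing the term
    freq: int                  # number of occurrences in that document
    positions: List[int] = field(default_factory=list)   # p1 .. p_freq

@dataclass
class IndexEntry:
    wordid: str                # the search term
    doclists: List[Doclist] = field(default_factory=list)

    @property
    def docnum(self) -> int:
        """Number of documents the term appears in."""
        return len(self.doclists)
```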
Two monotonically increasing lists exist in the above structure:
The increasing document ID list: (doc1, doc2, doc3, ..., docn).
The increasing position list: (p1, p2, p3, ..., pn).
The first item of each sequence is retained, and every later item is replaced by its difference from the previous item. After processing, the sequences are:
Document ID difference sequence: (doc1, doc2 − doc1, doc3 − doc2, ..., docn − docn-1).
Position difference sequence within a single file: (p1, p2 − p1, p3 − p2, ..., pn − pn-1).
After this transformation the value space of the sequences is smaller, so a more efficient compression method can be used.
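A minimal sketch of this difference transform and its inverse; the subsequent compression codec (e.g. variable-byte coding) is not shown.

```python
def delta_encode(seq):
    """Keep the first item; replace every later item with its gap to the previous one."""
    if not seq:
        return []
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

def delta_decode(deltas):
    """Inverse transform: a running sum restores the original increasing sequence."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# usage: document IDs (doc1..docn) and positions (p1..pn) are both handled this way
assert delta_decode(delta_encode([3, 7, 12, 30])) == [3, 7, 12, 30]
```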
After the search engine has collected a certain number of web pages into the local text library, the initial index needs to be built. Before building the index, the captured texts must be preprocessed; the most important step is word segmentation. The basic process of building the index is as follows:
Step 1. Segment the captured texts, then store the segmented files on the DataNodes of HDFS.
Step 2. Create a MapReduce job processor responsible for controlling the execution of the index-building job.
Step 3. The job processor creates map tasks as needed; the map tasks run on the DataNodes, process the documents in their respective blocks, and produce intermediate files.
Step 4. After the map tasks have finished, multiple reduce tasks start executing; they merge the intermediate files produced by the map tasks and generate the index files.
Step 5. Sort the search terms by their frequency attribute and word frequency in the dictionary, and use the following hash function to save the indexes of different terms into different files.
The hash function is expressed as SHA1(f(r)): the word-frequency rank of a low-frequency word is taken as the input of the function, and the output is the name of the file that stores the index of that word. Here r is the word-frequency rank of the term in the dictionary, and f(r) maps multiple terms to one positive integer according to their word-frequency rank; it is expressed as:
f(r) = r − r mod (10·α),  if N < r ≤ 10N
f(r) = r − r mod (20·α),  if 10N < r ≤ 50N
f(r) = r − r mod (40·α),  if 50N < r
N is a threshold: terms whose rank in the dictionary is greater than N are low-frequency words, and terms whose rank is less than or equal to N are high-frequency words. α is a predefined positive real parameter.
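A sketch of the file-name mapping reconstructed from the definition above; the values of N and α, the integer rounding of the bucket widths, and the use of a hex digest as the file name are assumptions.

```python
import hashlib

N = 1000          # assumed high/low-frequency rank boundary
ALPHA = 5         # assumed positive real parameter α

def f(r):
    """Map a low-frequency term's word-frequency rank r (assumed r > N)
    to a positive integer, grouping neighbouring ranks into the same bucket."""
    if N < r <= 10 * N:
        return r - r % int(10 * ALPHA)
    if 10 * N < r <= 50 * N:
        return r - r % int(20 * ALPHA)
    return r - r % int(40 * ALPHA)        # 50N < r

def index_filename(term, rank, high_frequency):
    """High-frequency words hash on the term itself (one file per word);
    low-frequency words hash on f(rank), so several terms share one file."""
    key = term if high_frequency else str(f(rank))
    return hashlib.sha1(key.encode("utf-8")).hexdigest() + ".idx"
```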
As the search engine continuously crawls Internet information, the index update process is as follows:
When the newly added texts in the text library reach a certain number, the system is triggered to analyze the texts and generate a corresponding number of independent increment indexes, which are stored in memory.
When the independent increment indexes in memory grow to a certain number, the independent-increment-index merge procedure is triggered; it merges the multiple entries of the same search term into one and saves them into the corresponding increment index file.
When the increment index files reach a certain number, the master-file merge procedure is triggered.
The operation that generates a single increment index is triggered by the web crawler: the texts newly added to the crawler system are analyzed, and the entries of all search terms in those texts are generated. This operation runs as MapReduce jobs on different DataNodes in HDFS, satisfying data locality as far as possible, and the generated single increment indexes are distributed across different DataNodes. The increment-index merge operation is triggered by the operation that generates single increment indexes: when the data in the memory of a DataNode reaches a threshold, this operation merges the single increment indexes into the increment index file in HDFS. The mapping from search term to increment index file name is done by the hash function described above. The merge of increment indexes into the base index is triggered by the single-increment-index and increment-index merge operations.
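A compact sketch of the three update triggers; the threshold values and the helper functions build_independent_index and merge_entries are placeholders, not part of the patent.

```python
INDEPENDENT_LIMIT = 64        # assumed: independent indexes held in memory before a merge
INCREMENT_LIMIT = 16          # assumed: increment index files before a base-index merge

class IndexUpdater:
    """Drives the crawl-time update path: new text -> independent index ->
    increment index file -> base index."""
    def __init__(self, base, hdfs):
        self.base, self.hdfs = base, hdfs
        self.independent = []                 # in-memory single-file indexes
        self.increment_files = []

    def on_new_text(self, doc_id, text):
        # build_independent_index is an assumed helper that indexes one document
        self.independent.append(build_independent_index(doc_id, text))
        if len(self.independent) >= INDEPENDENT_LIMIT:
            self._merge_independent()

    def _merge_independent(self):
        # merge_entries is an assumed helper that folds same-term entries into one
        inc = merge_entries(self.independent)
        self.increment_files.append(self.hdfs.write(inc))
        self.independent.clear()
        if len(self.increment_files) >= INCREMENT_LIMIT:
            self._merge_into_base()

    def _merge_into_base(self):
        for f in self.increment_files:
            self.base.merge(self.hdfs.read(f))
        self.increment_files.clear()
```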
The present invention carries out information deletion based on the differential index. The threshold P1 that triggers the deletion operation when increment indexes are merged is defined as:
P1 = N1/N2
where N1 is the number of deleted documents contained in the increment index and N2 is the number of elements in the invalid-document list of the differential index.
The threshold P2 that triggers the deletion operation when the increment index is merged into the base index is defined as:
P2 = N1/N3
where N1 is likewise the number of elements in the invalid-document list of the differential index, and N3 is the number of all documents in the document set.
When an increment-index merge operation occurs, the system first judges whether threshold P1 has been reached. If it has, all differential indexes are first read from the differential index file, all the information contained in these differential indexes is deleted from the increment index, and the increment index is then merged into the base index. If the threshold has not been reached, the increment index is merged directly into the base index. After merging into the base index, the system then judges whether threshold P2 has been reached; if it has, the information in all differential indexes is deleted from the base index.
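The P1/P2 trigger logic might be sketched as follows; the threshold values and the index object interfaces are placeholders, not part of the invention.

```python
P1_THRESHOLD = 0.2        # assumed trigger value for P1
P2_THRESHOLD = 0.2        # assumed trigger value for P2

def merge_increment_into_base(increment, base, differential):
    """Apply the differential (deletion) index around an increment-index merge,
    following the P1/P2 checks described above."""
    n1 = increment.count_deleted(differential.invalid_docs)
    n2 = len(differential.invalid_docs)
    if n2 and n1 / n2 >= P1_THRESHOLD:                    # P1 = N1 / N2
        increment.remove(differential.invalid_docs)       # purge before merging
    base.merge(increment)

    n1 = len(differential.invalid_docs)
    n3 = base.total_documents()
    if n3 and n1 / n3 >= P2_THRESHOLD:                    # P2 = N1 / N3
        base.remove(differential.invalid_docs)            # purge the base index
        differential.clear()                              # this period's deletions are done
```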
In this way, when the system responds to a retrieval request, it first obtains the retrieval result from the base index and the increment index, and then filters out the deleted document information according to the invalid-document list. After the differential index of a certain time period has been used to update all the information in the increment index and the base index, that period's differential index is deleted from the file, and all the document IDs it deleted are removed from the invalid-document list.
Index retrieval is the key operation by which the search engine system responds to a retrieval request. After the search engine receives a retrieval request, it analyzes and processes the input search term information and then retrieves the index with the processed search terms. The operating steps are described as follows, with a consolidated sketch after the step list:
Step 1. Look up the search term in the dictionary. First obtain, via the hash function, the position in memory of the term's basic information, then read the basic information and check its high/low-frequency field in the dictionary. If the term is a high-frequency word, go to step 2; if it is a low-frequency word, go to step 3.
Step 2. Call the hash function with the search term as input to obtain the name of the file that stores the index of the term.
Step 3. Call the hash function with the word-frequency rank of the search term as input to obtain the name of the file that stores the index of the term.
Step 4. Retrieve the base index according to the obtained HDFS file name. First create a task manager and execute a MapReduce job, running a certain number of Map tasks and Reduce tasks on each DataNode of the system. The Map tasks process the local file block data, extract data items from it, and save them as intermediate data, which is then passed to the Reduce tasks for merging. The output of the Reduce tasks is the initial retrieval result.
Step 5. Using the same method as step 4, perform MapReduce retrieval on the increment index files, and merge the results with the initial result obtained in the previous step.
Step 6. Retrieve the independent increment indexes from memory and merge the retrieval result with the result obtained in the previous step.
Step 7. Retrieve the differential index on the NameNode and, according to the invalid-document list, filter the result obtained in the previous step.
Step 8. Sort all obtained document lists by a predefined algorithm.
Step 9. Return the sorted result.
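Putting the nine steps together, the retrieval dispatch could look like the sketch below, reusing the index_filename sketch given earlier; all index objects and their methods are placeholders for the distributed components described above, and results are assumed to be sets of document IDs.

```python
def retrieve(term, dictionary, base, increments, memory_increments, differential, rank_fn):
    """Steps 1-9: dictionary lookup, frequency-dependent file-name hashing,
    MapReduce retrieval of base and increment indexes, filtering, sorting."""
    info = dictionary.lookup(term)                                       # step 1
    if info.high_frequency:
        fname = index_filename(term, info.rank, high_frequency=True)     # step 2
    else:
        fname = index_filename(term, info.rank, high_frequency=False)    # step 3
    results = base.mapreduce_search(fname, term)                         # step 4
    for inc in increments:                                               # step 5
        results |= inc.mapreduce_search(fname, term)
    results |= memory_increments.search(term)                            # step 6
    results -= set(differential.invalid_docs)                            # step 7
    return sorted(results, key=rank_fn, reverse=True)                    # steps 8-9
```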
In summary, the present invention proposes a data search method based on data mining, which uses the feature representation of the search terms to understand user intent, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented by a general-purpose computing system; they may be concentrated in a single computing system or distributed over a network formed by multiple computing systems; optionally, they may be implemented with program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are only intended to exemplify or explain the principles of the present invention and are not to be construed as limiting it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the present invention shall be included within the protection scope of the present invention. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims (2)

1. A data search method based on data mining, characterized in that it comprises:
preprocessing the user's search terms, computing the inverse word frequency of the text vectors and the approximation degree matrix, obtaining a classification result for the search terms input by the user, and retrieving according to the classification result.
2. The method according to claim 1, characterized in that computing the inverse word frequency of the text vectors and the approximation degree matrix further comprises:
combining word segmentation characteristics with semantic weights drawn from domain knowledge: a word segmentation toolkit is deployed on all DataNodes; after segmentation, the terms of each text and the number of occurrences of each term in the document are obtained in order to count term frequency; the domain knowledge base is scanned and the positional weight and semantic span weight of each word are computed; the first Map stage then calls the semantic approximation function and accumulates the statistics; the first Reduce stage computes the inverse word frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic approximation result and, if the approximation degree exceeds a preset semantic threshold, applies the semantic weight, otherwise applies no weighting, then combines all weights to obtain the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the approximation degrees with texts whose numbers are greater than its own are computed, which is sufficient to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned; the NameNode distributes the partitioned data to the DataNodes and numbers the DataNodes while the approximation degrees of the texts are computed; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of each row is collected and the output matrix is assembled;
an optimized block-partitioned input scheme is then adopted and the high-dimensional matrix is divided into small sub-matrices for computation; the Map stage partitions the data, reads sub-matrices A and B according to the partition offset information and multiplies them locally, the result being one sub-matrix of matrix C, which is output to the specified offset position; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved, the computation on the symmetric tridiagonal matrix being ultimately converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the approximation degree of each text to a given cluster is computed.
CN201610600204.8A 2016-07-27 2016-07-27 Data search method based on data mining Pending CN106021626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610600204.8A CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610600204.8A CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Publications (1)

Publication Number Publication Date
CN106021626A true CN106021626A (en) 2016-10-12

Family

ID=57113982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610600204.8A Pending CN106021626A (en) 2016-07-27 2016-07-27 Data search method based on data mining

Country Status (1)

Country Link
CN (1) CN106021626A (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴鹏 (Wu Peng): "Research on Key Technologies for Automatic Construction of a Domain Knowledge Base Based on Hadoop", China Master's Theses Full-text Database *
高扬 (Gao Yang): "Research and Application of Short Text Classification in Search Engines", China Master's Theses Full-text Database *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528540A (en) * 2016-12-16 2017-03-22 广州索答信息科技有限公司 Word segmentation method and word segmentation system for seed questions
CN107368464A (en) * 2017-07-28 2017-11-21 深圳数众科技有限公司 A kind of method and device for obtaining bid product information
CN108415903A (en) * 2018-03-12 2018-08-17 武汉斗鱼网络科技有限公司 Judge evaluation method, storage medium and the equipment of search intention identification validity
EP4191434A1 (en) * 2019-04-19 2023-06-07 Fujitsu Limited Identification method, generation method, dimensional compression method, display method, and information processing device
CN111611293A (en) * 2020-04-24 2020-09-01 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111611293B (en) * 2020-04-24 2023-09-29 太原太工天宇教育科技有限公司 Outlier data mining method based on feature weighting and MapReduce
CN111639099A (en) * 2020-06-09 2020-09-08 武汉虹旭信息技术有限责任公司 Full-text indexing method and system
CN112686313A (en) * 2020-12-31 2021-04-20 江西理工大学 Improved parallel deep forest classification method based on information theory
CN112686313B (en) * 2020-12-31 2022-05-13 韶关学院 Improved parallel deep forest classification method based on information theory

Similar Documents

Publication Publication Date Title
CN106021626A (en) Data search method based on data mining
CN106156357A (en) Text data beam search method
US20070294223A1 (en) Text Categorization Using External Knowledge
US9305100B2 (en) Object oriented data and metadata based search
EP2041669B1 (en) Text categorization using external knowledge
US8131684B2 (en) Adaptive archive data management
US7783626B2 (en) Pipelined architecture for global analysis and index building
CN106202552A (en) Data search method based on cloud computing
CN107729336A (en) Data processing method, equipment and system
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
US20110264997A1 (en) Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text
CN107844493B (en) File association method and system
Wu et al. Efficient near-duplicate detection for q&a forum
CN106503274A (en) A kind of Data Integration and searching method and server
CN103744954A (en) Word relevancy network model establishing method and establishing device thereof
Uma et al. Noise elimination from web pages for efficacious information retrieval
US9552415B2 (en) Category classification processing device and method
CN108170799A (en) A kind of Frequent episodes method for digging of mass data
Chopra et al. A survey on improving the efficiency of different web structure mining algorithms
US10671810B2 (en) Citation explanations
Huang et al. Design a batched information retrieval system based on a concept-lattice-like structure
JP4189387B2 (en) Knowledge search system, knowledge search method and program
CN106934007B (en) Associated information pushing method and device
Rashmi et al. Deep web crawler: exploring and re-ranking of web forms
Caldeira et al. Experimental evaluation among reblocking techniques applied to the entity resolution

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161012