CN106021626A - Data search method based on data mining - Google Patents
- Publication number
- CN106021626A CN106021626A CN201610600204.8A CN201610600204A CN106021626A CN 106021626 A CN106021626 A CN 106021626A CN 201610600204 A CN201610600204 A CN 201610600204A CN 106021626 A CN106021626 A CN 106021626A
- Authority
- CN
- China
- Prior art keywords
- matrix
- text
- data
- approximation
- degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
Abstract
The invention provides a data search method based on data mining. The method comprises: preprocessing the user's search terms, computing text-vector inverse document frequency and a similarity matrix, obtaining a classification result for the search terms entered by the user, and retrieving according to the classification result. The method infers and interprets the user's intent from the features of the search terms, improves search-engine efficiency through an improved indexing mechanism, and enhances the user experience.
Description
Technical field
The present invention relates to data search, and in particular to a data search method based on data mining.
Background art
With the development of the Internet, the volume of data that search engines must process keeps growing, and the performance requirements placed on them rise accordingly. Latent user intent can be mined efficiently from users' search terms: by classifying the terms, the system can understand which domain the user wants to search; based on that intent and the characteristics of the domain, it can recommend applications that meet the user's needs and optimize the search results. To improve retrieval efficiency, information retrieval systems now generally rely on cloud-based parallel computing platforms and indexing techniques. The structural design, storage layout, retrieval operations, and dynamic update algorithms of the index have a great impact on retrieval efficiency. However, prior-art search engines do not process search terms adequately and, in particular, do not predict user intent effectively. Moreover, when records are added to the index they must be merged with identical entries of the existing index; in HDFS this operation requires re-creating blocks, and the replicas on every DataNode must be updated as well. Deleting content from the index can only be done in whole blocks, which generates a large amount of I/O traffic and wastes considerable time and resources.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a data search method based on data mining, comprising:
preprocessing the user's search terms, computing text-vector inverse document frequency and a similarity matrix, obtaining a classification result for the search terms entered by the user, and retrieving according to the classification result.
Preferably, computing the text-vector inverse document frequency and the similarity matrix further comprises:
combining the word-segmentation features with semantic weights drawn from domain knowledge, deploying a word-segmentation toolkit on every DataNode, and obtaining from segmentation the terms of each text together with the number of times each term occurs in that document, so as to count term frequencies; scanning the domain knowledge base to compute each word's position weight and semantic-span weight; the first Map stage then calls the semantic-similarity function and accumulates the counts; the Reduce stage computes the inverse document frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic-similarity result: if the similarity exceeds the preset semantic threshold, the semantic weight is returned, otherwise no weighting is applied, and all weights are then recombined into the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the similarities to texts with a higher number need to be computed to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned, the partitions are distributed by the NameNode to the DataNodes, the DataNodes are numbered, and the text similarities are computed in parallel; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of every row is collected and assembled into the output matrix;
an input scheme with optimized tiling is then used to partition the high-dimensional matrix into small-matrix operations; the Map stage tiles the data, reads submatrices A and B according to the tile offset information, and multiplies them locally, each result being one submatrix of matrix C written to the designated offset; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved for, so the symmetric-tridiagonal computation is finally converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the similarity of each text to a given set is computed.
Compared with the prior art, the present invention has the following advantage:
the present invention proposes a data search method based on data mining that infers user intent from the features of the search terms, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of the data search method based on data mining according to an embodiment of the present invention.
Detailed description of the invention
One or more embodiments of the present invention are described in detail below, together with the accompanying drawing illustrating the principles of the invention. The invention is described in conjunction with such embodiments, but it is not limited to any embodiment; its scope is limited only by the appended claims, and it covers many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention; these details are provided for exemplary purposes, and the invention may also be realized according to the claims without some or all of them.
One aspect of the present invention provides a data search method based on data mining. Fig. 1 is a flowchart of the method according to an embodiment of the invention.
For the search terms entered by the user, the present invention uses a pre-trained classifier to classify the expansion text of the terms, sends the classification result to the application system, identifies the attributes of the terms and the user's intent, obtains templates and data, and finally presents the result in the search engine in the form of a Web application.
The application system comprises:
Proxy server: shields internal system ports from external exposure, supports high concurrency and load balancing, and also helps improve access speed.
Web server: uses an MVC framework, calls each service, assembles the data, and finally integrates it with the style sheets.
Cache server: stores information with a low update frequency so that the database need not be queried repeatedly, relieving pressure on the database.
Attribute recognition service: identifies subcategories and user intent and selects an application template suited to the user's needs.
Per-domain resource databases: each domain maintains its own resource repository.
Database: uses a master/slave structure in which the master handles writes and the slaves handle reads, achieving read/write separation. The masters must additionally be synchronized with one another, and each slave must keep its data synchronized with the master.
Retrieval server: when a retrieval carries conditions or requires sorting, indexing techniques are used to obtain results quickly according to the search conditions.
For the classification of search terms, the present invention uses the search engine to expand each term into text: the term is expanded into the top k summary snippets of the search-engine results, where k is the number of entries after expansion, and the classification result of the expansion text is taken as the classification result of the original search text.
In the term-preprocessing stage, the search term is expanded and stored in vector form. Low-information results such as advertisements, pictures, and Web applications are discarded, web-page tags are filtered out, and the top k summaries are extracted as the expansion result of the text. The expansion text is then preprocessed, which consists of word segmentation and stop-word removal. Each summary, i.e. one document, contains a set of feature items and is represented as a vector. In the training stage, feature selection and feature weighting are performed according to the expansion information of the texts in the training set, and a classifier is built. Specifically, each search term in the training set is expanded, in the term-preprocessing stage, into the expansion text of that term; a feature-selection method is used to choose representative feature items, reducing the dimensionality of the large set of feature items; the improved inverse-document-frequency method of the present invention, adapted to text classification, performs the feature weighting; finally, a text-classification method builds the classifier for use in the test stage. In the classification stage, the terms to be classified are classified: each term is first expanded into its expansion text, the features extracted in the training stage are used for dimensionality reduction, the feature-weighting method represents the text as a feature vector, and finally the classifier optimized in the training stage classifies the feature vector of the expanded information; the result is the category of the search term.
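The expansion-then-classify pipeline above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: `search_fn`, the bag-of-words vectorizer, and the nearest-centroid decision rule are all assumed stand-ins for the expansion step, feature weighting, and trained classifier described in the text.

```python
import math
from collections import Counter

def expand(query, search_fn, k=5):
    """Hypothetical expansion step: fetch the top-k result snippets
    for the query and concatenate them into an 'expansion text'."""
    return " ".join(search_fn(query)[:k])

def vectorize(text):
    # simple bag-of-words stand-in for the weighted feature vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query, search_fn, centroids):
    """Assign the query to the class whose centroid vector is most
    similar to the vector of the expanded query text."""
    vec = vectorize(expand(query, search_fn))
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))
```

The key point mirrored here is that classification operates on the expansion text rather than on the short query itself.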
The classifier model is adjusted by a feedback-training algorithm. When the classifier judges which category a text belongs to, it also outputs a confidence value; if this value is below a threshold, the judgment is considered unreliable and the text is recorded into the feedback-training data set. User behavior logs are analyzed offline on a big-data platform, and the search terms of Web applications whose click volume falls below a predetermined threshold are also recorded into the feedback-training data set. These are then manually screened and labeled: a portion of the frequently accessed terms is placed directly into a high-frequency vocabulary, so that the next search can first be matched against the words in that vocabulary, while the remainder serves as the final data for the learning feedback. The vectors of the newly trained classifier are compared in turn against the vectors of the original classifier for similarity: if a similar vector is found, it is deleted from the original classifier and replaced with the new vector; if no similar vector is found, the new vector is added to the classifier.
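The replace-or-append vector update just described can be sketched as follows. This is a schematic reading of the text; the similarity function, the threshold value, and the vector representation are assumptions, not the patent's concrete choices.

```python
def update_classifier(old_vectors, new_vectors, sim_fn, threshold=0.8):
    """Feedback step sketched from the text: each new feedback vector
    replaces the most similar existing vector when the similarity
    reaches a threshold; otherwise it is appended as a new vector."""
    result = list(old_vectors)
    for nv in new_vectors:
        sims = [sim_fn(nv, ov) for ov in result]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            result[best] = nv      # replace the similar old vector
        else:
            result.append(nv)      # no similar vector found: add it
    return result
```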
The step in which the classifier optimized in the training stage performs text classification on the feature vectors of the expanded information is further divided into building the text-vector inverse document frequency and computing the similarity matrix. The present invention combines the word-segmentation features with semantic weights drawn from domain knowledge: a word-segmentation toolkit is deployed on every DataNode, and segmentation yields the terms of each text together with the number of occurrences of each term in that document, from which term frequencies are counted. The domain knowledge base is scanned to compute each word's position weight and semantic-span weight; the first Map stage then calls the semantic-similarity function and accumulates the counts. The Reduce stage computes the inverse document frequency of each word, and this information is output as the input of a second MapReduce job. The second Map stage computes the semantic-similarity result: if the similarity exceeds the preset semantic threshold, the semantic weight is returned; otherwise no weighting is applied, and all weights are then recombined into the final semantic weight. The second Reduce stage normalizes the vectors.
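The two-job flow above — term frequencies in the first job, inverse document frequency and normalized vectors in the second — can be illustrated with a small in-process simulation. The semantic weights, position weights, and domain knowledge base are deliberately omitted: this sketch covers only the plain TF-IDF and normalization skeleton, with a tiny stand-in corpus instead of segmented text on HDFS.

```python
import math
from collections import Counter, defaultdict

docs = {  # tiny stand-in corpus; real input would be segmented text on HDFS
    "d1": ["data", "mining", "search"],
    "d2": ["data", "search", "search"],
    "d3": ["index", "mining"],
}

# First job: emit ((term, doc), 1) in Map, sum into term frequency in Reduce.
tf = defaultdict(Counter)
for doc_id, words in docs.items():
    for w in words:
        tf[doc_id][w] += 1

# Second job input: per-term document frequency -> inverse document frequency.
df = Counter()
for doc_id in docs:
    for w in set(docs[doc_id]):
        df[w] += 1
n_docs = len(docs)
idf = {w: math.log(n_docs / df[w]) for w in df}

def tfidf_vector(doc_id):
    """TF-IDF weighting followed by per-document vector normalization
    (the role of the second Reduce stage in the text)."""
    vec = {w: tf[doc_id][w] * idf[w] for w in tf[doc_id]}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}
```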
For each text, only the similarities to texts with a higher number need to be computed to obtain the whole matrix. The parallel design in the MapReduce environment is as follows:
Map stage: the data is first partitioned and the partitions distributed by the NameNode to the DataNodes, which are numbered; the text similarities are computed in parallel, and when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of every row is collected and assembled into the output matrix.
An input scheme with optimized tiling is then used to partition the high-dimensional matrix into small-matrix operations. The Map stage tiles the data, reads submatrices A and B according to the tile offset information, and multiplies them locally; each result is one submatrix of matrix C, written to the designated offset. The Reduce stage computes the matrix C = AB. The symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations, and the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved for, so the symmetric-tridiagonal computation is finally converted into Laplacian matrix-vector multiplications. After the smallest eigenvalues and eigenvectors of the matrix are obtained, the similarity of each text to a given set is computed.
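The tiled multiplication C = AB described above can be sketched serially. Each innermost tile product corresponds to what one Map task would compute locally from its submatrices A and B, and the accumulation into C plays the role of the Reduce stage; the tile size and the NumPy representation are illustrative choices, not the patent's.

```python
import numpy as np

def blocked_matmul(A, B, block=2):
    """Multiply A @ B by splitting into `block`-sized tiles, mirroring
    the Map stage (per-tile local products) and the Reduce stage
    (summing partial tiles into C)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # each local product is what one Map task would emit
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```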
Regarding the improvement of the index structure, the index of the present invention consists of a base index, increment indexes, a differential index, and a dictionary. On top of this structure, a file-storage strategy sorted by word frequency and frequency rank is used, together with compressed encoding of the data items. For this index structure, corresponding procedures for building, updating, deleting, and retrieving the index under the cloud platform are designed.
The index structure consists of the increment indexes, the differential index, and the dictionary, which run on the NameNode of HDFS, and the base index, which runs on the DataNodes. The base index is made up of multiple documents; a document consists of multiple index entries; each index entry contains one search term and several file lists; a file list consists of a file ID together with the number of occurrences and the positions at which the term appears in that file. Files are saved in HDFS and, after being split into blocks, stored on different DataNodes in the cluster. The differential index is saved on the NameNode of HDFS. An invalid-document list records the IDs of all documents to be deleted from the document library; when responding to a retrieval request, the system filters out of the results every file whose ID appears in this list. The differential index is generated whenever files in the document library are to be deleted; when the differential-index file reaches a certain threshold, the system updates the base index according to it. The increment indexes run on each DataNode of HDFS and hold newly added index entries. When a new file enters the library, an independent index is generated for that file; an independent index records, for one document, the occurrence information of all search terms appearing in that document. When the number of independent indexes reaches a certain threshold, a merge is triggered that combines all independent indexes into a single whole, i.e. one increment index; when the number of increment indexes reaches a certain threshold, a further merge combines them into the base index. In the index structure, the words in the dictionary are divided by frequency into high-frequency and low-frequency words, which are stored in different ways in the memory of the HDFS NameNode and organized as a hash table. Each high-frequency word keeps its index in an independent file; for low-frequency words, the entries of multiple terms are saved in the same file. A hash function maps the index information of each high-frequency word to its own independent file and maps the index information of multiple low-frequency words into one shared file. During retrieval, distributed lookups proceed in different ways according to a word's frequency attribute.
The structure of an index entry is as follows:
<wordid, docnum, <docid, freq, <p1, p2, ..., pn>>, ..., <docid, freq, <p1, p2, ..., pn>>>
wordid denotes the search term, and docnum indicates that the term wordid has appeared in docnum documents. It is followed by docnum Doclists, each of which describes the occurrences of wordid in one file: docid is the ID of a document in which the term appears, freq is the number of occurrences of the term in that file, and it is followed by a list of length freq giving the position of each occurrence of the term.
Two increasing sequences exist in this structure:
the increasing document-ID sequence: (doc1, doc2, doc3, ..., docn);
the increasing position sequence: (p1, p2, p3, ..., pn).
The first item of each sequence is retained, and every later item is replaced by its difference from the preceding item. After this transformation the sequences become:
the document-ID difference sequence: (doc1, doc2 - doc1, doc3 - doc2, ..., docn - docn-1);
the within-file position difference sequence: (p1, p2 - p1, p3 - p2, ..., pn - pn-1).
After this transformation the sample space shrinks and more efficient compression methods can be applied.
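The gap transformation above can be sketched directly; the helper names are illustrative. It applies equally to sorted document-ID lists and within-file position lists, and decoding is a running sum over the gaps.

```python
def delta_encode(seq):
    """Keep the first element; replace each later element with its gap
    from the previous one (for sorted doc-ID or position lists)."""
    if not seq:
        return []
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]

def delta_decode(gaps):
    """Invert delta_encode with a cumulative sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

Because gaps between sorted IDs are much smaller than the IDs themselves, they fit into fewer bits, which is what makes the more efficient compression mentioned in the text possible.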
After the search engine has collected a certain number of web pages into the local text library, an initial index must be built. Before building the index, the captured texts must be preprocessed; the most important step is word segmentation. The basic index-building procedure is as follows:
Step 1. The captured texts are segmented, and the segmented files are stored on the DataNodes of HDFS.
Step 2. A MapReduce job handler is then created, responsible for controlling the execution of the index-building job.
Step 3. The job handler creates map tasks as needed; the map tasks run on the DataNodes, process the documents in each block, and produce intermediate files.
Step 4. After the map tasks complete, multiple reduce tasks start, merge the intermediate files produced by the map tasks, and generate the index files.
Step 5. According to each term's frequency attribute and its frequency rank in the dictionary, the following hash function is used to save the indexes of different terms into different files.
The hash function is expressed as SHA1(f(r)): the frequency rank of a low-frequency word is the input of the function, and the output is the name of the file that holds that word's index. Here r is the frequency rank of the term in the dictionary, and f(r) maps multiple search terms to one positive integer value according to their frequency rank.
N is a threshold: terms whose rank in the dictionary is greater than N are low-frequency words, and words whose rank is less than or equal to N are high-frequency words. α is a predefined positive real parameter.
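The file-naming scheme can be sketched as follows. Note that the exact piecewise definition of f(r) is not reproduced in the text (it appears only as a figure in the original patent), so the bucketing below — grouping low-frequency ranks with the parameter α — is purely an assumed placeholder, as are the values of N and α.

```python
import hashlib
import math

N = 1000       # rank threshold from the text: rank > N means low-frequency
ALPHA = 50.0   # the predefined positive real parameter alpha (value assumed)

def f(r):
    # Assumed placeholder for the patent's f(r), whose exact definition
    # is not given in the text: bucket low-frequency ranks so that
    # several terms map to one positive integer.
    return int(math.ceil((r - N) / ALPHA))

def index_filename(term, rank):
    """High-frequency words (rank <= N) get a file of their own, keyed
    by the term itself; low-frequency words share a file keyed by
    SHA1(f(r)), so many rare terms land in the same file."""
    if rank <= N:
        return hashlib.sha1(term.encode("utf-8")).hexdigest() + ".idx"
    return hashlib.sha1(str(f(rank)).encode("utf-8")).hexdigest() + ".idx"
```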
As the search engine continually crawls Internet information, the index update proceeds as follows:
when the newly added texts in the text library reach a certain number, the system is triggered to analyze them and generate a corresponding number of independent increment indexes, which are kept in memory;
when the independent increment indexes in memory grow to a certain number, the independent-increment-index merge procedure is triggered, which merges the multiple entries of identical terms into one and saves them into the corresponding increment-index file;
when the increment-index files reach a certain number, the master-file merge procedure is triggered.
The operation that generates a single increment index is triggered by the web crawler: newly fetched texts are analyzed, and the entries of all search terms in each text are generated. This operation runs as MapReduce on different DataNodes of HDFS, satisfying data locality as far as possible, and the generated single increment indexes are distributed across different DataNodes. The increment-index merge is triggered by the single-increment-index generation: when the data in a DataNode's memory reaches a threshold, this operation merges the single increment indexes into an increment-index file in HDFS; the mapping from search term to increment-index file name is performed by the hash function above. The merge of increment indexes into the base index is triggered by the single-increment-index and increment-index merge operations.
The present invention performs information deletion based on the differential index. The threshold P1 that triggers the deletion operation during an increment-index merge is defined as:
P1 = N1 / N2
where N1 is the number of deleted documents contained in the increment index and N2 is the number of elements in the invalid-document list of the differential index.
The threshold P2 that triggers the deletion operation when an increment index is merged into the base index is defined as:
P2 = N1 / N3
where N1 is likewise the number of elements in the invalid-document list of the differential index and N3 is the number of all documents in the document set.
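The two ratios can be expressed directly. One reading of the text (assumed here) is that each ratio is compared against a configured cutoff to decide whether deletions should be applied during a merge; the cutoff values themselves are not given in the text.

```python
def p1(n_deleted_in_increment, n_invalid_list):
    """P1 = N1 / N2: deletions covered by this increment index relative
    to the invalid-document list of the differential index."""
    return n_deleted_in_increment / n_invalid_list

def p2(n_invalid_list, n_docs_total):
    """P2 = N1 / N3: invalid documents relative to the whole document set."""
    return n_invalid_list / n_docs_total

def should_apply_deletions(ratio, cutoff):
    # Assumed decision rule: apply the deletions once the ratio
    # reaches the configured cutoff.
    return ratio >= cutoff
```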
When an increment-index merge occurs, the system first checks whether threshold P1 has been reached. If so, all differential indexes are first read from the differential-index file, all the information they cover is deleted from the increment index, and the result is then merged into the base index; if the threshold has not been reached, the merge into the base index proceeds directly. After the merge into the base index, the system checks whether threshold P2 has been reached; if so, the information in all differential indexes is deleted from the base index.
Thus, when the system answers a retrieval request, it first obtains results from the base index and the increment indexes, and then filters out the deleted documents according to the invalid-document list of the differential index. Once the differential index of a time period has been used to update all information in the increment indexes and the base index, that period's differential index is deleted from the file, and all document IDs it deleted are removed from the invalid-document list.
Index retrieval is the key operation by which the search-engine system answers retrieval requests. After receiving a request, the search engine analyzes and processes the input search terms and then retrieves the index with the processed terms. The procedure is as follows:
Step 1. Look up the term in the dictionary: first obtain, via the hash function, the position of the term's basic information in memory, then read that information and check its high/low-frequency field. If the word is high-frequency, go to step 2; if it is low-frequency, go to step 3.
Step 2. Call the hash function with the term as input to obtain the name of the file holding that term's index.
Step 3. Call the hash function with the term's frequency-rank information as input to obtain the name of the file holding that term's index.
Step 4. Retrieve the base index according to the obtained HDFS filename. First create a task manager to run the MapReduce job, which launches a number of map and reduce tasks on each DataNode of the system. Each map task processes its local file blocks, extracts the data items, and saves them as intermediate data, which is then passed to the reduce tasks for merging. The output of the reduce tasks is the initial retrieval result.
Step 5. Using the same method as step 4, run a MapReduce retrieval over the increment-index files and merge the results with the initial result of the previous step.
Step 6. Retrieve the independent increment indexes in memory and merge the results with the result of the previous step.
Step 7. Retrieve the differential index on the NameNode and, according to the invalid-document list, filter the result of the previous step.
Step 8. Sort all obtained document lists by a predefined algorithm.
Step 9. Return the sorted result.
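Steps 4 through 9 above can be condensed into a small sketch. The dictionary-of-lists indexes and the ranking function are illustrative stand-ins for the HDFS index files and the predefined sorting algorithm; the point shown is the union of the three index layers followed by the invalid-document filter and the final sort.

```python
def retrieve(term, base_idx, increment_idx, memory_idx, invalid_docs, rank_fn):
    """Condensed retrieval: union the postings from the base index,
    the increment indexes, and the in-memory independent indexes,
    drop documents on the invalid-document list (the differential-index
    filter), then sort by the ranking function."""
    hits = set()
    for index in (base_idx, increment_idx, memory_idx):
        hits |= set(index.get(term, []))
    hits -= set(invalid_docs)
    return sorted(hits, key=rank_fn)
```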
In summary, the present invention proposes a data search method based on data mining that infers user intent from the features of the search terms, improves the efficiency of the search engine through an improved indexing mechanism, and enhances the user experience.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be realized by a general-purpose computing system; they may be concentrated on a single computing system or distributed over a network formed by multiple computing systems, and they may alternatively be realized with program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention serve only to exemplify or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent substitution, improvement, and the like made without departing from the spirit and scope of the invention shall be included within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundary of the claims, or within the equivalents of that scope and boundary.
Claims (2)
1. A data search method based on data mining, characterized by comprising:
preprocessing the user's search terms, computing text-vector inverse document frequency and a similarity matrix, obtaining a classification result for the search terms entered by the user, and retrieving according to the classification result.
2. The method according to claim 1, characterized in that computing the text-vector inverse document frequency and the similarity matrix further comprises:
combining the word-segmentation features with semantic weights drawn from domain knowledge, deploying a word-segmentation toolkit on every DataNode, and obtaining from segmentation the terms of each text together with the number of times each term occurs in that document, so as to count term frequencies; scanning the domain knowledge base to compute each word's position weight and semantic-span weight; the first Map stage then calls the semantic-similarity function and accumulates the counts; the Reduce stage computes the inverse document frequency of each word and outputs this information as the input of a second MapReduce job; the second Map stage computes the semantic-similarity result: if the similarity exceeds the preset semantic threshold, the semantic weight is returned, otherwise no weighting is applied, and all weights are then recombined into the final semantic weight; the second Reduce stage normalizes the vectors;
for each text, only the similarities to texts with a higher number need to be computed to obtain the whole matrix; the parallel procedure in the MapReduce environment is:
Map stage: the data is first partitioned, the partitions are distributed by the NameNode to the DataNodes, the DataNodes are numbered, and the text similarities are computed in parallel; when the Map stage ends, the intermediate results are output locally as the input of the Reduce stage;
Reduce stage: the data of every row is collected and assembled into the output matrix;
an input scheme with optimized tiling is then used to partition the high-dimensional matrix into small-matrix operations; the Map stage tiles the data, reads submatrices A and B according to the tile offset information, and multiplies them locally, each result being one submatrix of matrix C written to the designated offset; the Reduce stage computes the matrix C = AB; the symmetric matrix is reduced to a symmetric tridiagonal matrix by orthogonal similarity transformations; the first K smallest eigenvectors of the symmetric tridiagonal matrix are solved for, so the symmetric-tridiagonal computation is finally converted into Laplacian matrix-vector multiplications; after the smallest eigenvalues and eigenvectors of the matrix are obtained, the similarity of each text to a given set is computed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610600204.8A CN106021626A (en) | 2016-07-27 | 2016-07-27 | Data search method based on data mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610600204.8A CN106021626A (en) | 2016-07-27 | 2016-07-27 | Data search method based on data mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021626A true CN106021626A (en) | 2016-10-12 |
Family
ID=57113982
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610600204.8A Pending CN106021626A (en) | 2016-07-27 | 2016-07-27 | Data search method based on data mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021626A (en) |
Non-Patent Citations (2)
Title |
---|
Wu Peng: "Research on Key Technologies for Automatic Construction of a Domain Knowledge Base Based on Hadoop", China Master's Theses Full-text Database * |
Gao Yang: "Research and Application of Short Text Classification in Search Engines", China Master's Theses Full-text Database * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106528540A (en) * | 2016-12-16 | 2017-03-22 | 广州索答信息科技有限公司 | Word segmentation method and word segmentation system for seed questions |
CN107368464A (en) * | 2017-07-28 | 2017-11-21 | 深圳数众科技有限公司 | A kind of method and device for obtaining bid product information |
CN108415903A (en) * | 2018-03-12 | 2018-08-17 | 武汉斗鱼网络科技有限公司 | Judge evaluation method, storage medium and the equipment of search intention identification validity |
EP4191434A1 (en) * | 2019-04-19 | 2023-06-07 | Fujitsu Limited | Identification method, generation method, dimensional compression method, display method, and information processing device |
CN111611293A (en) * | 2020-04-24 | 2020-09-01 | 太原太工天宇教育科技有限公司 | Outlier data mining method based on feature weighting and MapReduce |
CN111611293B (en) * | 2020-04-24 | 2023-09-29 | 太原太工天宇教育科技有限公司 | Outlier data mining method based on feature weighting and MapReduce |
CN111639099A (en) * | 2020-06-09 | 2020-09-08 | 武汉虹旭信息技术有限责任公司 | Full-text indexing method and system |
CN112686313A (en) * | 2020-12-31 | 2021-04-20 | 江西理工大学 | Improved parallel deep forest classification method based on information theory |
CN112686313B (en) * | 2020-12-31 | 2022-05-13 | 韶关学院 | Improved parallel deep forest classification method based on information theory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106021626A (en) | Data search method based on data mining | |
CN106156357A (en) | Text data beam search method | |
US20070294223A1 (en) | Text Categorization Using External Knowledge | |
US9305100B2 (en) | Object oriented data and metadata based search | |
EP2041669B1 (en) | Text categorization using external knowledge | |
US8131684B2 (en) | Adaptive archive data management | |
US7783626B2 (en) | Pipelined architecture for global analysis and index building | |
CN106202552A (en) | Data search method based on cloud computing | |
CN107729336A (en) | Data processing method, equipment and system | |
CN104376406A (en) | Enterprise innovation resource management and analysis system and method based on big data | |
US20110264997A1 (en) | Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text | |
CN107844493B (en) | File association method and system | |
Wu et al. | Efficient near-duplicate detection for q&a forum | |
CN106503274A (en) | A kind of Data Integration and searching method and server | |
CN103744954A (en) | Word relevancy network model establishing method and establishing device thereof | |
Uma et al. | Noise elimination from web pages for efficacious information retrieval | |
US9552415B2 (en) | Category classification processing device and method | |
CN108170799A (en) | A kind of Frequent episodes method for digging of mass data | |
Chopra et al. | A survey on improving the efficiency of different web structure mining algorithms | |
US10671810B2 (en) | Citation explanations | |
Huang et al. | Design a batched information retrieval system based on a concept-lattice-like structure | |
JP4189387B2 (en) | Knowledge search system, knowledge search method and program | |
CN106934007B (en) | Associated information pushing method and device | |
Rashmi et al. | Deep web crawler: exploring and re-ranking of web forms | |
Caldeira et al. | Experimental evaluation among reblocking techniques applied to the entity resolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161012 |