CN112445900A - Quick retrieval method and system - Google Patents
- Publication number
- CN112445900A (application CN201910809541.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- module
- text data
- retrieval
- quantization
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
Abstract
A quick retrieval method and system, comprising: S1, preprocessing the user question texts to be retrieved and performing text quantization on them; S2, constructing an n-dimensional space according to the quantization result; S3, randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane; S4, repeating step S3 until m points remain in the segmented region; S5, constructing a binary tree and establishing a binary tree index; S6, performing word segmentation, stop-word removal and text quantization on the search text; S7, traversing the binary tree down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity; and S8, searching the database based on the similar text data with the highest similarity to obtain its final answer, which is used as the answer to the search text.
Description
Technical Field
The invention belongs to the field of natural language processing information retrieval, and particularly relates to a quick retrieval method and a quick retrieval system.
Background
With the rapid development of internet technology, data has become an important carrier for information dissemination. In the field of human-machine dialogue, traditional methods for retrieving from massive data have high time complexity and cannot meet the needs of scenarios with strict real-time requirements, so constructing a quick retrieval method is very important.
Disclosure of Invention
To address the problems and defects in the prior art, the invention provides a novel quick retrieval method and a novel quick retrieval system.
The invention solves the technical problem through the following technical scheme:
The invention provides a quick retrieval method, characterized by comprising the following steps:
S1, preprocessing massive user question texts to be retrieved, and performing text quantization on the preprocessed user question texts (representing each text numerically), wherein the preprocessing comprises word segmentation and stop-word removal;
S2, constructing an n-dimensional space according to the text quantization result, wherein n ≥ 100;
S3, randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane;
S4, repeating step S3 until m points are left in the divided region, wherein m ≤ n;
S5, constructing a binary tree and establishing a binary tree index;
S6, performing word segmentation, stop-word removal and text quantization on the search text;
S7, traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity;
S8, searching the database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, wherein the final answer is used as the answer to the search text.
Preferably, in step S1, Word2vec is used to perform text quantization on the preprocessed user question text.
Preferably, in step S8, the database is a MySQL database.
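The preprocessing and text quantization of steps S1 and S6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the toy 3-dimensional embedding table, and the function names `preprocess` and `quantize` are all hypothetical, and averaging word vectors is only one common way to turn Word2vec word vectors into a text vector. A real system would load a trained Word2vec model with n ≥ 100 dimensions, and Chinese text would need a dedicated word segmenter rather than the regular expression used here.

```python
import re

# Hypothetical stop-word list and toy 3-dimensional embedding table; a real
# system would load a trained Word2vec model with n >= 100 dimensions.
STOP_WORDS = {"the", "a", "an", "is", "how", "do", "i", "my", "to"}
EMBEDDINGS = {
    "reset": [0.9, 0.1, 0.0],
    "password": [0.8, 0.2, 0.1],
    "refund": [0.0, 0.9, 0.3],
    "order": [0.1, 0.7, 0.6],
}

def preprocess(text):
    """Segment text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def quantize(tokens, dim=3):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

tokens = preprocess("How do I reset my password?")
vector = quantize(tokens)
```

The same two functions would be applied both to the stored question texts (S1) and to each incoming search text (S6), so that both end up in the same vector space.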
The invention also provides a quick retrieval system, characterized by comprising a processing quantization module, a space construction module, a plane construction module, a calling module, an index construction module, a preprocessing module, a calculation module and a retrieval module;
the processing quantization module is used for preprocessing massive user question texts to be retrieved and performing text quantization on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
the space construction module is used for constructing an n-dimensional space according to the text quantization result, wherein n ≥ 100;
the plane construction module is used for randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane;
the calling module is used for repeatedly calling the plane construction module until m points are left in the divided region, wherein m ≤ n;
the index construction module is used for constructing a binary tree and establishing a binary tree index;
the preprocessing module is used for performing word segmentation, stop-word removal and text quantization on the search text;
the calculation module is used for traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity;
the retrieval module is used for searching the database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, and the final answer is used as the answer to the search text.
Preferably, the processing quantization module is configured to perform text quantization on the preprocessed user question text by using Word2vec.
Preferably, the database is a MySQL database.
On the basis of common knowledge in the field, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the invention.
The positive effects of the invention are as follows:
For retrieval over massive data, the invention improves data retrieval speed by constructing a tree structure.
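One way to read step S3, offered as an interpretation rather than wording from the source: n points $p_1, \dots, p_n$ in general position in $\mathbb{R}^n$ lie on exactly one hyperplane. Its normal vector $w$ (a symbol introduced here for illustration) is the non-zero solution, unique up to scale, of $n-1$ linear equations, and the side of any point $x$ follows from a sign test:

$$
(p_i - p_1)\cdot w = 0 \quad (i = 2,\dots,n), \qquad w \neq 0
$$

$$
H = \{\, x \in \mathbb{R}^n : w\cdot(x - p_1) = 0 \,\}, \qquad \operatorname{side}(x) = \operatorname{sign}\big(w\cdot(x - p_1)\big)
$$

If each split divides a region roughly in half, a tree over $N$ texts with at most $m$ points per leaf has a depth of about $\log_2(N/m)$, so a query needs about that many sign tests plus at most $m$ similarity computations instead of $N$. Balanced splits are an assumption here; random hyperplanes do not guarantee them.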
Drawings
FIG. 1 is a flowchart illustrating a fast search method according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram of a fast search system according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, the embodiment provides a quick retrieval method, which includes the following steps:
Step 101, preprocessing massive user question texts to be retrieved, and performing text quantization on the preprocessed user question texts by using Word2vec, wherein the preprocessing comprises word segmentation and stop-word removal.
Step 102, constructing an n-dimensional space according to the text quantization result, wherein n ≥ 100.
Step 103, randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane.
Step 104, repeatedly executing step 103 until m points are left in the divided region, wherein m ≤ n.
Step 105, constructing a binary tree and establishing a binary tree index.
Step 106, performing word segmentation, stop-word removal and text quantization on the search text.
Step 107, traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity.
Step 108, searching the MySQL database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, wherein the final answer is used as the answer to the search text.
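The index construction and lookup of steps 102 to 107 can be sketched as follows. Note the substitution: instead of fitting a hyperplane through n randomly selected points as the text describes, this sketch uses a common simplification and splits each region with the perpendicular bisector of two randomly chosen points. Cosine similarity is likewise an assumption, since the source does not name a similarity measure. The function names, the dictionary-based node layout, and the 8-dimensional random test data are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def build(points, ids, leaf_size, rng, depth=0, max_depth=32):
    """Recursively split `ids` with random hyperplanes until a region is small."""
    if len(ids) <= leaf_size or depth >= max_depth:
        return {"leaf": ids}
    # Simplification: hyperplane = perpendicular bisector of two random points.
    a, b = (points[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, points[i]) < offset]
    right = [i for i in ids if dot(normal, points[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build(points, left, leaf_size, rng, depth + 1, max_depth),
            "right": build(points, right, leaf_size, rng, depth + 1, max_depth)}

def search(tree, points, query):
    """Descend to one leaf, then rank that leaf's candidates by cosine similarity."""
    node = tree
    while "leaf" not in node:
        side = "left" if dot(node["normal"], query) < node["offset"] else "right"
        node = node[side]
    return max(node["leaf"], key=lambda i: cosine(points[i], query))

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
tree = build(points, list(range(len(points))), leaf_size=8, rng=rng)
best = search(tree, points, points[17])  # query with a stored vector
```

Because the search inspects a single leaf, the result is approximate: the true nearest neighbour of a query may sit in a neighbouring region. Building several trees and merging their candidates is a common way to trade speed for recall, though the source does not describe it.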
As shown in FIG. 2, the embodiment further provides a quick retrieval system, which includes a processing quantization module 1, a space construction module 2, a plane construction module 3, a calling module 4, an index construction module 5, a preprocessing module 6, a calculation module 7, and a retrieval module 8.
The processing quantization module 1 is used for preprocessing massive user question texts to be retrieved, and performing text quantization on the preprocessed user question texts by using Word2vec, wherein the preprocessing comprises word segmentation and stop-word removal.
The space construction module 2 is configured to construct an n-dimensional space according to the text quantization result, where n ≥ 100.
The plane construction module 3 is used for randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane.
The calling module 4 is used for repeatedly calling the plane construction module 3 until m points are left in the divided region, wherein m ≤ n.
The index construction module 5 is used for constructing a binary tree and establishing a binary tree index.
The preprocessing module 6 is used for performing word segmentation, stop-word removal and text quantization on the search text.
The calculation module 7 is used for traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity.
The retrieval module 8 is configured to search the MySQL database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, and the final answer is used as the answer to the search text.
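The final lookup performed by the retrieval module can be sketched as a keyed query from the most similar stored question to its answer. To keep the example self-contained, an in-memory SQLite database from the Python standard library stands in for the MySQL database named in the text; the table name `faq`, its columns, the sample rows, and the function `lookup_answer` are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named in the text;
# the table name `faq` and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faq (question TEXT PRIMARY KEY, answer TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO faq (question, answer) VALUES (?, ?)",
    [
        ("how to reset password", "Open Settings and choose Reset password."),
        ("how to request a refund", "Submit a refund request from the order page."),
    ],
)

def lookup_answer(connection, best_question):
    """Return the stored answer for the most similar question, or None."""
    row = connection.execute(
        "SELECT answer FROM faq WHERE question = ?", (best_question,)
    ).fetchone()
    return row[0] if row else None

answer = lookup_answer(conn, "how to reset password")
```

Using a parameterized query keeps the lookup safe even if the stored question text contains quotes; the same pattern would apply with a MySQL client library.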
According to the invention, for mass data retrieval, the data retrieval speed is improved by constructing the tree structure.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.
Claims (6)
1. A quick retrieval method, characterized by comprising the following steps:
S1, preprocessing massive user question texts to be retrieved, and performing text quantization on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
S2, constructing an n-dimensional space according to the text quantization result, wherein n ≥ 100;
S3, randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane;
S4, repeating step S3 until m points are left in the divided region, wherein m ≤ n;
S5, constructing a binary tree and establishing a binary tree index;
S6, performing word segmentation, stop-word removal and text quantization on the search text;
S7, traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity;
S8, searching the database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, wherein the final answer is used as the answer to the search text.
2. The quick retrieval method of claim 1, wherein in step S1, text quantization is performed on the preprocessed user question text using Word2vec.
3. The quick retrieval method of claim 1, wherein in step S8, the database is a MySQL database.
4. A quick retrieval system, characterized by comprising a processing quantization module, a space construction module, a plane construction module, a calling module, an index construction module, a preprocessing module, a calculation module and a retrieval module;
the processing quantization module is used for preprocessing massive user question texts to be retrieved and performing text quantization on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
the space construction module is used for constructing an n-dimensional space according to the text quantization result, wherein n ≥ 100;
the plane construction module is used for randomly selecting n points in the n-dimensional space and, based on the n points, constructing a segmenting hyperplane of the n-dimensional space and a plane perpendicular to the segmenting hyperplane;
the calling module is used for repeatedly calling the plane construction module until m points are left in the divided region, wherein m ≤ n;
the index construction module is used for constructing a binary tree and establishing a binary tree index;
the preprocessing module is used for performing word segmentation, stop-word removal and text quantization on the search text;
the calculation module is used for traversing the binary tree structure down to a leaf node to obtain text data similar to the search text, calculating the similarity between the search text and each item of similar text data, and selecting the similar text data with the highest similarity;
the retrieval module is used for searching the database based on the similar text data with the highest similarity to obtain the final answer associated with that text data, and the final answer is used as the answer to the search text.
5. The quick retrieval system of claim 4, wherein the processing quantization module is configured to perform text quantization on the preprocessed user question text using Word2vec.
6. The quick retrieval system of claim 4, wherein the database is a MySQL database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910809541.1A CN112445900A (en) | 2019-08-29 | 2019-08-29 | Quick retrieval method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910809541.1A CN112445900A (en) | 2019-08-29 | 2019-08-29 | Quick retrieval method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112445900A | 2021-03-05 |
Family
ID=74741293
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201910809541.1A (CN112445900A) | Pending | 2019-08-29 | 2019-08-29 | Quick retrieval method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112445900A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1716256A (en) * | 2004-06-30 | 2006-01-04 | 微软公司 | Automated taxonomy generation |
CN105893362A (en) * | 2014-09-26 | 2016-08-24 | 北大方正集团有限公司 | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points |
CN106156154A (en) * | 2015-04-14 | 2016-11-23 | 阿里巴巴集团控股有限公司 | The search method of Similar Text and device thereof |
CN108536708A (en) * | 2017-03-03 | 2018-09-14 | 腾讯科技(深圳)有限公司 | A kind of automatic question answering processing method and automatically request-answering system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||