CN112445900A - Quick retrieval method and system - Google Patents

Info

Publication number
CN112445900A
Authority
CN
China
Prior art keywords
text
module
text data
retrieval
quantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910809541.1A
Other languages
Chinese (zh)
Inventor
李霞
陈怡�
刘凤余
王驹冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhuofan Information Technology Co ltd
Original Assignee
Shanghai Zhuofan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-08-29
Filing date
2019-08-29
Publication date
2021-03-05
Application filed by Shanghai Zhuofan Information Technology Co ltd
Priority to CN201910809541.1A
Publication of CN112445900A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems

Abstract

A quick retrieval method and system, comprising: S1, preprocessing the user question texts to be retrieved and performing text quantization; S2, constructing an n-dimensional space according to the quantization result; S3, randomly selecting n points in the n-dimensional space, and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane; S4, repeating step S3 until m points remain in the segmented region; S5, constructing a binary tree and establishing a binary tree index; S6, performing word segmentation, stop-word removal and text quantization on the search text; S7, traversing the binary tree structure down to the leaf nodes to obtain similar text data of the search text, calculating the similarity between the search text and each item of similar text data, and acquiring the similar text data with the highest similarity; and S8, searching the database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the search text.

Description

Quick retrieval method and system
Technical Field
The invention belongs to the field of information retrieval in natural language processing, and particularly relates to a quick retrieval method and a quick retrieval system.
Background
With the rapid development of internet technology, data has become an important carrier of information. In the field of man-machine conversation, traditional methods for retrieving from massive data have high time complexity and cannot meet the needs of scenarios with strict real-time requirements, so constructing a rapid retrieval method is very important.
Disclosure of Invention
Aiming at the problems and the defects in the prior art, the invention provides a novel rapid retrieval method and a novel rapid retrieval system.
The invention solves the above technical problem through the following technical solution:
The invention provides a quick retrieval method, characterized by comprising the following steps:
S1, preprocessing massive user question texts to be retrieved, and performing text quantization representation on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
S2, constructing an n-dimensional space according to the quantization result after text quantization, wherein n ≥ 100;
S3, randomly selecting n points in the n-dimensional space, and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane;
S4, repeating step S3 until m points are left in the divided region, wherein m is less than or equal to n;
S5, constructing a binary tree and establishing a binary tree index;
S6, performing word segmentation, stop-word removal and text quantization representation on the search text;
S7, traversing the binary tree structure down to the leaf nodes to obtain similar text data of the search text, calculating the similarity between the search text and each item of similar text data, and acquiring the similar text data with the highest similarity;
S8, searching the database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the search text.
Preferably, in step S1, Word2vec is used to perform text quantization on the preprocessed user question text.
Preferably, in step S8, the database is a MySQL database.
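The tree construction of steps S3 to S5 and the descent to a leaf in step S7 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed construction: instead of building a hyperplane from n randomly selected points, it uses the simpler random-projection-tree idea of picking two random points and splitting the region along their perpendicular bisector. The names build_tree and find_leaf, the dictionary node layout, and the 8-dimensional random demo vectors (the method itself calls for n ≥ 100) are all hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_tree(points, ids, m, rng):
    """Recursively split the region holding `ids` until at most m points remain."""
    if len(ids) <= m:
        return {"leaf": ids}
    # Assumption: two random points define the split (their perpendicular
    # bisector), standing in for the n-point hyperplane of step S3.
    i, j = rng.sample(ids, 2)
    p, q = points[i], points[j]
    normal = [a - b for a, b in zip(p, q)]
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    offset = dot(normal, midpoint)
    left = [k for k in ids if dot(normal, points[k]) <= offset]
    right = [k for k in ids if dot(normal, points[k]) > offset]
    if not left or not right:
        # Degenerate split (for example duplicate points): stop here.
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(points, left, m, rng),
        "right": build_tree(points, right, m, rng),
    }


def find_leaf(tree, query):
    """Walk from the root to a leaf, choosing a side at every hyperplane."""
    node = tree
    while "leaf" not in node:
        on_left = dot(node["normal"], query) <= node["offset"]
        node = node["left"] if on_left else node["right"]
    return node["leaf"]


# Demo data: 200 random 8-dimensional vectors standing in for quantized texts.
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
tree = build_tree(points, list(range(len(points))), m=5, rng=rng)
candidates = find_leaf(tree, points[17])
```

A stored vector follows the same comparisons at query time as at build time, so querying with points[17] lands in the leaf that holds index 17. Only the few vectors in that leaf then need an exact similarity comparison, which is where the speed-up over a full scan comes from.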
The invention also provides a quick retrieval system, characterized by comprising a processing quantization module, a space construction module, a plane construction module, a calling module, an index construction module, a preprocessing module, a calculation module and a retrieval module;
the processing quantization module is used for preprocessing massive user question texts to be retrieved and performing text quantization representation on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
the space construction module is used for constructing an n-dimensional space according to the quantization result after text quantization, wherein n ≥ 100;
the plane construction module is used for randomly selecting n points in the n-dimensional space and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane;
the calling module is used for repeatedly calling the plane construction module until m points are left in the divided region, wherein m is less than or equal to n;
the index construction module is used for constructing a binary tree and establishing a binary tree index;
the preprocessing module is used for performing word segmentation, stop-word removal and text quantization representation on the retrieval text;
the calculation module is used for traversing the binary tree structure down to the leaf nodes, obtaining similar text data of the retrieval text, calculating the similarity between the retrieval text and each item of similar text data, and acquiring the similar text data with the highest similarity;
the retrieval module is used for searching the database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the retrieval text.
Preferably, the processing quantization module is configured to perform text quantization on the preprocessed user question text by using Word2vec.
Preferably, the database is a MySQL database.
On the basis of common knowledge in the field, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the invention.
The beneficial effects of the invention are as follows:
For mass data retrieval, the invention improves data retrieval speed by constructing a tree structure.
Drawings
FIG. 1 is a flowchart illustrating a fast search method according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram of a fast search system according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in FIG. 1, the embodiment provides a fast search method, which includes the following steps:
Step 101, preprocessing massive user question texts to be retrieved, and performing text quantization representation on the preprocessed user question texts by using Word2vec, wherein the preprocessing comprises word segmentation and stop-word removal.
Step 102, constructing an n-dimensional space according to the quantization result after text quantization, wherein n ≥ 100.
Step 103, randomly selecting n points in the n-dimensional space, and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane.
Step 104, repeatedly executing step 103 until m points are left in the divided region, wherein m is less than or equal to n.
Step 105, constructing a binary tree and establishing a binary tree index.
Step 106, performing word segmentation, stop-word removal and text quantization representation on the search text.
Step 107, traversing the binary tree structure down to the leaf nodes, obtaining similar text data of the search text, calculating the similarity between the search text and each item of similar text data, and acquiring the similar text data with the highest similarity.
Step 108, searching the MySQL database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the search text.
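The preprocessing of steps 101 and 106 and the similarity ranking of step 107 can be sketched in Python as follows. This is a toy illustration under several assumptions: the stop-word list and the 3-dimensional embedding table are invented stand-ins for a trained Word2vec model with n ≥ 100 dimensions, word segmentation is reduced to whitespace splitting (Chinese text would need a real segmenter), a text vector is taken as the average of its word vectors, and cosine similarity is used because the embodiment does not name a specific similarity measure. The function names are hypothetical.

```python
import math

# Hypothetical stop-word list and toy embeddings (assumptions for illustration).
STOP_WORDS = {"how", "do", "i", "my", "the", "a", "to", "is", "what"}
EMBEDDINGS = {
    "reset": [0.9, 0.1, 0.0],
    "password": [0.8, 0.2, 0.1],
    "change": [0.7, 0.3, 0.0],
    "refund": [0.0, 0.9, 0.2],
    "order": [0.1, 0.8, 0.3],
    "cancel": [0.1, 0.7, 0.4],
}


def preprocess(text):
    """Segment the text into words and remove stop words."""
    tokens = text.lower().replace("?", " ").split()
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(tokens):
    """Average the vectors of known tokens; None if no token is known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def most_similar(query, candidates):
    """Return the candidate text whose vector is closest to the query vector."""
    qv = vectorize(preprocess(query))
    if qv is None:
        return None
    best, best_score = None, -1.0
    for text in candidates:
        cv = vectorize(preprocess(text))
        if cv is None:
            continue
        score = cosine(qv, cv)
        if score > best_score:
            best, best_score = text, score
    return best


# Candidates as they might come back from one leaf of the binary tree.
leaf_texts = ["How do I reset my password?", "How do I cancel my order?"]
best_match = most_similar("change password", leaf_texts)
```

Here best_match is the password question, since the averaged vector of "change password" points in nearly the same direction as that of "reset password". In a full pipeline the candidate list would be the contents of the leaf reached by the tree traversal rather than a hand-written list.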
As shown in FIG. 2, the embodiment further provides a fast retrieval system, which includes a processing quantization module 1, a space construction module 2, a plane construction module 3, a calling module 4, an index construction module 5, a preprocessing module 6, a calculation module 7, and a retrieval module 8.
The processing quantization module 1 is used for preprocessing massive user question texts to be retrieved, and performing text quantization representation on the preprocessed user question texts by using Word2vec, wherein the preprocessing comprises word segmentation and stop-word removal.
The space construction module 2 is configured to construct an n-dimensional space according to the quantization result after text quantization, where n ≥ 100.
The plane construction module 3 is used for randomly selecting n points in the n-dimensional space, and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane.
The calling module 4 is used for repeatedly calling the plane construction module 3 until m points are left in the divided region, wherein m is less than or equal to n.
The index construction module 5 is used for constructing a binary tree and establishing a binary tree index.
The preprocessing module 6 is used for performing word segmentation, stop-word removal and text quantization representation on the search text.
The calculation module 7 is used for traversing the binary tree structure down to the leaf nodes, obtaining similar text data of the search text, calculating the similarity between the search text and each item of similar text data, and acquiring the similar text data with the highest similarity.
The retrieval module 8 is configured to search the MySQL database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, and the final answer is used as the answer to the search text.
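The final lookup performed by the retrieval module 8 can be sketched as a keyed query. The embodiment names MySQL; to keep the sketch self-contained and runnable, it substitutes Python's built-in sqlite3 module with an in-memory database. The table name qa, its columns, the sample rows and the function name lookup_answer are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database of the embodiment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qa (question TEXT PRIMARY KEY, answer TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO qa (question, answer) VALUES (?, ?)",
    [
        ("How do I reset my password?", "Use the 'Forgot password' link."),
        ("How do I cancel my order?", "Open 'My orders' and choose 'Cancel'."),
    ],
)
conn.commit()


def lookup_answer(connection, question):
    """Return the stored answer for the best-matching question, or None."""
    row = connection.execute(
        "SELECT answer FROM qa WHERE question = ?", (question,)
    ).fetchone()
    return row[0] if row else None


# The argument would be the most similar stored question found by module 7.
answer = lookup_answer(conn, "How do I reset my password?")
```

The query is parameterized rather than built by string concatenation, which avoids quoting problems when question texts contain apostrophes or other special characters.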
For mass data retrieval, the invention improves data retrieval speed by constructing a tree structure.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (6)

1. A quick retrieval method, characterized by comprising the following steps:
S1, preprocessing massive user question texts to be retrieved, and performing text quantization representation on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
S2, constructing an n-dimensional space according to the quantization result after text quantization, wherein n ≥ 100;
S3, randomly selecting n points in the n-dimensional space, and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane;
S4, repeating step S3 until m points are left in the divided region, wherein m is less than or equal to n;
S5, constructing a binary tree and establishing a binary tree index;
S6, performing word segmentation, stop-word removal and text quantization representation on the search text;
S7, traversing the binary tree structure down to the leaf nodes to obtain similar text data of the search text, calculating the similarity between the search text and each item of similar text data, and acquiring the similar text data with the highest similarity;
S8, searching the database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the search text.
2. The quick retrieval method of claim 1, wherein in step S1, the preprocessed user question text is text-quantized using Word2vec.
3. The quick retrieval method according to claim 1, wherein in step S8, the database is a MySQL database.
4. A quick retrieval system, characterized by comprising a processing quantization module, a space construction module, a plane construction module, a calling module, an index construction module, a preprocessing module, a calculation module and a retrieval module;
the processing quantization module is used for preprocessing massive user question texts to be retrieved and performing text quantization representation on the preprocessed user question texts, wherein the preprocessing comprises word segmentation and stop-word removal;
the space construction module is used for constructing an n-dimensional space according to the quantization result after text quantization, wherein n ≥ 100;
the plane construction module is used for randomly selecting n points in the n-dimensional space and constructing, based on the n points, a segmentation hyperplane of the n-dimensional space and a plane perpendicular to the segmentation hyperplane;
the calling module is used for repeatedly calling the plane construction module until m points are left in the divided region, wherein m is less than or equal to n;
the index construction module is used for constructing a binary tree and establishing a binary tree index;
the preprocessing module is used for performing word segmentation, stop-word removal and text quantization representation on the retrieval text;
the calculation module is used for traversing the binary tree structure down to the leaf nodes, obtaining similar text data of the retrieval text, calculating the similarity between the retrieval text and each item of similar text data, and acquiring the similar text data with the highest similarity;
the retrieval module is used for searching the database based on the similar text data with the highest similarity to obtain the final answer corresponding to that similar text data, wherein the final answer is used as the answer to the retrieval text.
5. The quick retrieval system of claim 4, wherein the processing quantization module is adapted to perform text quantization on the preprocessed user question text using Word2vec.
6. The quick retrieval system of claim 4, wherein the database is a MySQL database.
CN201910809541.1A 2019-08-29 2019-08-29 Quick retrieval method and system Pending CN112445900A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910809541.1A CN112445900A (en) 2019-08-29 2019-08-29 Quick retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910809541.1A CN112445900A (en) 2019-08-29 2019-08-29 Quick retrieval method and system

Publications (1)

Publication Number Publication Date
CN112445900A true CN112445900A (en) 2021-03-05

Family

ID=74741293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910809541.1A Pending CN112445900A (en) 2019-08-29 2019-08-29 Quick retrieval method and system

Country Status (1)

Country Link
CN (1) CN112445900A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716256A (en) * 2004-06-30 2006-01-04 微软公司 Automated taxonomy generation
CN105893362A (en) * 2014-09-26 2016-08-24 北大方正集团有限公司 A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN106156154A (en) * 2015-04-14 2016-11-23 阿里巴巴集团控股有限公司 The search method of Similar Text and device thereof
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system

Similar Documents

Publication Publication Date Title
CN112800170A (en) Question matching method and device and question reply method and device
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN107943786B (en) Chinese named entity recognition method and system
CN111078837A (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN110597966A (en) Automatic question answering method and device
CN110781687B (en) Same intention statement acquisition method and device
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN112380319A (en) Model training method and related device
CN111581368A (en) Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network
CN114444507A (en) Context parameter Chinese entity prediction method based on water environment knowledge map enhancement relationship
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN117076693A (en) Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus
CN111709236A (en) Case similarity matching-based trial risk early warning method
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN117312500B (en) Semantic retrieval model building method based on ANN and BERT
CN111159334A (en) Method and system for house source follow-up information processing
CN111831792B (en) Electric power knowledge base construction method and system
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN114372454A (en) Text information extraction method, model training method, device and storage medium
US20210004603A1 (en) Method and apparatus for determining (raw) video materials for news
CN116431746A (en) Address mapping method and device based on coding library, electronic equipment and storage medium
CN112445900A (en) Quick retrieval method and system
CN114003707A (en) Problem retrieval model training method and device and problem retrieval method and device
CN113705192A (en) Text processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination