CN117591635A - Text segmentation retrieval method for large model question and answer - Google Patents
- Publication number
- CN117591635A CN117591635A CN202311759473.5A CN202311759473A CN117591635A CN 117591635 A CN117591635 A CN 117591635A CN 202311759473 A CN202311759473 A CN 202311759473A CN 117591635 A CN117591635 A CN 117591635A
- Authority
- CN
- China
- Prior art keywords
- text
- data
- segmentation
- word
- adopting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06F16/338 — Presentation of query results
- G06F16/35 — Clustering; Classification
- G06F40/247 — Thesauruses; Synonyms
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/048 — Activation functions
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of natural language processing and relates to a text segmentation retrieval method for large-model question answering, comprising the following steps: constructing a text database and an index; acquiring the text data to be retrieved; segmenting the text to be retrieved; embedding and word-segmenting the segmented text data, and preprocessing the segmented text, the preprocessing comprising stop-word removal, pinyin supplementation, and synonym supplementation; retrieving the preprocessed text from the text database using multiple retrieval modes; and rearranging the retrieval results with a rearrangement model to obtain the final result. The invention uses the ANN algorithm and the BM25 algorithm to compare and rank the embedded vectors of short sentences, the embedded vectors of paragraphs, and the paragraphs themselves, and then de-duplicates the three retrieval results by paragraph ID, which reduces the complexity of text retrieval and improves retrieval accuracy.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a text segmentation retrieval method for large-model questions and answers.
Background
Text retrieval, also known as natural language retrieval, refers to retrieving, classifying, and filtering a collection of texts based on their content, such as the terms and semantics they contain. Text retrieval, like image retrieval and speech retrieval, is a branch of information retrieval. The result of text retrieval is generally measured by two basic metrics: accuracy and recall. Accuracy is the ratio of retrieved relevant documents to all retrieved documents; recall is the ratio of retrieved relevant documents to the total number of relevant documents. How to improve the accuracy or recall of text retrieval is therefore a key problem that text retrieval needs to solve. In file processing, conventional technology handles only plain text or text recognized in pictures: non-textual picture information is lost, tables receive only simple text extraction, and the relationships implicit in a table are lost as well. In content retrieval, a single keyword or semantic search is weak at finding the more hidden content.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a text segmentation retrieval method for large model questions and answers, which comprises the following steps: constructing a text database and an index; acquiring text data to be retrieved; dividing the text to be retrieved; embedding and word segmentation are carried out on the segmented text data, and preprocessing is carried out on the segmented text, wherein the preprocessing comprises removal of stop words, pinyin supplement and synonym supplement; searching the preprocessed text by adopting a plurality of searching modes through a text database; and rearranging the search result by adopting a rearranging model to obtain a final search result.
Preferably, constructing the text database and the index includes: acquiring complete text data and performing segmentation processing on it; splitting the segmented paragraphs into short sentences at sentence-end symbols; recording the membership between each short sentence and its paragraph, generating a unique ID for each paragraph and keeping that ID with each of its short sentences; embedding the short sentences and the paragraphs to obtain their embedded vector representations; and storing the short sentences, the paragraphs, their embedded vectors, and the paragraph ID each short sentence belongs to in a database, and building indexes.
Further, the segmentation processing of the complete text data includes: identifying titles in the complete text and segmenting the text by title with a recursive segmentation algorithm; segmenting the body text; identifying picture information and table information in the segmented text and replacing each picture and table with text according to the recognized information; checking the word count of the replaced text data and, if it exceeds the word-count limit, performing repeated overlapping segmentation until the word count meets the requirement; and saving the text block once the word count meets the requirement.
Further, segmenting the title of the text with a recursive segmentation algorithm includes: setting a text length threshold for titles; obtaining the word count of the title of the text to be retrieved and comparing it with the threshold; if the title word count is not larger than the threshold, the title is not divided; if it is larger than the threshold, the title is divided to obtain a primary title and a remaining title word count; and comparing the remaining title word count with the threshold, stopping segmentation once the word count falls below the threshold.
Preferably, searching the preprocessed text by using a plurality of searching modes comprises: adopting an ANN algorithm to compare and sort the embedded vectors of the short sentences; adopting an ANN algorithm to compare and sort the embedded vectors of the paragraphs; adopting a BM25 algorithm to compare and sort the paragraphs; and de-duplicating the three search results according to the paragraph ID.
Further, comparing and ranking the embedded vectors of the short sentences with the ANN algorithm comprises the following steps:
step 1, constructing data, wherein the data comprises generated text data, and the text data comprises long sentences and short sentences;
step 2, mapping the constructed data into 1024-dimensional vectors through an embedding layer;
step 3, constructing a 20-layer data index over the 1024-dimensional data with HNSW; the index construction process comprises: setting the index construction parameters, including Euclidean distance as the distance metric, a maximum of 16 connections per node, and a layer multiplier of 2; traversing the construction data, randomly determining the layer position of each point, layering the data, and at each layer connecting each point to its nearest points to form a graph; each upper layer connects to the layer below through its nearest point in the lower layer; when 20 layers have been formed, the data index is obtained;
step 4, vectorizing the query data, namely mapping the query data to 1024 dimensions through an embedding layer;
step 5, searching the data index from the upper layer to the lower layer by adopting an approximate nearest neighbor searching algorithm, and obtaining a searching result by taking Euclidean distance as a measurement standard;
and 6, sorting the search results according to Euclidean distance measurement.
Further, the keyword comparison and ranking of paragraphs with the BM25 algorithm comprises: the keywords include Chinese characters and their corresponding pinyin; segmenting the segmented text into words according to a word-segmentation vocabulary; removing matched stop words according to a stop-word list; converting the remaining words into pinyin; counting the occurrences of each word; associating sentences with their words to form an inverted index and storing it in a database; when the user inputs a question sentence, performing the same word segmentation, stop-word removal, and pinyin conversion on it; matching the Chinese words and pinyin words of the question sentence against the database to find matching sentences, computing each word's score with the BM25 formula until all words are matched, and summing the per-word scores within a sentence to obtain the sentence's final score; and ranking by the final scores.
Preferably, rearranging the search result using the rearrangement model includes:
step 1, initial retrieval: retrieving related candidate documents with a lexical matching model and taking the top N as input;
step 2, preprocessing the candidate and query text, the preprocessing comprising word segmentation, lowercasing, stop-word removal, and the addition of BERT special symbols;
step 3, constructing the model input sequence: concatenating the query with each candidate, separated by "[SEP]", adding the [CLS] and [SEP] symbols, and building the input ID sequence;
step 4, feeding the sequence into a pre-trained BERT base model; after 12 Transformer encoder layers, the last layer outputs a hidden-state sequence;
step 5, extracting the query and candidate hidden states from their corresponding positions: the query state h_q and the candidate state h_c;
step 6, calculating the matching score: taking the inner product of h_q and h_c and adding parallel attention and local features to obtain the matching score;
step 7, carrying out normalization mapping on the score through a sigmoid function to obtain a rearrangement normalization weight;
step 8, reordering the candidate documents according to the rearranged normalized weights;
step 9, outputting the top N as the final result.
The invention has the beneficial effects that:
according to the invention, the ANN algorithm and the BM25 algorithm are adopted to respectively compare and sort the embedded vector of the short sentence, the embedded vector of the paragraph and the paragraph, and then the three search results are subjected to de-duplication processing according to the paragraph ID, so that the complexity of text search is reduced, and the search accuracy is improved; according to the invention, the title of the text is segmented by adopting a recursive segmentation algorithm, so that the accuracy of text classification is improved more accurately.
Drawings
FIG. 1 is a flow chart of a text segmentation search method of the present invention;
FIG. 2 is a flow chart of title segmentation according to the present invention;
FIG. 3 is a flowchart of a database construction method according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A text-segmentation retrieval method for large-model questions and answers, the method comprising: constructing a text database and an index; acquiring text data to be retrieved; dividing the text to be retrieved; embedding and word segmentation are carried out on the segmented text data, and preprocessing is carried out on the segmented text, wherein the preprocessing comprises removal of stop words, pinyin supplement and synonym supplement; searching the preprocessed text by adopting a plurality of searching modes through a text database; and rearranging the search result by adopting a rearranging model to obtain a final search result.
In this embodiment, as shown in fig. 3, constructing the text database and the index includes: acquiring complete text data, and performing segmentation processing on the complete text data; dividing the divided sentences into short sentences by adopting sentence end symbols; obtaining the membership of the split short sentence and paragraph, generating a unique ID in each paragraph, and keeping the ID in the sub phrase; embedding the phrases and the paragraphs to obtain embedded vector representations of the phrases and the embedded vector representations of the paragraphs; storing the phrases, paragraphs, embedded vectors of the phrases, embedded vectors of the paragraphs and paragraph IDs of the membership of the phrases into a database, and establishing indexes.
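The database-construction steps above can be sketched in Python. This is a minimal illustration under stated assumptions: sentence-end symbols are matched with a simple regex over common Chinese and Western punctuation, paragraph IDs come from `uuid`, and the embedding and index-building steps are omitted:

```python
import re
import uuid

def build_records(paragraphs):
    """Split each paragraph into short sentences at sentence-end symbols
    and tag every short sentence with its paragraph's unique ID."""
    records = []
    for para in paragraphs:
        para_id = str(uuid.uuid4())  # one unique ID per paragraph
        # split on Chinese and Western sentence-end punctuation
        sentences = [s.strip() for s in re.split(r"[。！？.!?]", para) if s.strip()]
        records.append({"paragraph": para,
                        "paragraph_id": para_id,   # kept with every sentence
                        "sentences": sentences})
    return records
```

Each record preserves the sentence-to-paragraph membership, so the three retrieval paths can later be de-duplicated on `paragraph_id` as described.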
An embodiment of a text segmentation retrieval method for large-model question answering, as shown in fig. 1, includes: obtaining text data, segmenting it, and splitting the segmented text to obtain long sentences and short sentences; vectorizing the long and short sentences and storing the vectors in the database; acquiring the text to be retrieved and segmenting it; embedding and word-segmenting the segmented text data; removing stop words and supplementing pinyin and synonyms for the segmented words; comparing and ranking the paragraphs with the BM25 algorithm; comparing and ranking the embedded vectors of the short sentences with the ANN algorithm; comparing and ranking the embedded vectors of the paragraphs with the ANN algorithm; de-duplicating the three retrieval results by paragraph ID; and rearranging the retrieval results with the rearrangement model to obtain the final result.
As shown in fig. 2, the segmentation processing of the complete text data includes: identifying titles in the complete text and segmenting the text by title with a recursive segmentation algorithm; segmenting the body text; identifying picture information and table information in the segmented text and replacing each picture and table with text according to the recognized information; checking the word count of the replaced text data and, if it exceeds the word-count limit, performing repeated overlapping segmentation until the word count meets the requirement; and saving the text block once the word count meets the requirement.
Segmenting the title of the text with a recursive segmentation algorithm includes: setting a text length threshold for titles; obtaining the word count of the title of the text to be retrieved and comparing it with the threshold; if the title word count is not larger than the threshold, the title is not divided; if it is larger than the threshold, the title is divided to obtain a primary title and a remaining title word count; and comparing the remaining title word count with the threshold, stopping segmentation once the word count falls below the threshold.
In this embodiment, the text segmentation method aims to be compatible with the input length constraints of the large language model and the text embedding model, limiting each text block to at most 500 words. Titles in the original text are identified with algorithms and natural language processing techniques, and the text is segmented recursively according to title and text length constraints. If a primary title meets the word-count requirement, processing moves to the next step; if not, the secondary titles are split, and so on. Segmentation stops when no title remains or the word-count requirement is met. Within each segmented text block, image recognition and table recognition techniques identify the tables and pictures it contains, which are then processed and replaced accordingly; for example, a picture is saved to a picture service and replaced with its new URL. The replaced text blocks are checked against the word-count limit (at most 500 words per block): a block within the limit is saved directly, while a block over the limit undergoes repeated overlapping segmentation, split and adjusted according to a specified overlap percentage until it meets the word-count requirement.
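The recursive, overlapping segmentation described above can be sketched in Python. This is a minimal illustration under stated assumptions: words are counted by whitespace splitting (for Chinese text one would count characters instead), and the overlap fraction parameter is hypothetical, standing in for the patent's "specified repeated portion percentage":

```python
def split_block(text, limit=500, overlap=0.1):
    """Recursively split `text` into blocks of at most `limit` words,
    carrying a fixed fraction of overlap between adjacent blocks."""
    words = text.split()
    if len(words) <= limit:
        return [text]  # already within the word-count limit: save directly
    # advance by less than `limit` so consecutive blocks share an overlap
    step = max(1, int(limit * (1 - overlap)))
    blocks = []
    for start in range(0, len(words), step):
        blocks.append(" ".join(words[start:start + limit]))
        if start + limit >= len(words):
            break  # the last block has consumed the remaining words
    return blocks
```

For example, a 1200-word text with `limit=500` and 10% overlap yields three blocks, each no longer than 500 words and sharing 50 words with its neighbour.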
In this embodiment, searching the preprocessed text by using multiple searching methods includes: adopting an ANN algorithm to compare and sort the embedded vectors of the short sentences; adopting an ANN algorithm to compare and sort the embedded vectors of the paragraphs; adopting a BM25 algorithm to compare and sort the paragraphs; and de-duplicating the three search results according to the paragraph ID.
Specifically, the keyword comparison and ranking of paragraphs with the BM25 algorithm proceeds as follows. The keywords include Chinese characters and their corresponding pinyin, and both can be matched. First, the segmented text is split into words according to a word-segmentation vocabulary (a public vocabulary plus business accumulation); matched stop words are removed according to a stop-word list (likewise public plus business accumulation); the remaining words are converted to pinyin; the occurrences of each word are counted; and sentences are associated with their words to form an inverted index (linking each word back to its original sentences), which is stored in a database. When the user inputs a question, it undergoes the same word segmentation, stop-word removal, and pinyin conversion. The Chinese words and pinyin words are matched against the database to find matching sentences, and each word's score is computed with the BM25 formula (with parameter values k1=2, b=0.75) until all words in the question are matched and scored; different words may match the same sentence, in which case the sentence's final score is the accumulation of the individual word scores. Results are sorted by score, highest first.
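The per-word scoring and accumulation described above can be sketched with the standard Okapi BM25 formula, using the parameter values given in the text (k1=2, b=0.75). This is a minimal sketch that assumes documents arrive already tokenized (word segmentation, stop-word removal, and pinyin conversion are done upstream):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=2.0, b=0.75):
    """Score each tokenized document against the query terms with BM25;
    a document's score is the sum of its matched terms' scores."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                    # term frequency in this document
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Sorting documents by these scores in descending order gives the ranking; a document matched by several query terms accumulates their individual contributions, mirroring the repeated-match accumulation described above.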
Comparing and ranking the embedded vectors of the short sentences with the ANN algorithm comprises the following steps:
step 1, constructing data, wherein the data comprises generated text data, and the text data comprises long sentences and short sentences;
step 2, mapping the constructed data into 1024-dimensional vectors through an embedding layer;
step 3, constructing a 20-layer data index over the 1024-dimensional data with HNSW; the index construction process comprises: setting the index construction parameters, including Euclidean distance as the distance metric, a maximum of 16 connections per node, and a layer multiplier of 2; traversing the construction data, randomly determining the layer position of each point, layering the data, and at each layer connecting each point to its nearest points to form a graph; each upper layer connects to the layer below through its nearest point in the lower layer; when 20 layers have been formed, the data index is obtained;
step 4, vectorizing the query data, namely mapping the query data to 1024 dimensions through an embedding layer;
step 5, searching the data index from the upper layer to the lower layer by adopting an approximate nearest neighbor searching algorithm, and obtaining a searching result by taking Euclidean distance as a measurement standard;
and 6, sorting the search results according to Euclidean distance measurement.
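The search-and-sort steps above (steps 5-6) reduce to ranking stored embeddings by Euclidean distance to the query vector. The brute-force scan below is a deliberately simplified stand-in for the HNSW layer-by-layer search: a production index (e.g. the hnswlib library with 16 connections per node) would return approximately the same nearest neighbours far faster, but the ranking criterion is the same:

```python
import math

def euclidean(a, b):
    """Euclidean distance, the metric named in the index parameters."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query_vec, indexed, top_k=3):
    """Rank stored embedding records by distance to the query vector
    and return the top_k closest, i.e. the sorted search result."""
    ranked = sorted(indexed, key=lambda item: euclidean(query_vec, item["vec"]))
    return ranked[:top_k]
```

Here `indexed` is assumed to be a list of records each holding an embedded vector under the hypothetical key `"vec"`, corresponding to the 1024-dimensional sentence or paragraph embeddings stored in the database.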
In this embodiment, rearranging the search result using the rearrangement model includes:
step 1, initial retrieval: retrieving related candidate documents with a lexical matching model and taking the top N as input;
step 2, preprocessing the candidate and query text, the preprocessing comprising word segmentation, lowercasing, stop-word removal, and the addition of BERT special symbols;
step 3, constructing the model input sequence: concatenating the query with each candidate, separated by "[SEP]", adding the [CLS] and [SEP] symbols, and building the input ID sequence;
step 4, feeding the sequence into a pre-trained BERT base model; after 12 Transformer encoder layers, the last layer outputs a hidden-state sequence;
step 5, extracting the query and candidate hidden states from their corresponding positions: the query state h_q and the candidate state h_c;
step 6, calculating the matching score: taking the inner product of h_q and h_c and adding parallel attention and local features to obtain the matching score;
step 7, carrying out normalization mapping on the score through a sigmoid function to obtain a rearrangement normalization weight; in the mapping process, the weight is mapped to a value of 0-1, so that the calculation efficiency is improved; the formula for normalized mapping of the sigmoid function is as follows:
w=1/(1+e^(-score))
step 8, reordering the candidate documents according to the rearranged normalized weights;
step 9, outputting the top N as the final result.
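Steps 7-9 (sigmoid normalization and reordering) can be sketched directly from the formula w = 1/(1 + e^(-score)) given above. A minimal sketch, assuming each candidate is a record holding a raw matching score under the hypothetical key `"score"`:

```python
import math

def rerank(candidates):
    """Map each raw matching score into (0, 1) with a sigmoid, then
    reorder the candidates by the normalized weight (steps 7-9)."""
    for c in candidates:
        c["weight"] = 1.0 / (1.0 + math.exp(-c["score"]))  # w = sigmoid(score)
    return sorted(candidates, key=lambda c: c["weight"], reverse=True)
```

Since the sigmoid is monotonic, the ordering by weight equals the ordering by raw score; the normalization's role, as the text notes, is to bound the weights to 0-1.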
The foregoing embodiments merely illustrate the technical solutions, features, and advantages of the present invention and are not intended to limit it; any changes, substitutions, and alterations made without departing from the spirit and principles of the invention shall fall within its scope of protection.
Claims (8)
1. A text segmentation retrieval method for large model questions and answers, comprising: constructing a text database and an index; acquiring text data to be retrieved; dividing the text to be retrieved; embedding and word segmentation are carried out on the segmented text data, and preprocessing is carried out on the segmented text, wherein the preprocessing comprises removal of stop words, pinyin supplement and synonym supplement; searching the preprocessed text by adopting a plurality of searching modes through a text database; and rearranging the search result by adopting a rearranging model to obtain a final search result.
2. The text segmentation retrieval method for large model questions and answers as claimed in claim 1, wherein constructing the text database and index comprises: acquiring complete text data and performing segmentation processing on it; splitting the segmented paragraphs into short sentences at sentence-end symbols; recording the membership between each short sentence and its paragraph, generating a unique ID for each paragraph and keeping that ID with each of its short sentences; embedding the short sentences and the paragraphs to obtain their embedded vector representations; and storing the short sentences, the paragraphs, their embedded vectors, and the paragraph ID each short sentence belongs to in a database, and building indexes.
3. A text segmentation retrieval method for large model questions and answers as claimed in claim 2, wherein the segmentation processing of the complete text data comprises: identifying titles in the complete text and segmenting the text by title with a recursive segmentation algorithm; segmenting the body text; identifying picture information and table information in the segmented text and replacing each picture and table with text according to the recognized information; checking the word count of the replaced text data and, if it exceeds the word-count limit, performing repeated overlapping segmentation until the word count meets the requirement; and saving the text block once the word count meets the requirement.
4. A text segmentation retrieval method for large model questions and answers as claimed in claim 3, wherein segmenting the title of the text with a recursive segmentation algorithm comprises: setting a text length threshold for titles; obtaining the word count of the title of the text to be retrieved and comparing it with the threshold; if the title word count is not larger than the threshold, the title is not divided; if it is larger than the threshold, the title is divided to obtain a primary title and a remaining title word count; and comparing the remaining title word count with the threshold, stopping segmentation once the word count falls below the threshold.
5. The text segmentation retrieval method for large model questions and answers as claimed in claim 1, wherein retrieving the preprocessed text using multiple retrieval modes comprises: comparing and ranking the embedded vectors of the short sentences with an ANN algorithm; comparing and ranking the embedded vectors of the paragraphs with an ANN algorithm; comparing and ranking the paragraphs with the BM25 algorithm; and deduplicating the three retrieval results by paragraph ID.
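The three-channel merge can be sketched as a first-seen-wins deduplication by paragraph ID, assuming each hit is a `(para_id, score)` pair (an illustrative shape; the patent does not prescribe one):

```python
def merge_results(*ranked_lists):
    """Deduplicate multiple retrieval channels by paragraph ID, keeping
    the first (best-ranked) hit seen for each paragraph, as claim 5's
    final step suggests."""
    seen, merged = set(), []
    for hits in ranked_lists:
        for para_id, score in hits:
            if para_id not in seen:
                seen.add(para_id)
                merged.append((para_id, score))
    return merged
```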
6. The text segmentation retrieval method for large model questions and answers as claimed in claim 5, wherein comparing and ranking the embedded vectors of the short sentences using an ANN algorithm comprises:
step 1, constructing the data, the data comprising the generated text data, which includes long sentences and short sentences;
step 2, mapping the constructed data to 1024-dimensional vectors through an embedding layer;
step 3, constructing a 20-layer data index from the 1024-dimensional data via HNSW; the index construction process comprises: setting the index construction parameters, including Euclidean distance as the metric calculation method, a maximum connection number of 16, and a maximum of 2 layers per level; traversing the constructed data, randomly assigning each point a layer position, layering the data, and within each layer connecting each point to its nearest points to form a graph; each upper layer forms a layer-to-layer connection by finding the nearest point in the layer below; the data index is obtained once 20 layers of data are formed;
step 4, vectorizing the query data, namely mapping the query data to 1024 dimensions through an embedding layer;
step 5, searching the data index from the top layer down using an approximate nearest neighbor search algorithm, with Euclidean distance as the measurement standard, to obtain the search results;
step 6, sorting the search results by Euclidean distance.
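Steps 5 and 6 ultimately rank candidates by Euclidean distance to the query vector; a brute-force stand-in for the layered HNSW descent (which only approximates this exhaustive ranking) can be sketched as:

```python
import math

def euclidean(a, b):
    """Euclidean distance, the metric named in steps 3 and 5."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query_vec, index_vecs):
    """Score every indexed vector against the query and sort ascending
    by Euclidean distance (exact-search stand-in for HNSW)."""
    scored = [(i, euclidean(query_vec, v)) for i, v in enumerate(index_vecs)]
    return sorted(scored, key=lambda t: t[1])
```

In practice a library such as hnswlib exposes the claimed parameters directly (e.g. `space='l2'` for Euclidean distance and `M=16` for the maximum connection number); the 20-layer structure only accelerates the descent toward the same nearest neighbors.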
7. The text segmentation retrieval method for large model questions and answers as claimed in claim 5, wherein the keyword comparison ranking of the paragraphs using the BM25 algorithm comprises: the keywords include Chinese characters and the pinyin corresponding to those characters; segmenting the segmented text into words according to a word-segmentation vocabulary; removing matched stop words according to a stop-word list; converting the remaining words to pinyin; counting the occurrences of each word; associating sentences with their words to form an inverted index and storing it in a database; when a user inputs a question sentence, performing word segmentation, stop-word removal, and pinyin conversion on the question sentence; matching the Chinese words and pinyin words of the question sentence against the database to find the matching sentences, computing each word's score with the BM25 formula until all words are matched, and summing the scores of all matched words in a sentence to obtain the sentence's final score; and ranking by the final scores of the sentences.
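The per-term score summed in claim 7 follows the standard BM25 formula; a minimal scorer over pre-tokenized documents, using the customary defaults k1=1.5 and b=0.75 (the patent does not specify parameter values), is:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with BM25, as in claim 7's
    per-word scoring; `corpus` is a list of tokenized documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)    # BM25 IDF
        f = doc_terms.count(term)                        # term frequency
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

The claim's pinyin step simply enlarges the term set: each document and query is indexed under both its Chinese words and their pinyin renderings before this scoring is applied.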
8. The text segmentation retrieval method for large model questions and answers as claimed in claim 1, wherein rearranging the retrieval results using the rearrangement model comprises:
step 1, primary retrieval: retrieving related candidate documents with a vocabulary matching model and taking the top N as input;
step 2, preprocessing the candidates and the query text, the preprocessing comprising word segmentation, lower-casing, stop-word removal, and addition of the BERT special symbols;
step 3, constructing the model input sequence, comprising: separating the query and each candidate with "[SEP]", adding the [CLS] and [SEP] symbols, and constructing the input ID sequence;
step 4, feeding the sequence into a pre-trained BERT base model and, after 12 Transformer encoder layers, taking the hidden state sequence output by the last layer;
step 5, extracting the query and candidate hidden states from the corresponding positions, comprising the query state h_q and the candidate state h_c;
step 6, computing the matching score: taking the inner product of h_q and h_c, then adding parallel attention and local features, to obtain the matching score;
step 7, normalizing the score through a sigmoid function to obtain the rearrangement normalization weight;
step 8, reordering the candidate documents according to the rearranged normalization weights;
step 9, outputting the top N as the final result.
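Steps 3 and 7-9 above can be sketched without the BERT forward pass itself, which is stubbed out here as precomputed raw matching scores (an assumption for illustration; a real implementation would obtain them from the step-6 computation over h_q and h_c):

```python
import math

def build_bert_input(query_tokens, cand_tokens):
    """Step-3 sequence layout: [CLS] query [SEP] candidate [SEP]."""
    return ["[CLS]"] + query_tokens + ["[SEP]"] + cand_tokens + ["[SEP]"]

def rerank(candidates, raw_scores, top_n):
    """Steps 7-9: squash raw matching scores through a sigmoid to get
    normalized weights, reorder by weight, and return the top N."""
    weights = [1 / (1 + math.exp(-s)) for s in raw_scores]
    ranked = sorted(zip(candidates, weights), key=lambda t: t[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```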
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311759473.5A CN117591635A (en) | 2023-12-20 | 2023-12-20 | Text segmentation retrieval method for large model question and answer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117591635A true CN117591635A (en) | 2024-02-23 |
Family
ID=89911620
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118210908A (en) * | 2024-05-21 | 2024-06-18 | 上海普华科技发展股份有限公司 | Retrieval enhancement method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||