CN117609418A - Document processing method, device, electronic equipment and storage medium - Google Patents

Publication number
CN117609418A
Authority
CN
China
Prior art keywords
search result
document
search
query text
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311665938.0A
Other languages
Chinese (zh)
Inventor
邱奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311665938.0A
Publication of CN117609418A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a document processing method, a document processing device, electronic equipment and a storage medium, and relates to the technical field of data processing. The specific implementation scheme is as follows: acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents; obtaining paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries, generating vector indexes based on the paragraph word vectors and the summary word vectors, and storing the vector indexes in a search engine cluster; and generating an inverted index based on the candidate documents and the document summaries and storing the inverted index in the search engine cluster.

Description

Document processing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a document processing method, a document processing device, electronic equipment and a storage medium.
Background
With the development of large model technology, keyword-based retrieval can no longer meet usage requirements. Keyword retrieval depends on the user's input: if the input is incomplete, the retrieval results are inaccurate. In addition, because the retrieval is keyword-based, the large model cannot make use of semantics, which is a further limitation.
Disclosure of Invention
The disclosure provides a method, a device, an electronic device and a storage medium for document processing.
According to an aspect of the present disclosure, there is provided a document processing method including: acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents; obtaining paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries, generating a vector index based on the paragraph word vectors and the summary word vectors, and storing the vector index in a search engine cluster; and generating an inverted index based on the candidate documents and the document summaries and storing the inverted index in the search engine cluster.
According to another aspect of the present disclosure, there is provided a document processing apparatus including: an acquisition module, configured to acquire a document set, wherein the document set comprises a plurality of candidate documents, and to determine segmented paragraphs and document summaries of the candidate documents; a first storage module, configured to obtain paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries, generate a vector index based on the paragraph word vectors and the summary word vectors, and store the vector index in a search engine cluster; and a second storage module, configured to generate an inverted index based on the candidate documents and the document summaries and store the inverted index in the search engine cluster.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the document processing method according to an embodiment of the above aspect.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing the computer to execute the document processing method according to the embodiment of the above aspect.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements the document processing method according to the embodiments of the above aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart of a document processing method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another document processing method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of another document processing method according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of another document processing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a structure for storing and retrieving documents provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a document processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a document processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The document processing method provided by the embodiments of the disclosure can be applied to fields such as intention recognition systems, recommendation systems, public opinion systems, retrieval systems, and text classification systems.
The document processing method, device and electronic equipment of the embodiment of the disclosure are described below with reference to the accompanying drawings.
Artificial intelligence (AI) is the discipline of studying how to make a computer simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence software technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language, and it integrates linguistics, computer science, and mathematics. Natural language processing is mainly applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, speech recognition, and the like.
Deep learning (DL) is a research direction in the field of machine learning (ML); it was introduced into machine learning to bring it closer to its original goal, namely artificial intelligence. Deep learning learns the inherent laws and representation hierarchies of sample data, and the information obtained during this learning helps in interpreting data such as text, images, and sounds. Its ultimate goal is to give machines a human-like ability to analyze and learn, so that they can recognize text, image, and sound data. Deep learning is a complex class of machine learning algorithms that achieves far better results in speech and image recognition than earlier related techniques.
Large models are models with a large number of parameters and complex structures in the fields of machine learning and artificial intelligence. These models typically require extensive computational resources to train and deploy, and are capable of processing and understanding larger and more complex data. Large models can be applied in many fields, such as automatic writing, chat robots, virtual assistants, voice assistants, and automatic translation.
Fig. 1 is a schematic flow chart of a document processing method according to an embodiment of the disclosure.
As shown in fig. 1, the document processing method may include:
S101, acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents.
It should be noted that the execution body of the document processing method in the embodiments of the present disclosure may be a hardware device having information processing capability and/or the software necessary to drive that hardware device. Optionally, the execution body may include an in-vehicle terminal, a user terminal, and other intelligent devices. Optionally, the user terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, etc. The embodiments of the present disclosure do not specifically limit this.
In some implementations, any document may be selected from a document database as a candidate document, a document may be downloaded from the Internet as a candidate document, or a document input by the user may be used as a candidate document; the acquired candidate documents together form the document set.
In some implementations, the segmented paragraphs and document summaries of the candidate documents may be generated by a large model. Optionally, a candidate document is input into the large model, which may invoke a segmentation service to segment the candidate document and obtain its segmented paragraphs.
Optionally, the document summary of a candidate document may be generated by the large model based on a prompt for document summarization. Illustratively, a prompt such as "You are a literary expert. Please read the following article and generate a summary of it." is input into the large model together with the candidate document, and the large model outputs the document summary of the candidate document.
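The preparation step in S101 can be sketched as follows. This is a minimal illustration only: the blank-line segmentation rule, the prompt wording, and the names segment_paragraphs and build_summary_prompt are assumptions, since the disclosure does not specify the segmentation service or the exact prompt, and the call to the large model itself is left out.

```python
import re

# Hypothetical prompt template; the disclosure only gives an illustrative prompt.
SUMMARY_PROMPT = (
    "Please read the following article and generate a summary of it.\n\n{document}"
)


def segment_paragraphs(document: str) -> list[str]:
    """Split a document on blank lines (an assumed segmentation rule)."""
    parts = re.split(r"\n\s*\n", document)
    return [part.strip() for part in parts if part.strip()]


def build_summary_prompt(document: str) -> str:
    """Fill the template that would be sent to a large model."""
    return SUMMARY_PROMPT.format(document=document)


doc = "Vector search finds similar text.\n\nInverted indexes map terms to documents."
print(segment_paragraphs(doc))
print(build_summary_prompt(doc))
```

The segmented paragraphs and the model's summary would then be passed on to the encoding step in S102.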
S102, obtaining a paragraph word vector of a segmented paragraph and a summary word vector of a document summary, generating a vector index based on the paragraph word vector and the summary word vector, and storing the vector index in a search engine cluster.
In some implementations, the paragraph word vectors of the segmented paragraphs and the summary word vectors of the document summaries may be obtained by vector-encoding the segmented paragraphs and the document summaries. Optionally, the segmented paragraphs and document summaries may be input into an encoder and encoded to obtain the paragraph word vectors and summary word vectors. For example, the character strings of the segmented paragraphs and document summaries may be converted into paragraph word vectors and summary word vectors, respectively, by the embedding service in FIG. 5.
Further, the index fields for constructing the vector index can be determined by configuring the paragraph word vectors and the summary word vectors, so that vector indexes of the paragraph word vectors and of the summary word vectors are constructed respectively and stored in the search engine cluster.
Illustratively, in FIG. 5, the vector indexes of the paragraph word vectors and the summary word vectors may be stored in the Elasticsearch cluster by a storage service.
It can be appreciated that the vector index is a data structure for efficiently storing and retrieving vector data, which can improve the efficiency of similarity search for large-scale vector data, and can accelerate the query operation for vector data, so that it is possible to quickly find vectors similar to the query vector in a massive data set.
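To make the idea of a vector index concrete, the sketch below is a tiny in-memory stand-in that stores vectors with a document identifier and answers queries by brute-force cosine similarity. It is an assumption-laden illustration: the class name VectorIndex, the field labels, and the use of cosine similarity are hypothetical, and a production search engine cluster would normally use an approximate nearest-neighbor structure rather than a linear scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Hypothetical in-memory vector index using a linear scan."""

    def __init__(self) -> None:
        # Each entry is (document id, field name, vector).
        self._entries: list[tuple[str, str, list[float]]] = []

    def add(self, doc_id: str, field: str, vector: list[float]) -> None:
        self._entries.append((doc_id, field, vector))

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, str, float]]:
        """Return up to top_k entries as (doc_id, field, score), best score first."""
        scored = [(doc_id, field, cosine(query, vec)) for doc_id, field, vec in self._entries]
        scored.sort(key=lambda item: item[2], reverse=True)
        return scored[:top_k]


index = VectorIndex()
index.add("doc-1", "paragraph", [1.0, 0.0])
index.add("doc-2", "summary", [0.0, 1.0])
print(index.search([1.0, 0.1], top_k=1))
```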
S103, generating an inverted index based on the candidate documents and the document summary, and storing the inverted index in a search engine cluster.
In some implementations, the index fields used to construct the inverted index may be determined by configuring the candidate documents and the document summaries, and then the inverted indexes of the candidate documents and of the document summaries may be constructed respectively and stored in the search engine cluster. An inverted index is a data structure that maps terms to the documents containing them, so that documents can be looked up quickly.
Illustratively, in FIG. 5, the inverted indexes of the candidate documents and the document summaries may be stored in the Elasticsearch cluster by a storage service.
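A minimal sketch of an inverted index over document text and summaries is given below. The whitespace-and-punctuation tokenizer, the class name InvertedIndex, and the term-overlap score are assumptions for illustration; a real search engine applies its own analyzers and relevance scoring.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (an assumed analyzer)."""
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Hypothetical inverted index: term -> set of (doc_id, field) postings."""

    def __init__(self) -> None:
        self._postings: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def add(self, doc_id: str, field: str, text: str) -> None:
        for term in tokenize(text):
            self._postings[term].add((doc_id, field))

    def search(self, query: str) -> list[tuple[tuple[str, str], int]]:
        """Return postings ranked by how many distinct query terms they contain."""
        hits: dict[tuple[str, str], int] = defaultdict(int)
        for term in set(tokenize(query)):
            for posting in self._postings.get(term, ()):
                hits[posting] += 1
        return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


idx = InvertedIndex()
idx.add("doc-1", "document", "Vector search with embeddings")
idx.add("doc-1", "summary", "Semantic search")
idx.add("doc-2", "document", "Keyword lookup")
print(idx.search("vector search"))
```

Storing the document identifier in every posting is what lets a hit be traced back to its candidate document.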
According to the document processing method provided by the present disclosure, a document set of candidate documents is acquired, and the candidate documents are segmented and summarized to obtain their segmented paragraphs and document summaries. Paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries are then obtained, and vector indexes are constructed and stored in the search engine cluster; at the same time, inverted indexes of the candidate documents and the document summaries are constructed and stored in the search engine cluster. Because the search engine cluster provides distributed storage, the capacity and reliability of the indexes can be improved, and even if a certain node has a problem, the other index data is not affected.
Fig. 2 is a schematic flow chart of a document processing method according to an embodiment of the disclosure.
As shown in fig. 2, the document processing method may include:
S201, acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents.
S202, obtaining paragraph word vectors of the segmented paragraphs and summary word vectors of the document summary, generating vector indexes based on the paragraph word vectors and the summary word vectors, and storing the vector indexes in a search engine cluster.
S203, generating an inverted index based on the candidate documents and the document summary, and storing the inverted index in a search engine cluster.
For the details of steps S201 to S203, refer to the above embodiments; they are not repeated here.
In some implementations, so that matching documents can be found accurately in the search engine cluster and the accuracy of the search results is improved, each index may be associated with its candidate document: the document identification of the candidate document is determined, and the vector index and the inverted index are associated with that document identification.
Optionally, the vector index and the inverted index may be associated with the document identification based on the document ID (identifier) of the candidate document.
S204, receiving a query text, and searching the search engine cluster for the query text based on the vector index and the inverted index to obtain a first search result of the query text.
In some implementations, a user may enter a query text in order to retrieve search results for it from the search engine cluster. The query text is the user's question or usage requirement. It may be referred to as a query and represents the user's query intent.
In some implementations, for the query text, the search may be performed in a search engine cluster where the vector index and the inverted index are located, respectively, to obtain a second search result of the inverted index and a third search result of the vector index, respectively. And scoring and sorting the second search result and the third search result to obtain a first search result of the query text.
Optionally, the search service may search the search engine cluster for the query text to obtain the second search result and the third search result, sort them in reverse order based on their scores, and select the later-ranked search results as the first search result.
Alternatively, the query text itself and the inverted index may be used to perform inverted search on the search engine cluster to obtain the second search result, and the word vector and the vector index of the query text may be used to perform vector search on the search engine cluster to obtain the third search result.
According to the document processing method provided by the present disclosure, a document set of candidate documents is acquired, and the candidate documents are segmented and summarized to obtain their segmented paragraphs and document summaries. Paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries are then obtained, and vector indexes are constructed and stored in the search engine cluster; at the same time, inverted indexes of the candidate documents and the document summaries are constructed and stored in the search engine cluster. Because the search engine cluster provides distributed storage, the capacity and reliability of the indexes can be improved, and even if a certain node has a problem, the other index data is not affected. Searching the search engine cluster based on the query text enables both semantic search and topic search, which improves the accuracy of the search results and the user experience.
On the basis of the foregoing embodiment, the process of obtaining the first search result of the query text according to the embodiment of the present disclosure may be explained, as shown in fig. 3, where the process of obtaining the first search result of the query text may include:
S301, performing inverted search in a search engine cluster based on the query text and the inverted index to obtain a second search result, wherein the second search result comprises at least one of the searched first segmented paragraph and the first document summary.
In some implementations, query text may be used to perform an inverted search in a search engine cluster, an inverted index of candidate documents and/or an inverted index of document summaries may be retrieved, and at least one of the first segment paragraph and the first document summary may be retrieved as a second search result because the inverted index is associated with a document identification.
S302, acquiring word vectors of the query text.
Optionally, the character string of the query text may be converted into a word vector to obtain the word vector of the query text. For example, the query text may be converted into a word vector by the embedding service.
S303, carrying out vector retrieval in a search engine cluster based on the word vector and the vector index of the query text, and determining a third retrieval result, wherein the third retrieval result comprises at least one of the retrieved first paragraph word vector and the first summary word vector.
In some implementations, a vector search may be performed in a search engine cluster using a word vector of the query text, a vector index of the paragraph word vector and/or a vector index of the summary word vector may be retrieved, and at least one of the first paragraph word vector and the first summary word vector may be retrieved as a third search result because the vector index is associated with a document identification.
S304, determining a first search result of the query text according to the second search result and the third search result.
In some implementations, the second search result may be filtered based on the query text to obtain a fourth search result, so as to improve accuracy of the search result. The similarity between the query text and the second search result can be calculated, and the second search result is screened based on the similarity to obtain a fourth search result.
Optionally, the second search result includes at least one of the searched first segmented paragraph and the first document summary, and a first similarity of the query text and the first segmented paragraph and a second similarity of the query text and the first document summary may be obtained.
Optionally, the first similarity and the second similarity may each be calculated using the edit distance. Edit distance is a measure of the similarity between two character strings: the smaller the edit distance, the greater the similarity.
Further, the first segmented paragraphs and the first document summary are ranked according to the first similarity and the second similarity, and the fourth search result is screened out from the second search result based on the first ranking result, so that the second search result can be optimized, comprehensive and accurate search results are provided, and better search experience is brought to users.
For example, suppose the first search result is to contain K items. The first segmented paragraphs and first document summaries are sorted from largest to smallest by the first similarity and the second similarity, and those corresponding to the top K/2 similarities are selected as the fourth search result. The fourth search result thus contains K/2 items, which serve as the first K/2 items of the first search result.
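The edit-distance screening described above can be sketched as follows. The Levenshtein routine is the standard dynamic-programming formulation; the normalization of distance into a similarity in [0, 1] and the names similarity and select_text_hits are assumptions, since the disclosure only states that a smaller edit distance means a greater similarity.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower as distance grows."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def select_text_hits(query: str, hits: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Rank (doc_id, text) hits by similarity to the query and keep the top K/2."""
    ranked = sorted(hits, key=lambda hit: similarity(query, hit[1]), reverse=True)
    return ranked[: k // 2]


hits = [
    ("doc-1", "vector index"),
    ("doc-2", "inverted index"),
    ("doc-3", "zzz"),
    ("doc-4", "vector indexes"),
]
print(select_text_hits("vector index", hits, k=4))
```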
In some implementations, the third search result may be filtered based on the word vector of the query text to obtain a fifth search result, so as to improve accuracy of the search result. The similarity between the word vector of the query text and the search result can be calculated, and screening is performed based on the similarity to obtain a fifth search result.
Optionally, if the plurality of search engines in the search engine cluster all return the third search result, sorting the first paragraph word vectors and the first summary word vectors returned by the plurality of search engines to obtain a second sorting result. And screening a sixth search result from the third search results returned by the plurality of search engines based on the second ranking result.
Optionally, the third search result may be ranked based on the default scores of the first paragraph word vector and the first summary word vector, to obtain a second ranked result. The default score refers to a similarity score of a search engine to a word vector of a query text, a paragraph word vector and a summary word vector, and the greater the score, the greater the similarity.
Optionally, the results may be sorted from largest to smallest to obtain the second ranking result, and the top-ranked search results are selected as the sixth search result. Alternatively, the results may be sorted from smallest to largest to obtain the second ranking result, and the later-ranked search results are selected as the sixth search result.
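The two equivalent orderings can be sketched as below, under the assumption that each search engine returns hits as (doc_id, field, default_score) tuples; the function names and tuple layout are hypothetical. Both functions keep the same highest-scoring hits and differ only in the order in which they return them.

```python
Hit = tuple[str, str, float]  # (doc_id, field, default_score), an assumed layout


def keep_top_descending(per_engine_hits: list[list[Hit]], keep: int) -> list[Hit]:
    """Merge hits from all engines, sort largest to smallest, keep the front."""
    merged = [hit for hits in per_engine_hits for hit in hits]
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:keep]


def keep_top_ascending(per_engine_hits: list[list[Hit]], keep: int) -> list[Hit]:
    """Merge hits from all engines, sort smallest to largest, keep the back."""
    merged = [hit for hits in per_engine_hits for hit in hits]
    merged.sort(key=lambda hit: hit[2])
    return merged[max(0, len(merged) - keep):] if keep > 0 else []


engine_a = [("doc-1", "paragraph", 0.9), ("doc-2", "summary", 0.4)]
engine_b = [("doc-3", "paragraph", 0.7)]
print(keep_top_descending([engine_a, engine_b], keep=2))
print(keep_top_ascending([engine_a, engine_b], keep=2))
```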
Further, a target document identification associated with each word vector in the sixth search result is determined, and a second summary word vector associated with the target document identification is determined. By acquiring the third similarity of the word vector of the query text and the second summary word vector and screening the fifth search result from the sixth search result based on the third similarity, the third search result can be optimized, a comprehensive and accurate search result is provided, and better search experience is brought to users.
For example, suppose the first search result is to contain K items. The third search result is screened and K/2 results are retained as the sixth search result. The target document identification associated with each word vector in the sixth search result is determined, and from it the associated summary word vector. The similarity between the word vector of the query text and each summary word vector is then calculated, and the results are sorted from smallest to largest to obtain the fifth search result. The fifth search result thus contains K/2 items, which serve as the last K/2 items of the first search result.
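A minimal sketch of this vector-side screening follows. It assumes hits arrive as (doc_id, default_score) pairs, that summary word vectors can be looked up by document identification in a dictionary, and that the third similarity is cosine similarity; the disclosure does not fix the metric, and the name screen_vector_hits is hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def screen_vector_hits(
    query_vector: list[float],
    vector_hits: list[tuple[str, float]],
    summary_vectors: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    # Sixth search result: keep the K/2 hits with the highest default score.
    sixth = sorted(vector_hits, key=lambda hit: hit[1], reverse=True)[: k // 2]
    # Third similarity: query vector against the summary vector of each hit's document.
    rescored = [(doc_id, cosine(query_vector, summary_vectors[doc_id])) for doc_id, _ in sixth]
    # Fifth search result: ordered from smallest to largest, as in the example above.
    rescored.sort(key=lambda hit: hit[1])
    return rescored


hits = [("doc-1", 0.9), ("doc-2", 0.8), ("doc-3", 0.1), ("doc-4", 0.05)]
summaries = {
    "doc-1": [0.0, 1.0],
    "doc-2": [1.0, 0.0],
    "doc-3": [1.0, 1.0],
    "doc-4": [0.5, 0.5],
}
print(screen_vector_hits([1.0, 0.0], hits, summaries, k=4))
```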
Further, the first search result of the query text is obtained based on the fourth search result and the fifth search result. The fourth search result is displayed ahead of the fifth search result; that is, assuming the first search result contains K items, the fourth search result supplies the first K/2 items and the fifth search result supplies the last K/2 items.
It can be understood that displaying the fourth search result, which is based on text search, ahead of the fifth search result, which is based on vector search, can improve both search efficiency and the accuracy of the search results, balancing the two.
In some implementations, since multiple search results may be associated with the same document identification, deduplication needs to be performed on the fourth search result and the fifth search result in order to avoid duplicate search results.
Optionally, the search results in the fourth search result and the fifth search result that are associated with the same document identification are determined and deduplicated to obtain the first search result of the query text. Only one of the search results associated with the same document identification may be retained in the first search result.
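The merge and deduplication steps can be sketched together as below. Keeping the first occurrence of each document identification is an assumption; the disclosure only requires that one result per document identification be retained. The function name and the (doc_id, content) tuple layout are hypothetical.

```python
def merge_and_deduplicate(
    fourth: list[tuple[str, str]],
    fifth: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Place text-search hits before vector-search hits, one result per doc_id."""
    merged: list[tuple[str, str]] = []
    seen: set[str] = set()
    for doc_id, content in [*fourth, *fifth]:
        if doc_id in seen:
            continue  # a result for this document is already retained
        seen.add(doc_id)
        merged.append((doc_id, content))
    return merged


fourth_result = [("doc-1", "paragraph about vector search"), ("doc-2", "summary of indexing")]
fifth_result = [("doc-2", "paragraph about inverted lists"), ("doc-3", "summary of clusters")]
print(merge_and_deduplicate(fourth_result, fifth_result))
```

Because the fourth search result is iterated first, a document found by both searches keeps its text-search entry and its earlier display position.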
According to the document processing method provided by the present disclosure, a document set of candidate documents is acquired, and the candidate documents are segmented and summarized to obtain their segmented paragraphs and document summaries. Paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries are then obtained, and vector indexes are constructed and stored in the search engine cluster; at the same time, inverted indexes of the candidate documents and the document summaries are constructed and stored in the search engine cluster. Because the search engine cluster provides distributed storage, the capacity and reliability of the indexes can be improved, and even if a certain node has a problem, the other index data is not affected. Searching the search engine cluster based on the query text enables both semantic search and topic search, which improves the accuracy of the search results and the user experience.
Fig. 4 is a flowchart of a document processing method according to an embodiment of the present disclosure.
As shown in fig. 4, the document processing method may include:
S401, acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents.
S402, obtaining a paragraph word vector of a segmented paragraph and a summary word vector of a document summary, generating a vector index based on the paragraph word vector and the summary word vector, and storing the vector index in a search engine cluster.
S403, generating an inverted index based on the candidate documents and the document summary, and storing the inverted index in a search engine cluster.
S404, determining the document identification of the candidate document, and associating the vector index and the inverted index with the document identification.
S405, receiving the query text, and performing inverted search in the search engine cluster based on the query text and the inverted index to obtain a second search result.
S406, acquiring word vectors of the query text.
S407, carrying out vector retrieval in a search engine cluster based on the word vector and the vector index of the query text, and determining a third retrieval result.
S408, screening the second search result based on the query text to obtain a fourth search result.
S409, screening the third search result based on the word vector of the query text to obtain a fifth search result.
S410, obtaining a first search result of the query text based on the fourth search result and the fifth search result.
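The index construction in steps S401 to S404 can be sketched with in-memory structures. This is only an illustration under stated assumptions: `embed` is a toy hashed bag-of-words vector standing in for a real embedding model, the `summarize` and `segment` callables stand in for the large model and the segmentation service, and plain Python containers stand in for the search engine cluster.

```python
import math
import re
import zlib
from collections import defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def embed(text, dim=16):
    # Placeholder embedding (assumption): hashed bag of words, L2-normalized.
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_indexes(documents, summarize, segment):
    """Build an inverted index and a vector index, both keyed by document ID."""
    inverted = defaultdict(set)  # term -> set of document IDs
    vectors = []                 # (doc_id, kind, vector) entries
    for doc_id, text in documents.items():
        summary = summarize(text)
        for paragraph in segment(text):
            vectors.append((doc_id, "paragraph", embed(paragraph)))
        vectors.append((doc_id, "summary", embed(summary)))
        # The inverted index covers both the document and its summary.
        for tok in tokenize(text + " " + summary):
            inverted[tok].add(doc_id)
    return inverted, vectors


docs = {
    "d1": "Elasticsearch stores indexes.\n\nVector search finds similar text.",
    "d2": "Inverted indexes map terms to documents.",
}
inverted, vectors = build_indexes(
    docs,
    summarize=lambda t: t.split(".")[0],  # stand-in: first sentence
    segment=lambda t: [p for p in t.split("\n\n") if p.strip()],
)
```

Storing the document identifier with every entry of both indexes is what later lets a hit from either index be traced back to its candidate document.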
Fig. 5 shows a structure diagram for storing and retrieving documents, which includes: a large model 51, a search engine (Elasticsearch) cluster 52, an embedding (encoding) service 53, a segmentation service 54, a storage service 55, and a retrieval service 56.
The large model 51 is used for summarizing the candidate documents to obtain document summaries. The candidate documents are segmented by invoking the segmentation service 54.
The Elasticsearch cluster 52 is used for storing the vector indexes and the inverted indexes.
The embedding service 53 is used for converting character strings into word vectors, for example converting segmented paragraphs into paragraph word vectors, document summaries into summary word vectors, and the query text into the word vector of the query text.
The segmentation service 54 is used for segmenting documents.
The storage service 55 is used for storing documents into the Elasticsearch cluster.
The retrieval service 56 is used for searching the Elasticsearch cluster based on the query text.
Illustratively, a document set comprising a plurality of candidate documents is acquired, the candidate documents are input into the large model 51, which generates a document summary of each candidate document, and the candidate documents are segmented by invoking the segmentation service 54 to obtain segmented paragraphs. The segmented paragraphs and the document summaries are input into the embedding (encoding) service 53 to obtain the paragraph word vectors of the segmented paragraphs and the summary word vectors of the document summaries. Vector indexes are built for the paragraph word vectors and the summary word vectors, respectively, and stored into the Elasticsearch cluster 52 through the storage service 55. Inverted indexes are built for the candidate documents and the document summaries, respectively, and stored into the Elasticsearch cluster 52 through the storage service 55. At the same time, the vector indexes and the inverted indexes are associated with the document identifiers.
A query text (query) of the user is acquired and input into the embedding (encoding) service 53 to obtain the word vector of the query text. The retrieval service 56 uses the word vector of the query text to perform vector retrieval in the Elasticsearch cluster 52, retrieving at least one of a first paragraph word vector and a first summary word vector. The retrieval service 56 also uses the query text to perform inverted retrieval in the Elasticsearch cluster 52, retrieving at least one of a first segmented paragraph and a first document summary. The first search result is determined by scoring and sorting the retrieved results, and the candidate documents that answer the query text are determined from the document identifiers corresponding to the indexes in the first search result.
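The two retrieval paths at query time can be sketched as follows. This is a minimal in-memory illustration, not the Elasticsearch-based implementation: the term-count scoring in `inverted_search`, the cosine ranking in `vector_search`, the `top_k` parameter, and the tiny hand-written indexes are all assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def inverted_search(query_terms, inverted_index):
    """Rank documents by how many query terms they contain (assumed scoring)."""
    hits = {}
    for term in query_terms:
        for doc_id in inverted_index.get(term, ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return sorted(hits, key=lambda d: (-hits[d], d))


def vector_search(query_vec, vector_index, top_k=2):
    """Rank stored paragraph/summary vectors by cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), doc_id, kind) for doc_id, kind, vec in vector_index]
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scored[:top_k]


# Hypothetical indexes: term -> document IDs, and (doc_id, kind, vector) entries.
inverted_index = {"vector": {"d1"}, "index": {"d1", "d2"}, "summary": {"d3"}}
vector_index = [
    ("d1", "paragraph", [1.0, 0.0]),
    ("d2", "summary", [0.6, 0.8]),
    ("d3", "paragraph", [0.0, 1.0]),
]

second_result = inverted_search(["vector", "index"], inverted_index)
third_result = vector_search([1.0, 0.0], vector_index)
```

Both result lists carry document identifiers, so they can subsequently be screened, merged, and mapped back to candidate documents.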
Corresponding to the document processing methods provided in the foregoing embodiments, an embodiment of the present disclosure further provides a document processing apparatus. Because the apparatus corresponds to those methods, the implementations of the document processing method described above also apply to the document processing apparatus and are not repeated in detail in the following embodiments.
Fig. 6 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, a document processing apparatus 600 of an embodiment of the present disclosure includes an acquisition module 601, a first storage module 602, and a second storage module 603.
The acquisition module 601 is configured to acquire a document set, where the document set includes a plurality of candidate documents, and to determine segmented paragraphs and document summaries of the candidate documents.
The first storage module 602 is configured to obtain a paragraph word vector of the segmented paragraph and a summary word vector of the document summary, generate a vector index based on the paragraph word vector and the summary word vector, and store the vector index in a search engine cluster.
The second storage module 603 is configured to generate an inverted index based on the candidate documents and the document summaries, and store the inverted index in the search engine cluster.
In one embodiment of the present disclosure, the second storage module 603 is further configured to: determine the document identifier of the candidate document, and associate the vector index and the inverted index with the document identifier.
In one embodiment of the present disclosure, the apparatus further comprises: a retrieval module configured to receive a query text, and search the search engine cluster for the query text based on the vector index and the inverted index to obtain a first search result of the query text.
In one embodiment of the present disclosure, the retrieval module is further configured to: perform inverted retrieval in the search engine cluster based on the query text and the inverted index to obtain a second search result, wherein the second search result comprises at least one of a retrieved first segmented paragraph and a retrieved first document summary; acquire the word vector of the query text; perform vector retrieval in the search engine cluster based on the word vector of the query text and the vector index to determine a third search result, wherein the third search result comprises at least one of a retrieved first paragraph word vector and a retrieved first summary word vector; and determine the first search result of the query text according to the second search result and the third search result.
In one embodiment of the present disclosure, the retrieving module is further configured to: screening the second search result based on the query text to obtain a fourth search result; screening the third search result based on the word vector of the query text to obtain a fifth search result; and obtaining a first search result of the query text based on the fourth search result and the fifth search result.
In one embodiment of the present disclosure, the retrieving module is further configured to: acquiring a first similarity between the query text and the first segmented paragraph and a second similarity between the query text and the first document summary; and sorting the first segmented paragraphs and the first document summary according to the first similarity and the second similarity, and screening the fourth retrieval result from the second retrieval result based on a first sorting result.
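The screening of the second search result by the first and second similarities can be sketched as follows. The disclosure does not specify the similarity measure in this passage, so the token-level Jaccard similarity, the `top_k` cut-off, and the sample texts are assumptions used purely for illustration.

```python
import re


def jaccard(a, b):
    """Token-set overlap between two strings (assumed similarity measure)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def screen_text_hits(query, paragraphs, summaries, top_k=2):
    # First similarity: query vs. each retrieved segmented paragraph.
    scored = [(jaccard(query, text), "paragraph", text) for text in paragraphs]
    # Second similarity: query vs. each retrieved document summary.
    scored += [(jaccard(query, text), "summary", text) for text in summaries]
    # First sorting result: highest similarity first; keep the top_k hits.
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


fourth_result = screen_text_hits(
    "vector index search",
    paragraphs=["the vector index is stored in the cluster", "documents are segmented"],
    summaries=["search with a vector index"],
)
```

Paragraphs and summaries are scored on the same scale here so that a single sorted list can be truncated to form the fourth search result.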
In one embodiment of the present disclosure, the retrieving module is further configured to: if a plurality of search engines in the search engine cluster each return a third search result, sort the first paragraph word vectors and the first summary word vectors returned by the plurality of search engines to obtain a second sorting result; screen a sixth search result from the third search results returned by the plurality of search engines based on the second sorting result; determine a target document identifier associated with each word vector in the sixth search result, and determine a second summary word vector associated with the target document identifier; and acquire a third similarity between the word vector of the query text and the second summary word vector, and screen the fifth search result from the sixth search result based on the third similarity.
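The two-stage screening of vector hits can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the hit dictionaries with `doc_id` and `score` fields, the `top_k` cut-off, the `min_similarity` threshold, and the use of cosine similarity for the third similarity are all hypothetical choices.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def screen_vector_hits(node_results, query_vec, summary_vectors, top_k=3, min_similarity=0.5):
    # Stage 1: merge the hits returned by all nodes, sort them by retrieval
    # score (second sorting result), and keep the top_k (sixth search result).
    merged = [hit for hits in node_results for hit in hits]
    merged.sort(key=lambda h: h["score"], reverse=True)
    sixth = merged[:top_k]
    # Stage 2: look up the summary vector of each hit's document and keep the
    # hit only if the query is similar enough to it (third similarity).
    fifth = [
        hit for hit in sixth
        if cosine(query_vec, summary_vectors[hit["doc_id"]]) >= min_similarity
    ]
    return sixth, fifth


# Hypothetical third search results from two nodes of the cluster.
node_results = [
    [{"doc_id": "d1", "score": 0.9}, {"doc_id": "d2", "score": 0.4}],
    [{"doc_id": "d3", "score": 0.8}, {"doc_id": "d4", "score": 0.7}],
]
summary_vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.0, 1.0], "d4": [1.0, 1.0]}

sixth_result, fifth_result = screen_vector_hits(node_results, [1.0, 0.0], summary_vectors)
```

Checking each surviving hit against its document's summary vector acts as a topic-level filter: a paragraph can match the query locally while its document as a whole is off-topic.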
In one embodiment of the present disclosure, the retrieving module is further configured to: determine a plurality of search results associated with the same document identifier among the fourth search result and the fifth search result, and perform deduplication on the plurality of search results associated with the same document identifier to obtain the first search result of the query text.
In one embodiment of the present disclosure, the fourth search result is displayed ahead of the fifth search result.
According to the document processing device provided by the present disclosure, a document set of candidate documents is acquired, and each candidate document is segmented and summarized to obtain its segmented paragraphs and document summary. Paragraph word vectors of the segmented paragraphs and summary word vectors of the document summaries are then obtained, and a vector index is built and stored in a search engine cluster; at the same time, an inverted index is built from the candidate documents and the document summaries and stored in the search engine cluster. Because the search engine cluster provides distributed storage, the capacity and reliability of the indexes can be improved, and a problem on one node does not affect other index data. Retrieving from the search engine cluster based on the query text enables both semantic search and topic search, which improves the accuracy of the search results and the user experience.
In the technical solution of the present disclosure, the acquisition, storage, and application of any user personal information involved comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701 that can perform various suitable actions and processes according to computer programs/instructions stored in a Read Only Memory (ROM) 702 or loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the methods and processes described above, for example, the document processing method. For example, in some embodiments, the document processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, some or all of the computer program/instructions may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer programs/instructions are loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the document processing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the document processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs/instructions that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs/instructions running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, which is not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A document processing method, wherein the method comprises:
acquiring a document set, wherein the document set comprises a plurality of candidate documents, and determining segmented paragraphs and document summaries of the candidate documents;
obtaining a paragraph word vector of the segmented paragraph and a summary word vector of the document summary, generating a vector index based on the paragraph word vector and the summary word vector, and storing the vector index in a search engine cluster;
generating an inverted index based on the candidate documents and the document summaries, and storing the inverted index in the search engine cluster.
2. The method of claim 1, wherein the method further comprises:
and determining the document identification of the candidate document, and associating the vector index and the inverted index with the document identification.
3. The method of claim 1, wherein the method further comprises:
receiving a query text, and searching the search engine cluster for the query text based on the vector index and the inverted index to obtain a first search result of the query text.
4. The method of claim 3, wherein the searching the search engine cluster for the query text based on the vector index and the inverted index to obtain a first search result of the query text comprises:
performing inverted search in the search engine cluster based on the query text and the inverted index to obtain a second search result, wherein the second search result comprises at least one of a searched first segment paragraph and a first document summary;
acquiring word vectors of the query text;
Performing vector retrieval in the search engine cluster based on the word vector of the query text and the vector index, and determining a third retrieval result, wherein the third retrieval result comprises at least one of the retrieved first paragraph word vector and the first summary word vector;
and determining a first search result of the query text according to the second search result and the third search result.
5. The method of claim 3, wherein the determining the first search result of the query text based on the second search result and the third search result comprises:
screening the second search result based on the query text to obtain a fourth search result;
screening the third search result based on the word vector of the query text to obtain a fifth search result;
and obtaining a first search result of the query text based on the fourth search result and the fifth search result.
6. The method of claim 5, wherein the screening the second search result based on the query text to obtain a fourth search result comprises:
acquiring a first similarity between the query text and the first segmented paragraph and a second similarity between the query text and the first document summary;
And sorting the first segmented paragraphs and the first document summary according to the first similarity and the second similarity, and screening the fourth retrieval result from the second retrieval result based on a first sorting result.
7. The method of claim 5, wherein the screening the third search result based on the word vector of the query text to obtain a fifth search result comprises:
if a plurality of search engines in the search engine cluster all return a third search result, sequencing the first paragraph word vector and the first summary word vector returned by the plurality of search engines;
screening a sixth search result from third search results returned by the plurality of search engines based on the second ranking result;
determining a target document identifier associated with each word vector in the sixth search result, and determining a second summarized word vector associated with the target document identifier;
and acquiring a third similarity of the word vector of the query text and the second summarized word vector, and screening the fifth search result from the sixth search result based on the third similarity.
8. The method of any of claims 4-7, wherein the obtaining the first search result of the query text based on the fourth search result and the fifth search result comprises:
And determining a plurality of search results related to the same document identifier in the fourth search result and the fifth search result, and performing de-duplication processing on the plurality of search results related to the same document identifier to obtain a first search result of the query text.
9. The method of claim 7, wherein the fourth search result is presented at a location earlier than the fifth search result.
10. A document processing apparatus, wherein the apparatus comprises:
an acquisition module, configured to acquire a document set, wherein the document set comprises a plurality of candidate documents, and to determine segmented paragraphs and document summaries of the candidate documents;
a first storage module, configured to obtain a paragraph word vector of the segmented paragraph and a summary word vector of the document summary, generate a vector index based on the paragraph word vector and the summary word vector, and store the vector index in a search engine cluster;
and a second storage module, configured to generate an inverted index based on the candidate documents and the document summaries, and store the inverted index in the search engine cluster.
11. The apparatus of claim 10, wherein the second storage module is further to:
and determining the document identification of the candidate document, and associating the vector index and the inverted index with the document identification.
12. The apparatus of claim 10, wherein the apparatus further comprises:
a retrieval module, configured to receive a query text, and search the search engine cluster for the query text based on the vector index and the inverted index to obtain a first search result of the query text.
13. The apparatus of claim 12, wherein the retrieval module is further to:
performing inverted search in the search engine cluster based on the query text and the inverted index to obtain a second search result, wherein the second search result comprises at least one of a searched first segment paragraph and a first document summary;
acquiring word vectors of the query text;
performing vector retrieval in the search engine cluster based on the word vector of the query text and the vector index, and determining a third retrieval result, wherein the third retrieval result comprises at least one of the retrieved first paragraph word vector and the first summary word vector;
and determining a first search result of the query text according to the second search result and the third search result.
14. The apparatus of claim 12, wherein the retrieval module is further to:
Screening the second search result based on the query text to obtain a fourth search result;
screening the third search result based on the word vector of the query text to obtain a fifth search result;
and obtaining a first search result of the query text based on the fourth search result and the fifth search result.
15. The apparatus of claim 14, wherein the retrieval module is further to:
acquiring a first similarity between the query text and the first segmented paragraph and a second similarity between the query text and the first document summary;
and sorting the first segmented paragraphs and the first document summary according to the first similarity and the second similarity, and screening the fourth retrieval result from the second retrieval result based on a first sorting result.
16. The apparatus of claim 14, wherein the retrieval module is further to:
if a plurality of search engines in the search engine cluster all return a third search result, sequencing the first paragraph word vector and the first summary word vector returned by the plurality of search engines;
Screening a sixth search result from third search results returned by the plurality of search engines based on the second ranking result;
determining a target document identifier associated with each word vector in the sixth search result, and determining a second summarized word vector associated with the target document identifier;
and acquiring a third similarity of the word vector of the query text and the second summarized word vector, and screening the fifth search result from the sixth search result based on the third similarity.
17. The apparatus of any of claims 13-16, wherein the retrieval module is further to:
and determining a plurality of search results related to the same document identifier in the fourth search result and the fifth search result, and performing de-duplication processing on the plurality of search results related to the same document identifier to obtain a first search result of the query text.
18. The apparatus of claim 16, wherein the fourth search result is presented at a location earlier than the fifth search result.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-9.
CN202311665938.0A 2023-12-06 2023-12-06 Document processing method, device, electronic equipment and storage medium Pending CN117609418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311665938.0A CN117609418A (en) 2023-12-06 2023-12-06 Document processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311665938.0A CN117609418A (en) 2023-12-06 2023-12-06 Document processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117609418A true CN117609418A (en) 2024-02-27

Family

ID=89951386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311665938.0A Pending CN117609418A (en) 2023-12-06 2023-12-06 Document processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117609418A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination