CN115544201A - Multi-granularity full-text retrieval method and device

Info

Publication number
CN115544201A
Authority
CN
China
Prior art keywords
paragraph
retrieval
sentence
document
index library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211263681.1A
Other languages
Chinese (zh)
Inventor
宋瑞霞
金友兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhuoshengyun Technology Co ltd
Original Assignee
Tianjin Zhuoshengyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhuoshengyun Technology Co ltd filed Critical Tianjin Zhuoshengyun Technology Co ltd
Priority to CN202211263681.1A priority Critical patent/CN115544201A/en
Publication of CN115544201A publication Critical patent/CN115544201A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users

Abstract

The invention provides a multi-granularity full-text retrieval method and device. The method comprises the following steps: S1, establishing an index name and constructing a plurality of internal index libraries with different granularities; S2, parsing and saving the document serving as the retrieval object; S3, acquiring an index name, a retrieval range and retrieval parameters; S4, determining the internal index library actually searched according to the index name and the retrieval range, and performing full-text retrieval based on the retrieval parameters to obtain a retrieval result; and S5, returning retrieval results in different formats and orderings according to the retrieval range. The invention creates index libraries of multiple granularities when the index is built and supports deduplication of the data. During retrieval, various modes of retrieval and display can be used flexibly as needed.

Description

Multi-granularity full-text retrieval method and device
Technical Field
The invention belongs to the field of computer data retrieval, and particularly relates to a multi-granularity full-text retrieval method and device.
Background
In the big data era, the volume of text data is growing rapidly, and so is the demand for full-text retrieval. With Elasticsearch, a commonly used full-text search engine, text data is imported into Elasticsearch and can then be searched in full text through an inverted index.
Simple full-text retrieval takes the whole article as its granularity: it searches and returns results at the article level. For example, Elasticsearch can build a large number of documents into an index library, in which each record is stored in JSON format as a number of fields and their corresponding contents. When input keywords are searched against one field (such as the body text field of an article) or several fields, the records that satisfy the search condition are returned, generally sorted by relevance. Several retrieval parameters are usually also available during retrieval; for example, a relation among multiple keywords can be specified (AND, OR or NOT), and each keyword can be matched exactly or fuzzily.
However, retrieval at whole-article granularity cannot satisfy many application scenarios. When articles are long and the user only wants to see the paragraphs or sentences that contain the keywords, the user has to open each article and then locate the keywords within it. It is also difficult to find out how many paragraphs or sentences contain a keyword. Conversely, if the index library is built purely with paragraphs or sentences as units, and a large number of paragraphs or sentences in one article contain the keyword, many of the top hits may be paragraphs or sentences from that same article, which is usually not the kind of result the user wants.
Disclosure of Invention
In view of this, the present invention is directed to a multi-granularity full-text retrieval method and apparatus, which establish an internal index library with multiple granularities, divide a document serving as a retrieval basis with multiple granularities and store the divided document in multiple internal indexes, thereby implementing retrieval with multiple granularities and supporting content deduplication with each granularity.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
A first aspect provides a multi-granularity full-text retrieval method, comprising the following steps:
Step S1, establishing an index name and constructing a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library, each of which is associated with the index name;
Step S2, parsing and saving the document serving as the retrieval object, which comprises the following steps: step S201, parsing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), abstract (abstract) and annotation (annotation); the text data is the body text of the document; the paragraph data and the sentence data are collection data;
step S202, storing the metadata and the text data into the text index library, and obtaining a record ID after storage, wherein the record ID is set as the document number (docID);
step S203, storing the metadata and the paragraph set into the paragraph set index library, with the document number (docID) as the record ID in the library; the paragraph set is obtained from the paragraph data and is an array formed by all paragraphs in the document;
step S204, storing the metadata and the sentence set into the sentence set index library, with the document number (docID) as the record ID in the library; the sentence set is obtained from the sentence data and is an array formed by all sentences in the document;
step S205, traversing each paragraph in the document, and storing each paragraph separately in the pure paragraph index library, wherein the record ID of each paragraph is the Hash digest value of the paragraph content;
step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
Step S3, acquiring an index name, a retrieval range and retrieval parameters; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relations among the keywords, and exact or fuzzy search;
Step S4, determining the internal index library actually searched according to the index name and the retrieval range, and performing full-text retrieval based on the retrieval parameters to obtain a retrieval result;
Step S5, returning retrieval results in different formats and orderings according to the retrieval range.
Further, in step S1, each record in the text index library includes a digest field, which stores the Hash digest value of the document; when a new record is to be inserted, if the Hash digest value of the document already exists, the document is not imported into the internal index libraries.
Further, in step S203 and step S204, the paragraph set and the sentence set are each stored, as embedded data, in an array field of a record in the paragraph set index library and the sentence set index library respectively.
Further, in step S205, for the pure paragraph index library, the Hash digest value is denoted as digest (digest); the digest is used as the record ID in the pure paragraph index library, and paragraph contents are deduplicated according to the digest; the fields in the pure paragraph index library further include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the paragraph, the number of references (docUseNum) is the number of times the paragraph is referenced by documents in the text index library, and the citation document records (refDocs) field is an embedded set whose elements are the document numbers (docID) of the referencing documents.
Further, in step S205, when each paragraph in the document is traversed, the digest (digest) of the paragraph is calculated first; then, if no record with that digest exists in the pure paragraph index library, the number of references (docUseNum) is set to 1, and if such a record exists, the number of references (docUseNum) is incremented by 1; the document number (docID) of the current document is then inserted into the citation document records (refDocs), which are of a set type.
Further, in step S206, for the pure sentence index library, the Hash digest value is denoted as digest (digest); the digest is used as the record ID in the pure sentence index library, and sentence contents are deduplicated according to the digest; the fields in the pure sentence index library further include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the sentence, the number of references (docUseNum) is the number of times the sentence is referenced by documents in the text index library, and the citation document records (refDocs) field is an embedded set whose elements are the document numbers (docID) of the referencing documents.
Further, in step S206, when each sentence in the document is traversed, the digest (digest) of the sentence is calculated first; then, if no record with that digest exists in the pure sentence index library, the number of references (docUseNum) is set to 1, and if such a record exists, the number of references (docUseNum) is incremented by 1; the document number (docID) of the current document is then inserted into the citation document records (refDocs), which are of a set type.
Further, in step S202 to step S206, if any step fails, the process rolls back to step S201, and step S2 then ends.
Further, in step S5, the retrieval results are returned in the following formats and orderings:
in the first mode, the retrieval results of the text index library are sorted by relevance, and each record is a complete record from the text index library, including the metadata and the text content;
in the second mode, the retrieval results of the paragraph set index library and the sentence set index library are sorted by the relevance of the whole document, and each record returns the metadata and the matching paragraphs or sentences from the array;
in the third mode, the retrieval results of the pure paragraph index library and the pure sentence index library are sorted by the relevance of the paragraphs or sentences, and the contents of the individual paragraphs or sentences are returned.
A second aspect provides a multi-granularity full-text retrieval device, including:
an index name and internal index library establishing module, configured to establish the index name and construct a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library;
a document parsing and saving module, configured to parse and save the document to be imported, specifically: step S201, parsing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), abstract (abstract) and annotation (annotation); step S202, storing the metadata and the text content into the text index library, and obtaining a record ID after storage, wherein the record ID is set as the document number (docID); step S203, storing the metadata and the paragraph set into the paragraph set index library, with the document number (docID) as the record ID in the library; the paragraph set is an array formed by all paragraphs in the document; step S204, storing the metadata and the sentence set into the sentence set index library, with the document number (docID) as the record ID in the library; the sentence set is an array formed by all sentences in the document; step S205, traversing each paragraph in the document, and storing each paragraph separately in the pure paragraph index library, wherein the record ID of each paragraph is the Hash digest value of the paragraph content; step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
an index name, retrieval range and retrieval parameter acquiring module, configured to acquire the index name, the retrieval range and the retrieval parameters when a retrieval query is executed; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relations among the keywords, and exact or fuzzy search;
a retrieval result obtaining module, configured to determine the internal index library actually searched according to the index name and the retrieval range, and to perform full-text retrieval based on the retrieval parameters to obtain a retrieval result;
and a retrieval result returning module, configured to return retrieval results in different formats and orderings according to the retrieval range.
Compared with the prior art, the multi-granularity full-text retrieval method and the multi-granularity full-text retrieval device have the following advantages:
the full-text retrieval method establishes internal index libraries of multiple granularities, splits each document serving as the retrieval basis at those granularities and stores the parts in the corresponding internal index libraries. This enables retrieval at multiple granularities, supports content deduplication at each granularity, and provides a more flexible full-text retrieval mode that meets the needs of various retrieval scenarios, addressing the limitation that plain full-text retrieval currently serves only a single kind of scenario.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of a multi-granularity full-text retrieval method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multi-granularity full-text retrieval device according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, an embodiment of the present invention provides a multi-granularity full-text retrieval method, including the following steps:
Step S1, establishing an index name and constructing a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library, each of which is associated with the index name; it should be further noted that, in order to associate the internal index libraries with the index name, the name of each internal index library is formed by appending a specific suffix to the index name;
further, in step S1, each record in the text index library includes a digest field, which stores the Hash digest value of the document; when a new record is to be inserted, if the Hash digest value of the document already exists, the document is not imported into the internal index libraries again; it should be further noted that the Hash digest value is a compact binary value obtained with a Hash algorithm known in the related art, which allows fast lookup on the digest field.
It should be further noted that the digest stored in the digest field is computed from the text of the document with a hash algorithm, namely the sha1 algorithm or the sha256 algorithm. sha1, Secure Hash Algorithm 1, is a cryptographic hash function; sha2, Secure Hash Algorithm 2, is a standard family of cryptographic hash functions, and sha256 is one of the algorithms in the sha2 family. The text of the document yields a unique digest through the algorithm, named the text digest (textDigest). The text index library is queried for a record with that text digest (textDigest); if one exists, the document content has already been stored, and the data import operation ends, which achieves deduplication. The digest is computed from the text of the document because the document has not yet been parsed at this point.
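The document-level deduplication described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `text_digest` and `should_import` are hypothetical, and an in-memory set stands in for the lookup of textDigest records in the text index library.

```python
import hashlib


def text_digest(text, algorithm="sha256"):
    """Return the hex digest of the document text using sha1 or sha256."""
    if algorithm not in ("sha1", "sha256"):
        raise ValueError("algorithm must be 'sha1' or 'sha256'")
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def should_import(text, known_digests):
    """Hypothetical helper: decide whether a document still needs importing.

    known_digests is an in-memory stand-in for the textDigest values already
    stored in the text index library. Returns False for a duplicate;
    otherwise registers the digest and returns True.
    """
    digest = text_digest(text)
    if digest in known_digests:
        return False  # same content already stored: end the import
    known_digests.add(digest)
    return True
```

A real deployment would query the text index library by textDigest instead of consulting a local set.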
S2, analyzing and saving the document as the retrieval object, specifically comprising the following steps:
Step S201, parsing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), abstract (abstract) and annotation (annotation); the text data is the body text of the document; the paragraph data and the sentence data obtained by parsing are collection data.
Step S202, storing the metadata and the text data into the text index library, and obtaining the stored record ID as a document number (docID);
it should be further noted that the text index library may include the following fields: title (title), abstract (abstract), annotation (annotation), content (content), text digest (textDigest), etc.; the content (content) is a text field obtained from the text data; the text digest (textDigest) is the Hash digest value of the content (content); the text index library generates a unique record ID when the record is stored, and this record ID is named the document number (docID) and serves as the address for later access.
Step S203, storing the metadata and the paragraph set into the paragraph set index library, with the document number (docID) as the record ID in the library; the paragraph set is obtained from the paragraph data and is an array formed by all paragraphs in the document. It should be further noted that the record ID in the paragraph set index library also uses the document number (docID); the paragraph set in a record is stored as an embedded array, and each item in the array includes the paragraph content (content), length (size) and paragraph number (No), so that one document produces only one record in the index library. Further, the paragraph set is stored, as embedded data, in an array field of a record in the paragraph set index library.
Step S204, storing the metadata and the sentence set into the sentence set index library, with the document number (docID) as the record ID in the library; the sentence set is obtained from the sentence data and is an array formed by all sentences in the document. It should be further noted that the record ID in the sentence set index library also uses the document number (docID); the sentence set in a record is stored as an embedded array, and each item in the array includes the sentence content (content), length (size) and sentence number (No), so that one document produces only one record in the index library. Further, in step S204, the sentence set is stored, as embedded data, in an array field of a record in the sentence set index library.
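The embedded arrays of steps S203 and S204 can be sketched as follows. This is a minimal illustration, not the patent's parser: the splitting rules (blank lines between paragraphs, sentence-ending punctuation for sentences), the helper names, and numbering items from 1 are all assumptions.

```python
import re


def split_paragraphs(text):
    """Assumed rule: paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Assumed rule: a sentence ends at ., !, ? or the CJK marks 。！？"""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]*", paragraph)
    return [s.strip() for s in parts if s.strip()]


def build_set_items(units):
    """Build the embedded array: one item per paragraph or sentence,
    carrying content, size (length in characters) and No (position)."""
    return [
        {"content": unit, "size": len(unit), "No": number}
        for number, unit in enumerate(units, start=1)
    ]


def build_set_record(doc_id, metadata, units):
    """Hypothetical record layout: one record per document, keyed by docID."""
    return {"docID": doc_id, **metadata, "items": build_set_items(units)}
```

Because the whole array lives in one record keyed by docID, each document yields exactly one record in the paragraph set or sentence set index library.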
Step S205, traversing each paragraph in the document, and storing each paragraph in the pure paragraph index library separately, wherein the record ID of each paragraph is the Hash digest value of the paragraph content;
In order to further ensure the uniqueness of paragraph records during deduplication, in step S205, for the pure paragraph index library, the Hash digest value is computed from the paragraph content with the sha1 algorithm or the sha256 algorithm and is denoted here as digest (digest); the digest is used as the record ID in the pure paragraph index library, and paragraph contents are deduplicated according to the digest; the fields in the pure paragraph index library further include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the paragraph, the number of references (docUseNum) is the number of times the paragraph is referenced by documents in the text index library, and the citation document records (refDocs) field is an embedded set whose elements are the document numbers (docID) of the referencing documents.
In order to further ensure the uniqueness of the records, in step S205, when each paragraph in the document is traversed, the digest (digest) of the paragraph is calculated first; then, if no record with that digest exists in the pure paragraph index library, the number of references (docUseNum) is set to 1, and if such a record exists, the number of references (docUseNum) is incremented by 1; the document number (docID) of the current document is then inserted into the citation document records (refDocs). It should be further noted that, because the citation document records (refDocs) field is of a set type, the same document number (docID) may be inserted multiple times but is kept only once, which ensures the uniqueness of the records.
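The insert-or-update logic of step S205 can be sketched as follows. This is a minimal sketch under assumptions: a plain dict keyed by digest stands in for the pure paragraph index library, sha256 is chosen from the two algorithms the text allows, and the function name `upsert_unit` is hypothetical.

```python
import hashlib


def upsert_unit(library, content, doc_id):
    """Store one paragraph (or sentence) in a pure index library.

    library is an in-memory stand-in: a dict mapping digest -> record.
    The digest of the content is the record ID, so identical content
    from different documents collapses into a single record.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    record = library.get(digest)
    if record is None:
        # First occurrence: create the record with a reference count of 1.
        library[digest] = {
            "content": content,
            "docUseNum": 1,
            "refDocs": {doc_id},
        }
    else:
        # Content already stored: bump the count and note the citing document.
        record["docUseNum"] += 1
        record["refDocs"].add(doc_id)  # a set keeps each docID only once
    return digest
```

The same function applies unchanged to the pure sentence index library of step S206, since both libraries use the same fields.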
Step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
in step S206, for the pure sentence index library, the Hash digest value is computed from the sentence content with the sha1 algorithm or the sha256 algorithm and is denoted as digest (digest); the digest is used as the record ID in the pure sentence index library, and sentence contents are deduplicated according to the digest; the fields in the pure sentence index library also include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the sentence, the number of references (docUseNum) is the number of times the sentence is referenced by documents in the text index library, and the citation document records (refDocs) field is an embedded set whose elements are the document numbers (docID) of the referencing documents.
In step S206, as with saving to the pure paragraph index library, when each sentence in the document is traversed, the digest (digest) of the sentence is calculated first; then, if no record with that digest exists in the pure sentence index library, the number of references (docUseNum) is set to 1, and if such a record exists, the number of references (docUseNum) is incremented by 1; the document number (docID) of the current document is then inserted into the citation document records (refDocs), which are of a set type.
In order to keep the data of the multiple internal index libraries consistent, further, if any of steps S202 to S206 fails, the process rolls back to step S201 and step S2 then ends. It should be further noted that a failure in any of steps S202 to S206 means that parsing and saving of the current document has failed; a document is imported using a transaction mechanism so that the data of the multiple internal index libraries stays consistent.
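The all-or-nothing import can be illustrated with a simple snapshot-and-restore sketch. The patent does not specify how its transaction mechanism works, so everything here is an assumption: the libraries are in-memory dicts, each of steps S202 to S206 is modelled as a callable, and a deep copy taken before the first step serves as the rollback point.

```python
import copy


def import_with_rollback(libraries, steps):
    """Run the save steps in order; undo all of them if any step fails.

    libraries: dict mapping library name -> in-memory library (stand-in).
    steps: callables that each take `libraries` and mutate it.
    Returns True on success, False if the import was rolled back.
    """
    snapshot = copy.deepcopy(libraries)  # state before step S202
    try:
        for step in steps:
            step(libraries)
    except Exception:
        # Restore every library to its state before the import began.
        libraries.clear()
        libraries.update(snapshot)
        return False
    return True
```

Copying whole libraries is only practical for a toy example; a real system would rely on the storage engine's own transactions or on compensating deletes.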
Step S3, acquiring an index name, a retrieval range and retrieval parameters; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relations among the keywords, and exact or fuzzy search;
Step S4, determining the internal index library actually searched according to the index name and the retrieval range, and performing full-text retrieval based on the retrieval parameters to obtain a retrieval result; it should be further noted that the retrieval may use any commonly used retrieval algorithm, and the retrieval result finally obtained is returned to the client to complete the retrieval process.
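Resolving the internal index library from the index name and retrieval range, as in step S4, can be sketched as a lookup of the suffix appended in step S1. The scope keys and suffix strings below are hypothetical; the patent only states that each internal library name is the index name plus a specific suffix.

```python
# Hypothetical mapping from retrieval range to the library-name suffix.
SCOPE_SUFFIXES = {
    "full_text": "_text",
    "paragraph_set": "_paragraph_set",
    "sentence_set": "_sentence_set",
    "pure_paragraph": "_paragraph",
    "pure_sentence": "_sentence",
}


def internal_index_name(index_name, scope):
    """Return the name of the internal index library to search."""
    try:
        suffix = SCOPE_SUFFIXES[scope]
    except KeyError:
        raise ValueError("unknown retrieval range: %r" % (scope,)) from None
    return index_name + suffix
```

For example, under these assumed suffixes a query against index "news" with the pure sentence range would be sent to the library named "news_sentence".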
Step S5, returning retrieval results in different formats and orderings according to the retrieval range. Further, in step S5, the retrieval results are returned in the following formats and orderings:
in the first mode, the retrieval results of the text index library are sorted by relevance, and each record is a complete record from the text index library, including the metadata and the text content;
in the second mode, the retrieval results of the paragraph set index library and the sentence set index library are sorted by the relevance of the whole document, and each record returns the metadata and the matching paragraphs or sentences from the array;
it should be further noted that, because the paragraph set and the sentence set are stored as embedded data, the retrieval results of the paragraph set index library and the sentence set index library are ordered by the relevance of the whole document, while each record returns the metadata and the matching paragraphs or sentences from the array; the matching paragraphs or sentences are themselves sorted by their respective relevance. Meanwhile, to prevent a single document from returning too many hit paragraphs or sentences, each document record generally returns at most a fixed number of hit paragraphs or sentences.
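The second result mode can be sketched as a two-level sort with a per-document cap. The input shape (a `score` per document and per hit), the function name and the default cap of 3 are assumptions for illustration; the patent does not state how scores are computed or what the limit is.

```python
def format_set_results(doc_hits, max_hits_per_doc=3):
    """Order documents by document relevance, hits by their own relevance.

    doc_hits: list of dicts with keys "docID", "score", "metadata" and
    "hits", where each hit is a dict carrying at least a "score".
    At most max_hits_per_doc hits are returned for each document.
    """
    results = []
    for doc in sorted(doc_hits, key=lambda d: d["score"], reverse=True):
        hits = sorted(doc["hits"], key=lambda h: h["score"], reverse=True)
        results.append({
            "docID": doc["docID"],
            "metadata": doc["metadata"],
            "hits": hits[:max_hits_per_doc],  # cap the hits per document
        })
    return results
```

The cap keeps one document with many matching paragraphs from crowding the result page while still showing its best matches.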
In the third mode, the retrieval results of the pure paragraph index library and the pure sentence index library are sorted by the relevance of the paragraphs or sentences, and the contents of the individual paragraphs or sentences are returned.
As one preferred embodiment: for pure paragraph or pure sentence retrieval, the individual paragraph or sentence contents are returned sorted by relevance, together with the number of references (docUseNum) and the citation document records (refDocs). These indicate how widely a paragraph or sentence is used and which documents cite it.
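The third result mode can be sketched with a toy search over an in-memory pure library. The relevance score here (total keyword occurrences) and the function name are assumptions standing in for whatever ranking the underlying search engine provides; only the returned fields follow the text.

```python
def search_pure_library(library, keywords, relation="and"):
    """Naive keyword search over a pure paragraph or sentence library.

    library: dict mapping digest -> {"content", "docUseNum", "refDocs"}.
    relation: "and" requires every keyword, "or" requires at least one.
    Results are sorted by an assumed score: total keyword occurrences.
    """
    if relation not in ("and", "or"):
        raise ValueError("relation must be 'and' or 'or'")
    results = []
    for digest, record in library.items():
        text = record["content"].lower()
        counts = [text.count(keyword.lower()) for keyword in keywords]
        matched = all(counts) if relation == "and" else any(counts)
        if matched:
            results.append({
                "digest": digest,
                "content": record["content"],
                "score": sum(counts),
                "docUseNum": record["docUseNum"],
                "refDocs": sorted(record["refDocs"]),
            })
    results.sort(key=lambda r: r["score"], reverse=True)
    return results
```

Each hit carries docUseNum and refDocs alongside the content, so a caller can show how often a paragraph is reused and link to the citing documents.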
The principle of the full-text retrieval method is as follows: internal index libraries of multiple granularities are established, and each document serving as the retrieval basis is split at those granularities and stored in the corresponding internal index libraries, which enables retrieval at multiple granularities and supports content deduplication at each granularity; during retrieval, the internal index library actually searched is determined according to the index name and the retrieval range, so that the search is directed to the text index library, the paragraph set index library, the pure paragraph index library, the sentence set index library or the pure sentence index library of the corresponding granularity; this provides a more flexible full-text retrieval mode that meets the needs of various retrieval scenarios and addresses the limitation that plain full-text retrieval currently serves only a single kind of scenario.
The advantages of the full-text retrieval method provided by the invention are as follows: index libraries of multiple granularities are created when the index is built, and deduplication of the data is supported. During retrieval, multiple modes of retrieval and display can be used flexibly as needed, so that a user can better discover the relations among documents, paragraphs and sentences, how the paragraphs and sentences are used, how many times they are referenced, and so on.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a multi-granularity full-text retrieval device, as shown in fig. 2, where the device includes:
an index name and internal index library establishing module 101, configured to establish an index name and construct a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library;
a document parsing and saving module 102, configured to parse and save a document that needs to be imported, which specifically includes: step S201, parsing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), summary (abstract), annotation (annotation); step S202, storing the metadata and the text content into the text index library, and obtaining a record ID after storage, wherein the record ID is set as the document number (docID); step S203, storing the metadata and the paragraph set into the paragraph set index library, with the record ID in the library being the document number (docID); the paragraph set is an array formed by all paragraphs in the document; step S204, storing the metadata and the sentence set into the sentence set index library, with the record ID in the library being the document number (docID); the sentence set is an array formed by all sentences in the document; step S205, traversing each paragraph in the document, and storing each paragraph separately in the pure paragraph index library, wherein the record ID of each paragraph is the Hash digest value of the paragraph content; step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
an index name, retrieval range and retrieval parameter obtaining module 103, configured to obtain an index name, a retrieval range and retrieval parameters when performing a retrieval query; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relation among the keywords, and exact search or fuzzy search;
a retrieval result obtaining module 104, configured to determine the internal index library for actual retrieval according to the index name and the retrieval range, and perform full-text retrieval based on the retrieval parameter to obtain a retrieval result;
and a retrieval result returning module 105, configured to return the retrieval results in different formats and in different orders according to the retrieval range.
The multi-granularity full-text retrieval device provided by the invention creates index libraries of multiple granularities when the index libraries are constructed, and supports data deduplication. During retrieval, results can be retrieved and displayed flexibly in several modes as needed, so that a user can better discover the relations among documents, paragraphs and sentences, how the paragraphs and sentences are used, the number of times they are referenced, and so on.
To further illustrate the multi-granularity full-text retrieval method of the present invention, an example follows in which Elasticsearch is used as the full-text retrieval engine; the Elasticsearch engine provides APIs for adding, updating and retrieving data in an index library. The specific operation method comprises the following steps:
(I) Construction and data import of the plurality of internal index libraries with different granularities are carried out according to steps S1 to S2 and steps S201 to S206, as follows:
Initializing the associated index libraries of an index name. For example, for the index name sport, the plurality of internal index libraries of different granularities are created in Elasticsearch as: the text index library (sport), the paragraph set index library (sport-paragraphs), the pure paragraph index library (sport-pure-paragraph), the sentence set index library (sport-sentences) and the pure sentence index library (sport-pure-sentence).
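As a minimal sketch of how the five internal index library names could be derived from one index name, the following Python snippet maps each retrieval range to a name suffix. The scope keys, the exact suffix spellings and the function name `internal_index_names` are illustrative assumptions; only the idea of one external index name fanning out to five internal libraries comes from the text.

```python
# Assumed mapping from retrieval range to index-name suffix (illustrative only).
SCOPE_SUFFIXES = {
    "full_text": "",                      # text index library
    "paragraph_set": "-paragraphs",       # paragraph set index library
    "pure_paragraph": "-pure-paragraph",  # pure paragraph index library
    "sentence_set": "-sentences",         # sentence set index library
    "pure_sentence": "-pure-sentence",    # pure sentence index library
}


def internal_index_names(index_name):
    """Return the internal index library name for every retrieval range."""
    return {scope: index_name + suffix
            for scope, suffix in SCOPE_SUFFIXES.items()}


if __name__ == "__main__":
    for scope, name in internal_index_names("sport").items():
        print(f"{scope:15s} -> {name}")
```

Keeping the mapping in one table means the same lookup can serve both index creation and, later, locating the real index library for a query.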
Relevant document contents are imported into the internal index libraries. Each document is parsed into metadata, text content, a paragraph set and a sentence set. The document may be a user's own document or content such as web pages and articles collected by crawler software; it is only necessary that the relevant metadata and text content can be obtained from it.
The documents are parsed. For each parsed document the metadata may include a title (title) and an abstract (abstract). If the title (title) is missing, the file name or web page name may be used; if there is no abstract (abstract), one is generated from the text using existing natural language processing techniques. In addition, a text digest (textDigest) is generated from the body content; the text digest (textDigest) can be generated using the SHA-256 algorithm.
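The text digest can be computed with Python's standard `hashlib` module. The sketch below assumes the body text is hashed as UTF-8 bytes without any normalization; the function name `text_digest` is hypothetical, and whether whitespace or case should be normalized first is not specified by the description.

```python
import hashlib


def text_digest(text: str) -> str:
    """Return the SHA-256 digest of the body text as a 64-character hex string."""
    # Assumption: the text is encoded as UTF-8 and hashed as-is.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    body = "The final was decided in extra time."
    # Identical body text always yields the same digest, which is what
    # makes the digest usable as a deduplication key.
    print(text_digest(body) == text_digest(body))
```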
Through the query interface of Elasticsearch, the text index library (sport) is searched for a record that already holds the text digest (textDigest). If such a record exists, the document content has already been stored, so the data import operation for this document ends, which achieves deduplication. If no record with the text digest (textDigest) is found, the following steps continue.
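One possible form of this lookup is an Elasticsearch `term` query on the digest field. The sketch below only builds the request body and interprets a response, without contacting a server. It assumes that `textDigest` is mapped as a `keyword` field and that the response uses the Elasticsearch 7+ shape in which `hits.total` is an object with a `value`; both helper names are hypothetical.

```python
def digest_lookup_body(digest):
    """Build a search body that counts records holding the given textDigest."""
    # size=0: only the total count is needed, not the matching records.
    return {"size": 0, "query": {"term": {"textDigest": digest}}}


def already_imported(search_response):
    """Return True if the search response reports at least one matching record."""
    # Assumption: Elasticsearch 7+ response, where hits.total is {"value", "relation"}.
    return search_response["hits"]["total"]["value"] > 0


if __name__ == "__main__":
    body = digest_lookup_body("ba7816bf...")   # shortened digest, for display only
    print(body)
    fake_response = {"hits": {"total": {"value": 0, "relation": "eq"}, "hits": []}}
    print("skip import:", already_imported(fake_response))
```

The body would be sent to the text index library (for example `sport`) through whichever client the application uses; the import continues only when `already_imported` returns `False`.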
The metadata and the text data are combined into JSON format, in which the field name for the text is content (content) and the field names and contents of the other metadata are unchanged, that is, the names and corresponding contents of the title (title), abstract (abstract) and annotation (annotation) are kept. The whole JSON data is stored in the text index library (sport) through the insertion interface of Elasticsearch, and the ID of the resulting record is obtained and referred to as the document number (docID).
The metadata and the paragraph set data are combined into JSON format, in which the paragraph set field name is paragraphs (paragraphs) and its content is an array; each element of the array contains 3 sub-fields, namely paragraph content (content), paragraph length (size) and paragraph number (No). The metadata, the paragraphs (paragraphs) and the document number (docID) together form the whole JSON data, which is stored as one record in the paragraph set index library (sport-paragraphs) through the insertion interface of Elasticsearch; the document number (docID) is also the ID of this record.
The metadata and the sentence set data are combined into JSON format, in which the sentence set field name is sentences (sentences) and its content is an array; each element of the array contains 3 sub-fields, namely sentence content (content), sentence length (size) and sentence number (No). The metadata, the sentences (sentences) and the document number (docID) together form the whole JSON data, which is stored as one record in the sentence set index library (sport-sentences) through the insertion interface of Elasticsearch; the document number (docID) is also the ID of this record, so that the record IDs of the same document in the text index library, the paragraph set index library and the sentence set index library are all the document number (docID).
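The three JSON records described above can be illustrated with plain Python dictionaries. This is a sketch under stated assumptions: the function name `build_records`, the array field names `paragraphs` and `sentences`, numbering that starts at 1, and `size` measured in characters are all illustrative choices rather than details confirmed by the description.

```python
import json


def build_records(metadata, text, paragraphs, sentences, doc_id):
    """Build the records for the text, paragraph set and sentence set libraries."""

    def as_items(parts):
        # Each element carries content, size and No, as described in the text.
        return [{"content": part, "size": len(part), "No": number}
                for number, part in enumerate(parts, start=1)]

    # The text record has no docID field: its ID is assigned on insertion
    # and is then reused as docID for the other two records.
    text_record = {**metadata, "content": text}
    paragraph_record = {**metadata, "paragraphs": as_items(paragraphs), "docID": doc_id}
    sentence_record = {**metadata, "sentences": as_items(sentences), "docID": doc_id}
    return text_record, paragraph_record, sentence_record


if __name__ == "__main__":
    meta = {"title": "Cup final", "abstract": "Match report", "annotation": ""}
    records = build_records(meta, "First. Second.", ["First. Second."],
                            ["First.", "Second."], doc_id="doc-1")
    print(json.dumps(records[2], indent=2))
```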
For the paragraph set, each paragraph is traversed and a digest (digest) of the paragraph is generated from the paragraph content with the SHA-256 algorithm. The pure paragraph index library (sport-pure-paragraph) is then queried for a record whose ID is the digest (digest): first, if no such record exists, a record is inserted whose ID is the digest (digest) and whose fields are content, size, number of references (docUseNum) and citation document records (refDocs), where content is the paragraph content, size is the paragraph length, the number of references (docUseNum) is 1 at this point, and the citation document records (refDocs) are an array whose first element is the document number (docID) of the current document; secondly, if a record whose ID is the digest (digest) already exists, the record is updated in the index library, the number of references (docUseNum) is incremented by one, and the document number (docID) is appended to the citation document records (refDocs) array; thirdly, for every paragraph in the document, a corresponding record is thus added or updated in the pure paragraph index library (sport-pure-paragraph). In this way each paragraph is deduplicated, and the number of references (docUseNum) and the citation document records (refDocs) are maintained.
For the sentence set, each sentence is traversed and, in the same way as for the paragraph set, the corresponding record is added to or updated in the pure sentence index library (sport-pure-sentence).
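The insert-or-update logic for pure paragraphs and pure sentences can be sketched with an ordinary dictionary standing in for the index library, keyed by the SHA-256 digest. The function names `upsert_pure_unit` and `import_units` are hypothetical, and a real system would issue the equivalent get, insert and update calls against the search engine instead.

```python
import hashlib


def upsert_pure_unit(index, content, doc_id):
    """Insert or update one paragraph or sentence record keyed by its digest."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    record = index.get(digest)
    if record is None:
        # First occurrence: create the record with one reference.
        index[digest] = {"content": content, "size": len(content),
                         "docUseNum": 1, "refDocs": [doc_id]}
    else:
        # Content already stored: count the reference and note the citing document.
        record["docUseNum"] += 1
        record["refDocs"].append(doc_id)
    return digest


def import_units(index, units, doc_id):
    """Apply the upsert to every paragraph (or sentence) of one document."""
    return [upsert_pure_unit(index, unit, doc_id) for unit in units]


if __name__ == "__main__":
    pure_paragraphs = {}   # stands in for the pure paragraph index library
    import_units(pure_paragraphs, ["Shared intro.", "Only in A."], "doc-A")
    import_units(pure_paragraphs, ["Shared intro."], "doc-B")
    for record in pure_paragraphs.values():
        print(record["content"], record["docUseNum"], record["refDocs"])
```

This sketch appends `doc_id` on every occurrence, so a paragraph repeated inside one document is counted twice; whether such repeats should be collapsed is a design decision the description leaves open.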
(II) Content retrieval based on the internal index libraries with different granularities is carried out according to steps S3 to S5, specifically as follows:
1. Specify the index name and retrieval range of the search and provide the retrieval parameters. When searching, the user does not need to know that several associated index libraries exist internally and only uses one uniform index name, such as sport; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise the keywords, the relation among the keywords, exact or fuzzy search, and the like;
2. The back end determines the internal index library from the index name and the retrieval range, and then performs the final retrieval operation based on the retrieval parameters. Since each retrieval range corresponds to a specific associated index library, the real index library name can be located from the index name and the retrieval range. Finally, the full-text retrieval is carried out through the search interface of Elasticsearch, and the search results are returned to the client to complete the retrieval process. Different retrieval ranges produce different results, as follows:
a) For the results of the text index library, records are sorted by relevance, and each record is the complete record in the text index library, including the metadata and the text content.
b) For the results of the paragraph set and sentence set index libraries, the returned records are first sorted by the relevance of the whole document, and each record then contains the array of matching paragraphs or sentences, which is sorted by the relevance of each paragraph or sentence rather than by paragraph number. Meanwhile, to prevent a single document from returning too many hit paragraphs or sentences, each document record generally returns at most a fixed number of hit paragraphs or sentences.
c) For the retrieval of pure paragraphs or pure sentences, the returned results are the individual paragraph or sentence contents sorted by content relevance, and they also contain the number of references (docUseNum) and the citation document records (refDocs). In this way the popularity of the paragraph or sentence and the documents that cite it can be seen.
d) With retrieval in this manner, only one index name is exposed externally, yet it can dynamically present several kinds of search results, completing full-text retrieval operations at different granularities.
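The retrieval steps above can be sketched as a function that turns an index name, a retrieval range and retrieval parameters into a target index and an Elasticsearch request body. Several points are assumptions rather than part of the described method: the suffix spellings, the array field names, mapping the paragraph and sentence arrays as `nested` fields so that `inner_hits` can return a capped list of matching elements ordered by score, and treating "exact" search as a plain `match` query without fuzziness. The function name `build_search` is hypothetical.

```python
# Assumed suffixes and nested paths (illustrative, not confirmed by the source).
INDEX_SUFFIX = {"full_text": "", "paragraph_set": "-paragraphs",
                "pure_paragraph": "-pure-paragraph",
                "sentence_set": "-sentences", "pure_sentence": "-pure-sentence"}
NESTED_PATH = {"paragraph_set": "paragraphs", "sentence_set": "sentences"}


def build_search(index_name, scope, keywords, relation="and",
                 fuzzy=False, max_hits_per_doc=3):
    """Return (internal index name, search body) for one retrieval request."""
    index = index_name + INDEX_SUFFIX[scope]
    path = NESTED_PATH.get(scope)
    field = f"{path}.content" if path else "content"

    # relation is "and" or "or": how the keywords are combined.
    match = {"query": " ".join(keywords), "operator": relation}
    if fuzzy:
        match["fuzziness"] = "AUTO"   # let the engine choose the edit distance
    query = {"match": {field: match}}

    if path:
        # Set libraries: match inside the embedded array and return only the
        # best-scoring elements of each document through inner_hits.
        query = {"nested": {"path": path, "query": query,
                            "inner_hits": {"size": max_hits_per_doc}}}
    return index, {"query": query}


if __name__ == "__main__":
    print(build_search("sport", "pure_sentence", ["penalty", "shootout"], fuzzy=True))
    print(build_search("sport", "paragraph_set", ["league", "title"], relation="or"))
```

The caller only ever supplies the external index name; the choice of internal library and query shape stays on the back end, which matches the single-index-name behaviour described in item d).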
It should be further noted that (I) the construction and data import of the internal index libraries with different granularities and (II) the content retrieval based on the internal index libraries with different granularities may be mutually independent processes; content retrieval means that, when a user performs a full-text retrieval, the user inputs the relevant keywords and the hit record contents are returned from the corresponding index libraries.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A multi-granularity full-text retrieval method, characterized by comprising the following steps:
S1, establishing an index name and constructing a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library, all of which are associated with the index name;
S2, parsing and saving the document serving as the retrieval object, specifically comprising the following steps:
step S201, analyzing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), summary (abstract), annotation (annotation); the text data is the text of the document; the paragraph data and the sentence data are set data;
step S202, storing the metadata and the text data into the text index library, and obtaining a record ID after storage, wherein the record ID is set as a document number (docID);
step S203, storing the metadata and the paragraph set into the paragraph set index library, and recording the ID in the library as a document number (docID); the paragraph set is obtained according to the paragraph data and is an array formed by all paragraphs in the document;
step S204, storing the metadata and the sentence set into the sentence set index library, with the record ID in the library being the document number (docID); the sentence set is obtained from the sentence data and is an array formed by all sentences in the document;
step S205, traversing each paragraph in the document, and storing each paragraph in the pure paragraph index library separately, wherein the record ID of each paragraph is the Hash digest value of the paragraph content;
step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
S3, acquiring an index name, a retrieval range and retrieval parameters; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relation among the keywords, and exact search or fuzzy search;
S4, determining the internal index library actually searched according to the index name and the retrieval range, and performing full-text retrieval based on the retrieval parameters to obtain a retrieval result;
and S5, returning retrieval results with different formats and sequences according to the retrieval range.
2. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S1, each record in the text index library comprises a digest field, and the digest field stores the Hash digest value of the document; when a new record is to be inserted, if the Hash digest value of the document already exists, the document is not imported into the internal index libraries.
3. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S202 and step S203, the paragraph set and the sentence set are stored as an array field in a record in the paragraph set index library and the sentence set index library in the form of embedded data.
4. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S205, for the pure paragraph index library, the Hash digest value is recorded as a digest (digest), the digest (digest) is used as the record ID in the pure paragraph index library, and the paragraph content is deduplicated according to the digest (digest); the fields in the pure paragraph index library further include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the paragraph, the number of references (docUseNum) is the number of times the paragraph is referenced by documents in the text index library, and the citation document records (refDocs) are an embedded set whose elements are the document numbers (docID) of the documents.
5. The multi-granularity full-text retrieval method according to claim 4, characterized in that: in step S205, when each paragraph in the document is traversed, the digest (digest) of the paragraph is first calculated; secondly, if no record with the digest (digest) exists in the pure paragraph index library, the number of references (docUseNum) is set to 1, and if a record with the digest (digest) exists in the pure paragraph index library, the number of references (docUseNum) is incremented by 1; and the document number (docID) of the current document is inserted into the citation document records (refDocs), wherein the citation document records (refDocs) are of a collection type.
6. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S206, for the pure sentence index library, the Hash digest value is recorded as a digest (digest), the digest (digest) is used as the record ID in the pure sentence index library, and the sentence content is deduplicated according to the digest (digest); the fields in the pure sentence index library further include content (content), number of references (docUseNum) and citation document records (refDocs); the content (content) is the content of the sentence, the number of references (docUseNum) is the number of times the sentence is referenced by documents in the text index library, and the citation document records (refDocs) are an embedded set whose elements are the document numbers (docID) of the documents.
7. The multi-granularity full-text retrieval method according to claim 6, characterized in that: in step S206, when each sentence in the document is traversed, the digest (digest) of the sentence is first calculated; secondly, if no record with the digest (digest) exists in the pure sentence index library, the number of references (docUseNum) is set to 1, and if a record with the digest (digest) exists in the pure sentence index library, the number of references (docUseNum) is incremented by 1; and the document number (docID) of the current document is inserted into the citation document records (refDocs), wherein the citation document records (refDocs) are of a collection type.
8. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S202 to step S206, if any step fails, the process returns to step S201, and then the process ends in step S2.
9. The multi-granularity full-text retrieval method according to claim 1, characterized in that: in step S5, the search result includes the following format and sorting manner:
in the first mode, the retrieval results of the text index library are sorted by relevance, and each record is the complete record in the text index library, comprising the metadata and the text content;
in the second mode, the retrieval results of the paragraph set index library and the sentence set index library are sorted by the relevance of the whole document, and each record returns the metadata and the matching paragraphs or sentences from the array;
in the third mode, the retrieval results of the pure paragraph index library and the pure sentence index library are sorted by the relevance of the paragraphs or sentences, and the contents of the individual paragraphs or sentences are returned.
10. A multi-granularity full-text retrieval device, characterized by comprising:
an index name and internal index library establishing module, configured to establish an index name and construct a plurality of internal index libraries with different granularities; the internal index libraries comprise a text index library, a paragraph set index library, a pure paragraph index library, a sentence set index library and a pure sentence index library, all of which are associated with the index name;
a document parsing and saving module, configured to parse and save a document to be imported, which specifically comprises: step S201, parsing the document to obtain metadata, text data, paragraph data and sentence data of the document; the metadata includes title (title), summary (abstract), annotation (annotation); step S202, storing the metadata and the text content into the text index library, and obtaining a record ID after storage, wherein the record ID is set as the document number (docID); step S203, storing the metadata and the paragraph set into the paragraph set index library, with the record ID in the library being the document number (docID); the paragraph set is an array formed by all paragraphs in the document; step S204, storing the metadata and the sentence set into the sentence set index library, with the record ID in the library being the document number (docID); the sentence set is an array formed by all sentences in the document; step S205, traversing each paragraph in the document, and storing each paragraph separately in the pure paragraph index library, wherein the record ID of each paragraph is the Hash digest value of the paragraph content; step S206, traversing each sentence in the document, and storing each sentence separately in the pure sentence index library, wherein the record ID of each sentence is the Hash digest value of the sentence content;
an index name, retrieval range and retrieval parameter obtaining module, configured to obtain the index name, the retrieval range and the retrieval parameters when a retrieval query is executed; the retrieval range comprises full text, paragraph set, sentence set, pure paragraph and pure sentence; the retrieval parameters comprise keywords, the relation among the keywords, and exact search or fuzzy search;
a retrieval result obtaining module, configured to determine the internal index library actually searched according to the index name and the retrieval range, and to perform full-text retrieval based on the retrieval parameters to obtain a retrieval result;
and a retrieval result returning module, configured to return the retrieval results in different formats and in different orders according to the retrieval range.
CN202211263681.1A 2022-10-16 2022-10-16 Multi-granularity full-text retrieval method and device Pending CN115544201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211263681.1A CN115544201A (en) 2022-10-16 2022-10-16 Multi-granularity full-text retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211263681.1A CN115544201A (en) 2022-10-16 2022-10-16 Multi-granularity full-text retrieval method and device

Publications (1)

Publication Number Publication Date
CN115544201A true CN115544201A (en) 2022-12-30

Family

ID=84735526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211263681.1A Pending CN115544201A (en) 2022-10-16 2022-10-16 Multi-granularity full-text retrieval method and device

Country Status (1)

Country Link
CN (1) CN115544201A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination