CN113449063A - Method and device for constructing document structure information retrieval library - Google Patents

Info

Publication number
CN113449063A
Authority
CN
China
Prior art keywords
document
sample
vector
retrieved
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110708173.9A
Other languages
Chinese (zh)
Other versions
CN113449063B (en
Inventor
沈鹏
陈垚亮
王俞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rootcloud Technology Co Ltd
Original Assignee
Rootcloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rootcloud Technology Co Ltd filed Critical Rootcloud Technology Co Ltd
Priority to CN202110708173.9A priority Critical patent/CN113449063B/en
Publication of CN113449063A publication Critical patent/CN113449063A/en
Application granted granted Critical
Publication of CN113449063B publication Critical patent/CN113449063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for constructing a document structure information retrieval library. The method comprises the following steps: determining the domain subdivision item of each collected sample document; for each determined domain subdivision item, extracting the segmented words of the sample documents of that domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words; dividing the sample documents of the domain subdivision item according to document type, extracting the document structured information of each sample document, and generating a document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library; reducing the dimension of the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document; and constructing a document structure information retrieval library according to the domain subdivision item keyword library, preset domain subdivision item codes and the document structured information dimension reduction vectors of the sample documents. The accuracy of document retrieval can thereby be improved.

Description

Method and device for constructing document structure information retrieval library
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method and a device for constructing a document structure information retrieval library.
Background
With the continuing digitization of industrial enterprises, many industrial enterprises hold a large number of documents such as descriptions, process documents and specifications. For reasons of data security, industrial enterprises generally choose to develop internal office and business systems for their own domains, and share and query documents within those systems.
However, a current document retrieval library stores only documents and their keywords, and a query is answered by matching keywords extracted from the short document input by the user against the stored keywords. Because retrieval relies on keyword hits alone, the precision of document retrieval is low, and the requirement of refined document retrieval cannot be met.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for constructing a document structure information retrieval library, so as to improve the accuracy of document retrieval performed with the constructed document structure information retrieval library.
In a first aspect, an embodiment of the present invention provides a method for constructing a document structure information search library, including:
determining the domain subdivision item of each collected sample document;
for each determined domain subdivision item, extracting the segmented words of the sample documents of the domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
dividing the sample documents of the domain subdivision item according to preset document types, extracting the document structured information of each sample document according to a document structured information extraction strategy corresponding to the document type of the sample document, and generating a document structured information vector of each sample document according to the document structured information and the domain subdivision item keyword library;
reducing the dimension of the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document;
and constructing a document structure information retrieval library according to the domain subdivision item keyword library, preset domain subdivision item codes and the document structured information dimension reduction vector of the sample document.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the method further includes:
determining the domain subdivision item to be retrieved, and its domain subdivision item code, to which an input short document to be retrieved belongs;
acquiring the domain subdivision item keyword library corresponding to the domain subdivision item code to be retrieved;
extracting the document structured information of the short document to be retrieved, and generating a document structured information vector to be retrieved according to that document structured information and the acquired keyword library;
reducing the dimension of the document structured information vector to be retrieved to obtain a document structured information dimension reduction vector to be retrieved;
and searching the document structure information retrieval library according to the document structured information dimension reduction vector to be retrieved to obtain a retrieval result.
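The retrieval steps above can be sketched as a filter on the domain subdivision item code followed by a similarity ranking. This is only an illustrative sketch: the source does not name a similarity measure, so cosine similarity is an assumption, and the entry layout (`code`, `vector`, `title`), the code `"X-OTHER"` and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(library, domain_code, query_vector, top_k=3):
    """Rank entries stored under `domain_code` by similarity to the query."""
    candidates = [e for e in library if e["code"] == domain_code]
    scored = [(cosine(query_vector, e["vector"]), e["title"]) for e in candidates]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical library entries holding already reduced vectors.
library = [
    {"code": "TP391.111", "vector": [1.0, 0.0, 1.0], "title": "doc A"},
    {"code": "TP391.111", "vector": [0.0, 1.0, 0.0], "title": "doc B"},
    {"code": "X-OTHER", "vector": [1.0, 0.0, 1.0], "title": "doc C"},
]
results = search(library, "TP391.111", [1.0, 0.0, 0.9])
```

Restricting the candidates to one domain subdivision item code first keeps the comparison within a single keyword library, so the vector positions being compared refer to the same keywords.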
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:
if the similarity of a hit document in the retrieval result exceeds a preset similarity threshold, querying whether the short document to be retrieved is already stored in the storage area corresponding to the domain subdivision item to be retrieved, and if not, updating the information stored in the storage area according to the short document to be retrieved.
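A minimal sketch of this update rule follows, assuming the storage area is a simple per-code list of documents. The threshold value 0.8, the function name `maybe_store` and the storage layout are hypothetical; the source does not specify what information is updated beyond the storage area itself.

```python
def maybe_store(storage, domain_code, short_doc, best_similarity, threshold=0.8):
    """Add `short_doc` to the storage area of `domain_code` when the best hit
    exceeds the threshold and the document is not stored yet.
    Returns True if the document was added."""
    area = storage.setdefault(domain_code, [])
    if best_similarity > threshold and short_doc not in area:
        area.append(short_doc)
        return True
    return False

storage = {"TP391.111": ["existing text"]}
added_first = maybe_store(storage, "TP391.111", "new short text", 0.93)
added_again = maybe_store(storage, "TP391.111", "new short text", 0.93)
skipped = maybe_store(storage, "TP391.111", "unrelated text", 0.41)
```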
With reference to the first aspect, the first possible implementation manner of the first aspect, or the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where extracting the segmented words of the sample documents of the domain subdivision item and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words includes:
performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain segmented words;
for each segmented word, calculating the term frequency-inverse document frequency (TF-IDF) value of the segmented word;
sorting the segmented words according to their TF-IDF values;
vectorizing the top N segmented words in the sorted order to construct the domain subdivision item keyword library for the target domain subdivision item, where N is a preset natural number.
With reference to the first aspect, the first possible implementation manner of the first aspect, or the second possible implementation manner, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where generating a document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library includes:
vectorizing the document structured information of each category in the document structured information extraction strategy;
for each position of the domain subdivision item keyword library, if there is no vectorized document structured information at that position, setting the vector entry at that position to 0, to obtain the document structured information vector of the category;
and concatenating the document structured information vectors of all categories to obtain the document structured information vector of the sample document.
With reference to the fourth possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the document type includes: txt documents, doc/docx documents, xml/html documents, and pdf documents.
With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the categories include a first category, a second category and a third category, and the third category comprises whole-document keywords;
when the document type is a txt document, the first category comprises the number of paragraphs and the length of the document, and the second category comprises the keywords in the first M lines and the last M lines of the document;
the document type is doc/docx document, and the first category comprises: the second category comprises the starting position of the chart relative to the article;
when the document type is an xml/html document, the first category comprises title labels and the number and length of each level, and the second category comprises the starting position of key content labels relative to the document;
when the document type is a pdf document, the first category comprises titles, the number of levels and the length, and the second category comprises the starting position of charts relative to the document.
In a second aspect, an embodiment of the present invention further provides an apparatus for constructing a document structure information search library, where the apparatus includes:
the domain determination module is used for determining the domain subdivision item of each collected sample document;
the keyword library construction module is used for extracting, for each determined domain subdivision item, the segmented words of the sample documents of the domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
the structure vector generation module is used for dividing the sample documents of the domain subdivision item according to preset document types, extracting the document structured information of each sample document according to a document structured information extraction strategy corresponding to the document type of the sample document, and generating the document structured information vector of each sample document according to the document structured information and the domain subdivision item keyword library;
the dimension reduction module is used for reducing the dimension of the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
and the retrieval library construction module is used for constructing a document structure information retrieval library according to the domain subdivision item keyword library, the preset domain subdivision item codes and the document structured information dimension reduction vector of the sample document.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the above method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps of the method described above.
According to the method and the device for constructing the document structure information retrieval base, provided by the embodiment of the invention, the judgment of the field subdivision items is carried out on the collected sample documents; aiming at each judged field subdivision item, extracting the word segmentation words of the sample document of the field subdivision item, and constructing a vectorized field subdivision item keyword library based on the extracted word segmentation words; dividing the sample documents of the field subdivision items according to preset document types, extracting the document structured information of each sample document according to a document structured information extraction strategy corresponding to the document type of the sample document, and generating a document structured information vector of each sample document according to the document structured information and the field subdivision item keyword library; reducing the dimension of the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document; and constructing a document structure information retrieval library according to the domain detail item keyword library, preset domain detail item codes and the document structure information dimension reduction vector of the sample document. Therefore, the document structure information retrieval library is constructed by fusing the document structure information and the semantic keyword information and converting the document structure information and the semantic keyword information into vectors, and the precision of the constructed document structure information retrieval library for document retrieval can be effectively improved.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart illustrating a method for constructing a document structure information search library according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an apparatus for constructing a document structure information search library according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computer device 300 according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method and a device for constructing a document structure information search library, which are described by embodiments below.
Fig. 1 is a schematic flow chart illustrating a method for constructing a document structure information search library according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101, determining the domain subdivision item of each collected sample document;
In the embodiment of the present invention, as an optional embodiment, the domain subdivision item of a sample document may be determined through user interaction.
In the embodiment of the invention, determining the domain and the domain subdivision item of each sample document allows a more refined retrieval library to be constructed. As an optional embodiment, domains include, but are not limited to: literature, information processing, art, biological science, medicine, and the like. As another optional embodiment, each domain may include one or more levels of domain subdivision items. For example, for the domain of literary works, the first-level domain subdivision items include novel, song, etc.; for the first-level domain subdivision item novel, the corresponding second-level domain subdivision items include speech, swordsman, science fiction, etc.
In the embodiment of the present invention, as an optional embodiment, the domains may be divided according to the Chinese Library Classification, and a corresponding code is set for each domain or domain subdivision item, where the coding format may refer to the table of domain subdivision items and coding examples (Table 1).
In the embodiment of the present invention, as an optional embodiment, for a batch of sample documents, the domain subdivision item to which each sample document belongs is set interactively, the domain subdivision item being the last level of the domain. As another optional embodiment, the domain and the domain subdivision item to which a sample document belongs may also be determined by extracting keywords from the sample document and matching them against a preset domain keyword library and a preset subdivision item keyword library, respectively. Table 1 lists domains, domain subdivision items and their codes according to an embodiment of the present invention.
TABLE 1
[Table 1 appears as an image in the original publication (Figure BDA0003132306820000071, Figure BDA0003132306820000081).]
In Table 1, industrial technology is a domain, automation technology and computer technology is a first-level domain subdivision item, information processing is a second-level domain subdivision item, and text information processing is a third-level domain subdivision item. In the embodiment of the invention, text information processing is the last level of the domain of industrial technology.
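The keyword-matching alternative for determining the domain subdivision item can be sketched as picking the code whose preset keyword library overlaps most with the keywords of the document. This is a single-stage simplification under stated assumptions: the source matches against a domain library and a subdivision item library separately, and the overlap count, the function name `judge_domain` and the code `"X-EXAMPLE"` are hypothetical.

```python
def judge_domain(doc_keywords, domain_keyword_libraries):
    """Return the code whose preset keyword set overlaps most with the
    document keywords, or None when nothing overlaps."""
    doc_set = set(doc_keywords)
    best_code, best_overlap = None, 0
    for code, keywords in domain_keyword_libraries.items():
        overlap = len(doc_set & set(keywords))
        if overlap > best_overlap:
            best_code, best_overlap = code, overlap
    return best_code

# Hypothetical preset keyword libraries keyed by domain subdivision item code.
libraries = {
    "TP391.111": ["natural language", "word segmentation", "algorithm"],
    "X-EXAMPLE": ["novel", "poetry"],
}
code = judge_domain(["algorithm", "natural language", "corpus"], libraries)
no_match = judge_domain(["weather"], libraries)
```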
Step 102, for each determined domain subdivision item, extracting the segmented words of the sample documents of the domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
In the embodiment of the present invention, as an optional embodiment, extracting the segmented words of the sample documents of the domain subdivision item and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words includes:
A11, performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain segmented words;
A12, calculating the term frequency-inverse document frequency (TF-IDF) value of each segmented word;
A13, sorting the segmented words according to their TF-IDF values;
A14, vectorizing the top N segmented words in the sorted order to construct the domain subdivision item keyword library for the target domain subdivision item, where N is a preset natural number.
In the embodiment of the invention, for each domain subdivision item contained in a domain, all sample documents of the batch belonging to that domain subdivision item are obtained and segmented into words. The TF-IDF (Term Frequency-Inverse Document Frequency) value of each segmented word is calculated, the segmented words are sorted by TF-IDF value, and the first N segmented words are taken as the full keyword set. The full keyword set is then vectorized to obtain the domain subdivision item keyword library of the domain subdivision item to which the batch of sample documents belongs, which completes the construction of the domain subdivision item keyword library.
In the embodiment of the present invention, as an optional embodiment, the domain detail term keyword library is represented by a full amount keyword vector, wherein a vector dimension is the number of the full amount keywords. As an alternative embodiment, the vector dimension takes on values of 512, 1024, 2048, etc., and 2048 is used by default.
In the embodiment of the present invention, taking the domain subdivision item text information processing as an example, the corresponding code is TP391.111. For the batch of sample documents of this domain subdivision item, assume that the full keyword set obtained according to the TF-IDF values is as follows:
[natural language, processing, algorithm, ...].
Vectorizing the full keyword set gives:
[1, 1, 1, ...].
The constructed domain subdivision item keyword library corresponding to text information processing is as follows:
{TP391.111, [1, 1, 1, ..., 1]}.
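Steps A11 to A14 can be sketched as follows, assuming the documents have already been segmented into words. The source does not say how per-document TF-IDF values are combined across a batch, so taking the maximum per term is an assumption, as are the `+ 1.0` smoothing of the IDF term, the function name `top_keywords` and the dictionary layout of the library.

```python
import math
from collections import Counter

def top_keywords(tokenized_docs, n):
    """Rank terms by their best TF-IDF value in any document; keep the top n."""
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    best = {}
    for doc in tokenized_docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            # "+ 1.0" keeps terms that occur in every document above zero.
            idf = math.log(num_docs / doc_freq[term]) + 1.0
            best[term] = max(best.get(term, 0.0), tf * idf)
    # Sort by descending score; ties are broken alphabetically.
    return sorted(best, key=lambda t: (-best[t], t))[:n]

# Hypothetical pre-segmented sample documents of one domain subdivision item.
docs = [
    ["natural language", "processing", "algorithm", "natural language"],
    ["natural language", "processing"],
    ["algorithm", "index"],
]
keyword_library = {"code": "TP391.111", "keywords": top_keywords(docs, 3)}
```

The position of each keyword in `keyword_library["keywords"]` fixes the meaning of the corresponding vector position in later steps.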
Step 103, dividing the sample documents of the domain subdivision item according to preset document types, extracting the document structured information of each sample document according to a document structured information extraction strategy corresponding to the document type of the sample document, and generating a document structured information vector of each sample document according to the document structured information and the domain subdivision item keyword library;
In this embodiment of the present invention, as an optional embodiment, the document types include: text (txt) documents, doc/docx documents, web page (xml/html) documents, and pdf documents. Each document type corresponds to a document structured information extraction strategy for extracting three categories of document structured information for that document type, and each category of document structured information corresponds to a document structured information vector, as shown in Table 2.
TABLE 2 document type to document structured information vector lookup Table
[Table 2 appears as an image in the original publication (Figure BDA0003132306820000091, Figure BDA0003132306820000101).]
In Table 2, the dimension of each document structured information vector is the same as the dimension N of the full keyword vector in the domain subdivision item keyword library. During extraction, vectors with fewer than N entries are zero-padded, and entries beyond the N-th are discarded. The whole-document keywords are determined by ranking the TF-IDF values of the segmented words of the current document.
In this embodiment, as an optional embodiment, generating a document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library includes:
A21, vectorizing the document structured information of each category in the document structured information extraction strategy;
A22, for each position of the domain subdivision item keyword library, if there is no vectorized document structured information at that position, setting the vector entry at that position to 0, to obtain the document structured information vector of the category;
A23, concatenating the document structured information vectors of all the categories to obtain the document structured information vector of the sample document.
In the embodiment of the invention, take as an example the domain subdivision item text information processing (code TP391.111), with a constructed domain subdivision item keyword library whose full keyword dimension N is 2048. Assuming that the sample document is a txt document, the extracted document structured information of some categories is:
- number of paragraphs: 20;
- text length: 400;
- the content of the first M lines and the last M lines of the document.
Take the whole-document keywords and the keywords in the first and last M lines as examples of document structured information. For the whole-document keywords, the sample document is segmented into words, the TF-IDF value of each segmented word is calculated with the TF-IDF algorithm, and the words are ranked. A whole-document keyword vector is then generated from the ranked keywords and the domain subdivision item keyword library. For the keywords in the first and last M lines, word segmentation is performed on the 6 lines of content in total (the first 3 lines and the last 3 lines), the TF-IDF value of each segmented word is calculated and ranked, and a keyword vector is generated from the ranked segmented words and the domain subdivision item keyword library.
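Extraction of the txt structural information described above can be sketched as follows. The sketch assumes that paragraphs are separated by blank lines and that length is counted in characters; neither is stated in the source, and the function name `extract_txt_structure` and the result keys are hypothetical.

```python
def extract_txt_structure(text, m=3):
    """Return paragraph count, character length, and the first and last m
    non-empty lines (without overlap when the document is short)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    head = lines[:m]
    # Start the tail after the head so no line is taken twice.
    tail = lines[max(m, len(lines) - m):]
    return {
        "paragraph_count": len(paragraphs),
        "length": len(text),
        "edge_lines": head + tail,
    }

sample = "a\nb\n\nc\nd\ne\n\nf\ng"
info = extract_txt_structure(sample, m=2)
```

The lines in `edge_lines` would then be segmented and ranked by TF-IDF to obtain the second-category keywords.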
The document structured information vectors of each category are assumed as follows:
N-dimensional vector, category 1: [20, 400, ..., 0];
N-dimensional vector, category 2: [1, 1, ..., 0];
N-dimensional vector, category 3: [1, 1, ..., 0].
The document structured information vector of the sample document is:
[(20, 400, ..., 0), (1, 1, ..., 0), (1, 1, ..., 0)].
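Steps A21 to A23 and the zero-padding rule can be sketched as follows, with a 4-entry keyword library standing in for N = 2048. The encoding of category 1 as raw numeric values followed by zeros mirrors the example vectors above; the function names and the 0/1 keyword encoding are assumptions.

```python
def keyword_vector(keywords, keyword_library):
    """Mark each library position with 1 if its keyword occurs, else 0."""
    present = set(keywords)
    return [1 if kw in present else 0 for kw in keyword_library]

def pad_or_truncate(values, n):
    """Zero-pad short vectors to n entries and drop entries beyond n."""
    return (list(values) + [0] * n)[:n]

def structure_vector(stats, edge_keywords, full_keywords, keyword_library):
    """Concatenate the three category vectors into one 3N-dimensional vector."""
    n = len(keyword_library)
    category1 = pad_or_truncate(stats, n)                       # e.g. paragraphs, length
    category2 = keyword_vector(edge_keywords, keyword_library)  # first/last M lines
    category3 = keyword_vector(full_keywords, keyword_library)  # whole document
    return category1 + category2 + category3

# Hypothetical 4-keyword library standing in for the N = 2048 case.
library = ["natural language", "processing", "algorithm", "index"]
vector = structure_vector(
    [20, 400],
    ["natural language", "processing"],
    ["natural language", "processing", "algorithm"],
    library,
)
```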
In the embodiment of the present invention, as an optional embodiment, the parameter M is set to 3. As another optional embodiment, to reduce the amount of computation, M is set to a value not exceeding 1/2 of the total number of lines of the sample document.
In the embodiment of the invention, a sample document already contains a large number of keywords, and an excessively large value of M does not effectively increase the number of keywords.
Step 104, reducing the dimension of the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document;
In the embodiment of the invention, for the sample documents of each document type, after the document structured information vectors of the 3 categories are obtained, they are concatenated into an N × 3 dimensional vector (the document structured information vector). A dimension reduction algorithm, for example Principal Component Analysis (PCA), then reduces the document structured information vector of the sample document to an N × 1 dimensional vector (the document structured information dimension reduction vector of the sample document), so that each sample document can be converted into corresponding fields and a vector in Elasticsearch.
As an alternative embodiment, the document structured information dimensionality reduction vector of the sample document is:
[-297.24811933,148.62405966,148.62405966,...,0]
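The dimension reduction step can be illustrated with a small pure-Python PCA. This is a sketch under stated assumptions rather than the method of the source: it fits PCA on a batch of sample vectors (the source does not say which vectors PCA is fitted on), finds components by power iteration with orthogonalization, and uses tiny dimensions instead of reducing 3N to N. The function name `pca_reduce` is hypothetical, and a numerical library would normally be used in practice.

```python
import math
import random

def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 1e-12 else None

def _orthogonalize(v, basis):
    # Remove from v its projection onto each (unit-length) basis vector.
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    return v

def pca_reduce(rows, k, iters=200, seed=0):
    """Project rows onto their top-k principal components (k <= dimension)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        start = [rng.uniform(-1, 1) for _ in range(d)]
        v = _normalize(_orthogonalize(start, components))
        for _ in range(iters):
            w = _normalize(_orthogonalize(_matvec(cov, v), components))
            if w is None:  # no variance left; keep the current direction
                break
            v = w
        components.append(v)
    return [[sum(x * y for x, y in zip(c, comp)) for comp in components]
            for c in centered]

# Hypothetical 3-dimensional vectors lying on one line, reduced to 2 dimensions.
rows = [[0, 0, 0], [1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
reduced = pca_reduce(rows, 2)
```

Because the example points lie on a single line, the first reduced coordinate carries all of the variance and the second is numerically zero.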
Step 105, constructing a document structure information retrieval library according to the domain subdivision item keyword library, preset domain subdivision item codes and the document structured information dimension reduction vector of the sample document.
In the embodiment of the invention, information integration and association are carried out based on the domain subdivision item keyword library, the domain subdivision item codes and the document structured information dimension reduction vectors of the sample documents, and a document structure information retrieval library is constructed. For example, the document structure information retrieval library is constructed level by level according to the domain subdivision item keyword library, the domain subdivision item codes and the document structured information dimension reduction vectors of the sample documents.
In the embodiment of the present invention, as an optional embodiment, the document structure information retrieval library is stored in Elasticsearch, and, by way of example, the storage format is as follows:
The mapping established for the document index in Elasticsearch is as follows (text after ## is an annotation):
"mappings":{"properties":{
"title":{ ## document title name
"type":"text",
"analyzer":"ik_max_word", ## word segmentation mode used for indexing
"search_analyzer":"ik_smart" ## word segmentation mode used for queries
},
"document_vector":{ ## document vector
"type":"dense_vector",
"dims":512 ## vector dimension, for example 512
},
"field_coding":{ ## document field coding
"type":"keyword"
},
...
}
}
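As a hedged illustration, the mapping above can be assembled as a plain Python dictionary and serialized to JSON before being sent to Elasticsearch. The function name `build_index_mapping` and its `vector_dims` parameter are hypothetical; the `ik_max_word` and `ik_smart` analyzers are assumed to be available through the IK analysis plugin, which must be installed separately.

```python
import json


def build_index_mapping(vector_dims=512):
    """Return an index mapping with the three fields named in the text."""
    return {
        "mappings": {
            "properties": {
                # Document title, segmented with the IK analyzers (plugin assumed).
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                },
                # Dimension-reduced document structured information vector.
                "document_vector": {"type": "dense_vector", "dims": vector_dims},
                # Domain subdivision item code, stored as an exact-match keyword.
                "field_coding": {"type": "keyword"},
            }
        }
    }


# Serialized request body for index creation.
mapping_json = json.dumps(build_index_mapping(), indent=2)
```

The `dims` value has to match the length of every stored vector, so it would normally be set to the N used when building the vectors rather than left at the illustrative 512.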
In the embodiment of the present invention, taking the domain subdivision item "text information processing" as an example, the batch sample document collection it contains is as follows:
1. natural language war.
2. Html, natural language;
3. natural language processing and algorithm.txt
Wherein N is 2048, and M is 3;
For the batch sample document collection, TF-IDF values are calculated after Chinese word segmentation, and the domain subdivision item keyword library is constructed:
{TP391.111, [natural language, processing, algorithm, ...]}
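The keyword library construction can be sketched as follows, assuming the documents have already been segmented into words (Chinese word segmentation itself needs a separate tool and is not shown). The function name `build_keyword_library` is hypothetical, and the smoothed IDF formula is one common variant chosen for illustration, not necessarily the one used in the embodiment.

```python
import math
from collections import Counter


def build_keyword_library(tokenized_docs, top_n):
    """Rank terms by summed TF-IDF over all documents and keep the top N."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    scores = Counter()
    for tokens in tokenized_docs:
        if not tokens:
            continue
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            tf = count / len(tokens)
            # Smoothed IDF (an assumption; other variants are possible).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            scores[term] += tf * idf
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


# Usage with pre-segmented sample documents (illustrative tokens only).
docs = [
    ["natural language", "processing", "algorithm"],
    ["natural language", "processing"],
    ["natural language"],
]
library = {"TP391.111": build_keyword_library(docs, top_n=3)}
```

The position of each keyword in the returned list is what later gives every vector component a fixed meaning, so the list order should be kept stable once the library is built.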
For the sample document "natural language processing and algorithm.txt", the extracted document structured information is:
number of paragraphs: 20;
text length: 400;
the keywords generated with the TF-IDF algorithm from the first 3 lines and the last 3 lines of the document (6 lines in total) are as follows:
[natural language, processing, Chinese, …]
The full set of keywords generated with the TF-IDF algorithm is as follows:
[natural language, processing, ...]
The document structured information vectors corresponding to the categories are constructed as follows:
N-dimensional vector, category 1: [20,400,...,0];
N-dimensional vector, category 2: [1,1,...,0];
N-dimensional vector, category 3: [1,1,...,0].
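A minimal sketch of how the three category vectors might be built and spliced is given below. It assumes that category 1 holds the numeric structure values padded with zeros to length N, and that categories 2 and 3 mark with 1 each position of the domain keyword library whose keyword was found; the text does not spell out this encoding, so both the encoding and the function name `build_structure_vector` are assumptions.

```python
def build_structure_vector(structure_values, edge_keywords, full_keywords,
                           keyword_library):
    """Build the three N-dimensional category vectors and concatenate them.

    structure_values: numeric structure features, e.g. [paragraphs, length].
    edge_keywords:    keywords from the first and last M lines.
    full_keywords:    keywords from the whole document.
    keyword_library:  ordered domain keyword list; its length defines N.
    """
    n = len(keyword_library)
    # Category 1: structure values, zero-padded (or truncated) to length N.
    category1 = [float(v) for v in structure_values[:n]]
    category1 += [0.0] * (n - len(category1))
    # Categories 2 and 3: 1 where the library keyword is present, else 0.
    edge, full = set(edge_keywords), set(full_keywords)
    category2 = [1.0 if kw in edge else 0.0 for kw in keyword_library]
    category3 = [1.0 if kw in full else 0.0 for kw in keyword_library]
    return category1 + category2 + category3


# Usage with a tiny 4-keyword library (N = 4 instead of 2048).
vector = build_structure_vector(
    [20, 400],
    ["natural language", "processing"],
    ["natural language", "processing"],
    ["natural language", "processing", "algorithm", "Chinese"],
)
```

Keywords that are not in the domain library are simply ignored here, which keeps every document of the same domain subdivision item in the same vector space.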
After dimensionality reduction using PCA, the document structured information dimension reduction vector generated for the sample document (natural language processing and algorithm.txt) is as follows:
[-297.24811933,148.62405966,148.62405966,...,0]
The information stored in the document structure information retrieval library for the sample document "natural language processing and algorithm.txt" is then as follows:
{
"title":"natural language processing and algorithm.txt",
"document_vector":[-297.24811933,148.62405966,148.62405966,...,0],
"field_coding":"TP391.111"
}
In this embodiment of the present invention, as an optional embodiment, the method further includes:
B11, determining the domain subdivision item to be retrieved, and the corresponding domain subdivision item code to be retrieved, to which the input short document to be retrieved belongs;
In the embodiment of the invention, after the input short document to be retrieved is obtained, the domain subdivision item to be retrieved is determined through interactive operation, and the corresponding domain subdivision item code to be retrieved is obtained from the determined item.
In the embodiment of the invention, the domain subdivision item to be retrieved can be selected through interactive operation, for example by using a drop-down list and table lookup.
In the embodiment of the present invention, the short document to be retrieved may be a specific document. For example, the input document may be "natural language.txt"; through interactive operation, the domain subdivision item to be retrieved of the document is determined to be text information processing, and the corresponding domain subdivision item code to be retrieved is TP391.111.
B12, acquiring a domain keyword library to be retrieved corresponding to the domain subdivision item code to be retrieved;
In the embodiment of the invention, the obtained domain keyword library to be retrieved is as follows: [natural language, algorithm, process, …].
B13, extracting the structured information of the document to be retrieved from the short document to be retrieved, and generating a document structured information vector to be retrieved according to the structured information of the document to be retrieved and the domain keyword library to be retrieved;
In the embodiment of the invention, the structured information of the document to be retrieved that is extracted from the short document to be retrieved is as follows:
number of paragraphs: 10;
text length: 100;
the content of the first 3 lines and the last 3 lines of the document (6 lines in total).
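The extraction of these three items from a txt document can be sketched as below. The sketch assumes that paragraphs are separated by blank lines, that text length means the number of characters, and that M is the number of lines taken from each end; the function name `extract_txt_structure` and the returned field names are hypothetical.

```python
def extract_txt_structure(text, m=3):
    """Extract paragraph count, text length, and the first/last M lines."""
    # Non-empty lines, in document order.
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: a blank line separates two paragraphs.
    paragraphs = [part for part in text.split("\n\n") if part.strip()]
    # For short documents, use every line once instead of overlapping slices.
    if len(lines) > 2 * m:
        edge_lines = lines[:m] + lines[-m:]
    else:
        edge_lines = lines
    return {
        "paragraph_count": len(paragraphs),
        "text_length": len(text),
        "edge_lines": edge_lines,
    }


# Usage on a small in-memory document.
sample_text = "a\nb\n\nc\nd\n\ne\nf\ng\nh"
info = extract_txt_structure(sample_text, m=3)
```

The lines in `edge_lines` would then be segmented and ranked by TF-IDF to obtain the category 2 keywords.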
In the embodiment of the present invention, as an optional embodiment, for the full-keyword category, word segmentation is performed on the short document to be retrieved, the resulting words are ranked with the TF-IDF algorithm, and the top N words are extracted as the full set of keywords, listed as follows:
[natural language, algorithm, process, …].
For the first 3 lines and the last 3 lines of the document (6 lines in total), the keyword list generated with the TF-IDF algorithm is as follows:
[natural language, processing, Chinese, …]
The generated document structured information vector to be retrieved is as follows:
category 1, N-dimensional vector: [10,100,...,0];
category 2, N-dimensional vector: [1,1,...,0];
category 3, N-dimensional vector: [1,0,...,0].
After PCA dimensionality reduction is carried out, the following document vector is generated:
[-297.2,148.6,148.6,...,0].
B14, performing dimensionality reduction processing on the document structured information vector to be retrieved to obtain a document structured information dimension reduction vector to be retrieved;
B15, retrieving in the document structure information retrieval library according to the document structured information dimension reduction vector to be retrieved to obtain the retrieval result.
In the embodiment of the invention, Elasticsearch is used to perform a vectorized search on the heterogeneous document structure information retrieval library, and retrieval is carried out according to the similarity between the document structured information dimension reduction vector to be retrieved and the document structured information dimension reduction vector of each stored document.
In the embodiment of the invention, Elasticsearch provides a cosineSimilarity function in its built-in scripting language, so that all documents in the document structure information retrieval library can be ranked by their similarity to the document structured information dimension reduction vector to be retrieved, and the library can thereby be searched. As an alternative embodiment, the program code segment for retrieval is as follows:
With the document vector query_vector = [1,0,0,...,0], a sample Elasticsearch query is:
{
"script_score":{
"query":{"match_all":{}},
"script":{
"source":"cosineSimilarity(params.query_vector,'document_vector')+1.0",
"params":{"query_vector":query_vector} # vector to be queried
}
}
}
In the embodiment of the present invention, a sample program code segment for performing a vectorized search with Elasticsearch is as follows:
{
"script_score":{
"query":{"match_all":{}},
"script":{
"source":"cosineSimilarity(params.query_vector,'document_vector')+1.0",
"params":{"query_vector":[-297.2,148.6,148.6,...,0]}
}
}
}
Retrieval results:
{
"title":"natural language processing and algorithm.txt",
"document_vector":[-297.24811933,148.62405966,148.62405966,...,0],
"field_coding":"TP391.111",
"score":99
}
{
"title":"...",
"document_vector":[...],
"field_coding":"TP391.111",
"score":97
}
In the embodiment of the present invention, as an optional embodiment, the retrieval result includes the titles, domain subdivision items, and the like of similar documents.
In the embodiment of the invention, the returned retrieval result is the X records in Elasticsearch with the highest similarity scores, each containing the document title.
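The retrieval step can be sketched in two parts: a request body that wraps the `script_score` clause shown above under the top-level `query` key that the Elasticsearch search API expects, and a local function that reproduces the same scoring formula for checking results offline. The function names `build_script_score_query` and `cosine_score` are hypothetical, and the `size` parameter is used here as a stand-in for the X mentioned in the text.

```python
import math


def build_script_score_query(query_vector, size=10):
    """Build a search request body that ranks documents by cosine similarity."""
    return {
        "size": size,  # number of top-scoring records to return
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, "
                              "'document_vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


def cosine_score(query_vector, document_vector):
    """Compute cosine similarity plus 1.0, mirroring the script above."""
    dot = sum(q * d for q, d in zip(query_vector, document_vector))
    norm_q = math.sqrt(sum(q * q for q in query_vector))
    norm_d = math.sqrt(sum(d * d for d in document_vector))
    if norm_q == 0.0 or norm_d == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_d) + 1.0


# Usage: build the request body for a query vector.
body = build_script_score_query([1, 0, 0], size=5)
```

The added 1.0 keeps the script score non-negative, so scores computed with this formula lie between 0 and 2.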
In this embodiment of the present invention, as an optional embodiment, the method further includes:
if the similarity of a hit document in the retrieval result exceeds a preset similarity threshold, querying whether the short document to be retrieved is already stored in the storage area corresponding to the domain subdivision item to be retrieved, and if not, updating the information stored in the storage area according to the short document to be retrieved.
In the embodiment of the invention, if the retrieval produces a hit and the similarity of the hit document exceeds the preset similarity threshold, the short document to be retrieved can be treated as part of the documents of the domain subdivision item and used to supplement the domain subdivision item keyword library and the document structure information retrieval library. For example, in the above example, if the score of the first item in the retrieval result exceeds a preset similarity threshold, for example 98, this indicates that "natural language.txt" and "natural language processing and algorithm.txt" are similar and both belong to the text information processing domain subdivision item, so the domain subdivision item keyword library and the document structure information retrieval library can be updated. Taking the keyword vector in the document structured information vector as an example, word segmentation is performed on the first 3 lines and the last 3 lines of the short document to be retrieved, the resulting words are ranked together with the keywords in the keyword vector by TF-IDF value, and the keyword vector is updated according to the ranking. In this way, the content of the domain subdivision item is enriched and evolves over time.
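The update condition can be sketched as follows, under the assumption that the storage area is modelled as a dictionary keyed by document title and that each hit carries a numeric `score` field. The function name `update_store_if_similar` is hypothetical, and re-ranking of the keyword library after an update is not shown.

```python
def update_store_if_similar(hits, threshold, store, query_record):
    """Add the query document to the store when a hit is similar enough.

    hits:         list of result records, each with a numeric "score".
    threshold:    preset similarity threshold, e.g. 98 in the example.
    store:        dict mapping document title to stored record (assumed layout).
    query_record: record of the short document to be retrieved.
    Returns True if the store was updated.
    """
    if not hits:
        return False
    best_score = max(hit["score"] for hit in hits)
    if best_score <= threshold:
        return False  # no hit exceeds the threshold
    title = query_record["title"]
    if title in store:
        return False  # the document is already stored
    store[title] = query_record
    return True


# Usage with the scores from the example retrieval result.
store = {"natural language processing and algorithm.txt": {"field_coding": "TP391.111"}}
hits = [{"score": 99}, {"score": 97}]
query = {"title": "natural language.txt", "field_coding": "TP391.111"}
updated = update_store_if_similar(hits, 98, store, query)
```

Checking for an existing entry before writing keeps repeated queries with the same document from inflating the library.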
In the embodiment of the invention, the document structured information and the semantic keyword information are fused and converted into vectors, so that the constructed document structure information retrieval library is based on document structured information. Retrieval models for multiple document types can thus be built quickly for various industrial enterprises, document retrieval under different domain subdivision items is supported, and an enterprise informatization system can retrieve heterogeneous documents efficiently and with high precision through its search function. Furthermore, introducing an interactive technical scheme improves the accuracy of document retrieval, enriches the vocabulary of the domain subdivision item keyword library, and improves the document retrieval capability for each domain subdivision item. Moreover, based on the interactive domain subdivision item keyword library and the heterogeneous document structured information, retrieval accuracy can be improved for documents of different domain subdivision items.
Fig. 2 is a schematic structural diagram of an apparatus for constructing a document structure information search library according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:
a domain judging module 201, configured to perform domain subdivision item judgment on the collected sample documents;
In the embodiment of the present invention, as an optional embodiment, the domain subdivision item judgment of the sample documents may be performed based on user interactive operation. Domains may be divided according to the Chinese Library Classification, and a corresponding code is set for each domain or domain subdivision item; for the coding format, refer to the table of domain subdivision items and coding samples.
a word bank building module 202, configured to, for each determined domain subdivision item, extract the word-segmentation words of the sample documents of the domain subdivision item, and build a vectorized domain subdivision item keyword library based on the extracted word-segmentation words;
In the embodiment of the invention, for each domain subdivision item contained in a domain, all batch sample documents contained in the domain subdivision item are obtained, word segmentation is performed on them, the TF-IDF value of each word-segmentation word is calculated, the words are ranked by TF-IDF value, the top N words are taken as the full set of keywords, and the full set of keywords is vectorized to obtain the domain subdivision item keyword library.
a structure vector generation module 203, configured to divide the sample documents of the domain subdivision items according to preset document types, extract, for each sample document, the document structured information of the sample document according to the document structured information extraction policy corresponding to its document type, and generate the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
a dimension reduction module 204, configured to perform dimension reduction on the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
a retrieval library construction module 205, configured to construct the document structure information retrieval library according to the domain subdivision item keyword library, the preset domain subdivision item codes and the document structured information dimension reduction vectors of the sample documents.
In the embodiment of the invention, information integration and association are carried out based on the domain subdivision item keyword library, the domain subdivision item codes and the document structured information dimension reduction vector of the sample document, and a document structure information retrieval library is constructed.
In this embodiment of the present invention, as an optional embodiment, the apparatus further includes:
a retrieval module (not shown in the figure), configured to: determine the domain subdivision item to be retrieved, and the corresponding domain subdivision item code to be retrieved, to which the input short document to be retrieved belongs;
acquire a domain keyword library to be retrieved corresponding to the domain subdivision item code to be retrieved;
extract the structured information of the document to be retrieved from the short document to be retrieved, and generate a document structured information vector to be retrieved according to the structured information of the document to be retrieved and the domain keyword library to be retrieved;
perform dimensionality reduction processing on the document structured information vector to be retrieved to obtain a document structured information dimension reduction vector to be retrieved;
and retrieve in the document structure information retrieval library according to the document structured information dimension reduction vector to be retrieved to obtain a retrieval result.
In this embodiment of the present invention, as an optional embodiment, the apparatus further includes:
an updating module (not shown in the figure), configured to, if the similarity of a hit document in the retrieval result exceeds a preset similarity threshold, query whether the short document to be retrieved is already stored in the storage area corresponding to the domain subdivision item to be retrieved, and if not, update the information stored in the storage area according to the short document to be retrieved.
In the embodiment of the present invention, as an optional embodiment, the lexicon constructing module 202 is specifically configured to:
perform Chinese word segmentation on each sample document of the target domain subdivision item to obtain word-segmentation words;
for each word-segmentation word, calculate the term frequency-inverse document frequency (TF-IDF) value of the word;
rank the word-segmentation words according to their TF-IDF values;
vectorize the top N word-segmentation words in the ranking to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
In this embodiment of the present invention, as an optional embodiment, the structure vector generation module 203 is specifically configured to:
vectorizing the document structured information of each category in the document structured information extraction strategy;
in the domain subdivision item keyword library, if the corresponding position has no vectorized document structured information, setting the vector of the position as 0 to obtain the document structured information vector of the category;
and splicing the document structured information vectors of all categories to obtain the document structured information vector of the sample document.
In this embodiment of the present invention, as an optional embodiment, the document types include: txt documents, doc/docx documents, xml/html documents, and pdf documents.
In the embodiment of the present invention, as an optional embodiment, the categories include a first category, a second category and a third category, wherein the third category comprises the full set of keywords;
when the document type is a txt document, the first category comprises the number of paragraphs and the length of the document, and the second category comprises the keywords in the first M lines and the last M lines of the document;
the document type is doc/docx document, and the first category comprises: the second category comprises the starting position of the chart relative to the article;
the document type is an xml/html document, and the first category comprises: title labels, the number and the length of each level, wherein the second category comprises the starting position of the key content labels relative to the document;
the document type is a pdf document, and the first category comprises: title, number of stages, length, and the second category includes the starting position of the chart relative to the article.
As shown in fig. 3, an embodiment of the present application provides a computer device 300 for executing the method for constructing a document structure information search library in fig. 1, the device includes a memory 301, a processor 302 and a computer program stored on the memory 301 and operable on the processor 302, wherein the processor 302 implements the steps of the method for constructing a document structure information search library when executing the computer program.
Specifically, the memory 301 and the processor 302 can be general-purpose memories and processors, and are not specifically limited herein, and the method for constructing the document structure information search library can be performed when the processor 302 runs a computer program stored in the memory 301.
Corresponding to the method for constructing the document structure information search library in fig. 1, the embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to perform the steps of the method for constructing the document structure information search library.
In particular, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when executed, the computer program on the storage medium can execute the above method for constructing the document structure information search library.
In the embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and there may be other divisions in actual implementation, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of systems or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for constructing a document structure information search library is characterized by comprising the following steps:
performing domain subdivision item judgment on the collected sample documents;
aiming at each judged field subdivision item, extracting the word segmentation words of the sample document of the field subdivision item, and constructing a vectorized field subdivision item keyword library based on the extracted word segmentation words;
dividing the sample documents of the field subdivision items according to preset document types, extracting the document structured information of each sample document according to a document structured information extraction strategy corresponding to the document type of the sample document, and generating a document structured information vector of each sample document according to the document structured information and the field subdivision item keyword library;
reducing the dimension of the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document;
and constructing a document structure information retrieval library according to the domain detail item keyword library, preset domain detail item codes and the document structure information dimension reduction vector of the sample document.
2. The method of claim 1, further comprising:
determining a field subdivision item to be retrieved and a field subdivision item code to be retrieved to which an input short document to be retrieved belongs;
acquiring a domain keyword library to be retrieved corresponding to the domain subdivision item codes to be retrieved;
extracting the structured information of the document to be retrieved in the short document to be retrieved, and generating a structured information vector of the document to be retrieved according to the structured information of the document to be retrieved and a keyword library of the field to be retrieved;
carrying out dimensionality reduction processing on the document structured information vector to be retrieved to obtain a document structured information dimensionality reduction vector to be retrieved;
and searching in the document structure information search library according to the dimension reduction vector of the document structure information to be searched to obtain a search result.
3. The method of claim 2, further comprising:
and if the similarity of the hit documents in the retrieval result exceeds a preset similarity threshold, inquiring whether the short document to be retrieved is stored in a storage area corresponding to the field subdivision item to be retrieved, and if not, updating information stored in the storage area according to the short document to be retrieved.
4. The method of any one of claims 1 to 3, wherein the extracting the segmented words of the sample documents of the domain segmentation, and constructing a vectorized domain segmentation keyword library based on the extracted segmented words, comprises:
performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain word-segmentation words;
for each word-segmentation word, calculating the term frequency-inverse document frequency (TF-IDF) value of the word;
ranking the word-segmentation words according to their TF-IDF values;
vectorizing the top N word-segmentation words in the ranking to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
5. The method of any one of claims 1 to 3, wherein generating the document structured information vector of the sample document according to the document structured information and the domain segmentation term keyword library comprises:
vectorizing the document structured information of each category in the document structured information extraction strategy;
in the domain subdivision item keyword library, if the corresponding position has no vectorized document structured information, setting the vector of the position as 0 to obtain the document structured information vector of the category;
and splicing the document structured information vectors of all categories to obtain the document structured information vector of the sample document.
6. The method of claim 5, wherein the document types include: txt documents, doc/docx documents, xml/html documents, and pdf documents.
7. The method of claim 6, wherein the categories comprise a first category, a second category and a third category, the third category comprising the full set of keywords;
when the document type is a txt document, the first category comprises the number of paragraphs and the length of the document, and the second category comprises the keywords in the first M lines and the last M lines of the document;
the document type is doc/docx document, and the first category comprises: the second category comprises the starting position of the chart relative to the article;
the document type is an xml/html document, and the first category comprises: title labels, the number and the length of each level, wherein the second category comprises the starting position of the key content labels relative to the document;
the document type is a pdf document, and the first category comprises: title, number of stages, length, and the second category includes the starting position of the chart relative to the article.
8. An apparatus for constructing a document structure information search library, comprising:
the domain judging module is used for judging domain subdivision items of the collected sample documents;
the word stock construction module is used for extracting the word segmentation words of the sample document of the field subdivision items aiming at each judged field subdivision item, and constructing a vectorized field subdivision item keyword stock based on the extracted word segmentation words;
the structure vector generation module is used for dividing the sample documents of the field subdivision items according to preset document types, extracting the document structural information of each sample document according to a document structural information extraction strategy corresponding to the document type of the sample document, and generating the document structural information vector of each sample document according to the document structural information and the field subdivision item keyword library;
the dimension reduction module is used for reducing the dimension of the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
and the retrieval base construction module is used for constructing a document structure information retrieval base according to the domain subdivision item keyword base, the preset domain subdivision item codes and the document structure information dimension reduction vector of the sample document.
9. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when a computer device is running, the machine-readable instructions when executed by the processor performing the steps of the method of constructing a document structure information retrieval library according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program for performing, when executed by a processor, the steps of the method of constructing a document structure information search library according to any one of claims 1 to 7.
CN202110708173.9A 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library Active CN113449063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708173.9A CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110708173.9A CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Publications (2)

Publication Number Publication Date
CN113449063A (en) 2021-09-28
CN113449063B (en) 2023-06-16

Family

ID=77812699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708173.9A Active CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Country Status (1)

Country Link
CN (1) CN113449063B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023236257A1 (en) * 2022-06-07 2023-12-14 来也科技(北京)有限公司 Document search platform, search method and apparatus, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020032693A1 (en) * 2000-09-13 2002-03-14 Jen-Diann Chiou Method and system of establishing electronic documents for storing, retrieving, categorizing and quickly linking via a network
WO2012119339A1 (en) * 2011-03-04 2012-09-13 中兴通讯股份有限公司 Retrieval method and apparatus
CN102890711A (en) * 2012-09-13 2013-01-23 中国人民解放军国防科学技术大学 Retrieval ordering method and system
WO2019153551A1 (en) * 2018-02-12 2019-08-15 平安科技(深圳)有限公司 Article classification method and apparatus, computer device and storage medium
CN111460090A (en) * 2020-03-04 2020-07-28 深圳壹账通智能科技有限公司 Vector-based document retrieval method and device, computer equipment and storage medium
CN112883165A (en) * 2021-03-16 2021-06-01 山东亿云信息技术有限公司 Intelligent full-text retrieval method and system based on semantic understanding

Also Published As

Publication number Publication date
CN113449063B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
TWI536181B (en) Language identification in multilingual text
US20100205198A1 (en) Search query disambiguation
CN110741376B (en) Automatic document analysis for different natural languages
CN107844493B (en) File association method and system
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN110765761A (en) Contract sensitive word checking method and device based on artificial intelligence and storage medium
CN111209372B (en) Keyword determination method and device, electronic equipment and storage medium
CN106844482B (en) Search engine-based retrieval information matching method and device
CN113094519B (en) Method and device for searching based on document
CN101088082A (en) Full text query and search systems and methods of use
Krishnan et al. Bringing semantics in word image retrieval
CN112836008B (en) Index establishing method based on decentralized storage data
CN113449063B (en) Method and device for constructing document structure information retrieval library
CN110020436A (en) Microblog sentiment analysis method combining ontology and syntactic dependency
CN111133429A (en) Extracting expressions for natural language processing
Mahdi et al. A citation-based approach to automatic topical indexing of scientific literature
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
US11842152B2 (en) Sentence structure vectorization device, sentence structure vectorization method, and storage medium storing sentence structure vectorization program
CN112732743B (en) Data analysis method and device based on Chinese natural language
JP4567025B2 (en) Text classification device, text classification method, text classification program, and recording medium recording the program
JP6181890B2 (en) Literature analysis apparatus, literature analysis method and program
CN111859066A (en) Query recommendation method and device for operation and maintenance work order
CN111339272A (en) Code defect report retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant