CN117355827A - Method for organizing document search in unstructured database of application program - Google Patents


Info

Publication number
CN117355827A
Authority
CN
China
Prior art keywords
document
memory
keyword
keywords
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280008696.5A
Other languages
Chinese (zh)
Inventor
库尔马甘贝托夫·阿努阿尔·莱哈诺维奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ku ErmaganbeituofuAnuaerLaihanuoweiqi
Original Assignee
Ku ErmaganbeituofuAnuaerLaihanuoweiqi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from RU2022106802A external-priority patent/RU2792584C1/en
Application filed by Ku ErmaganbeituofuAnuaerLaihanuoweiqi filed Critical Ku ErmaganbeituofuAnuaerLaihanuoweiqi
Publication of CN117355827A publication Critical patent/CN117355827A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/06Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems

Abstract

The technical solution relates to the use of large application databases (hereinafter referred to as databases) containing unstructured or weakly structured data. The proposed data structure allows documents to be searched by multiple keyword sets, accurate to the page of a document, without reading the document from the database, and presents the search results in the form of an answer table, which simplifies screening documents against specified conditions. In the method, the program generates auxiliary tables that supplement the inverted index: a keyword table, a document table, a binary string table, and a document binary string table. After the keyword lists are selected from the inverted index, the document lists, represented as binary strings, are processed in the module that processes the result list. The binary strings are loaded into the dual memory for processing, where the number of rows (memory registers) is equal to 2^n, where n is the number of bits of the memory register; if all specified keywords appear in a document, the document number is equal to the row (register) number in the dual memory. In the module that processes document pages by keyword, an answer table is generated for each document. The dual memory is a logic memory structure that comprises a data input channel, a CM input for switching the circuit from normal memory mode to dual memory mode, an addressing module for the bit number in a memory cell, a switch, and logic switches. The outputs CS, RD, OE of the modules controlling writing/reading are connected to the outputs of the bit number selection module (column address).

Description

Method for organizing document search in unstructured database of application program
A method for searching documents in a large database of application programs based on unstructured data and dual memory hardware versions for implementing the method.
The technical solution relates to large databases (hereinafter referred to as databases) with unstructured or weakly structured data, such as libraries. The invention aims to search for and select, from a large document database, the documents that best match a user's query. Here, a document is a printed product such as a book, magazine, article, booklet, or report.
The ever-growing volume of accumulated data, together with people's ever-growing demand for information, requires continual improvement of information systems so that users obtain the documents (information) that best meet their needs (relevant information). When query conditions are ambiguous, the user has to interact with the system to carry out the necessary document search. Furthermore, the documents displayed on the screen must meet the following requirement:
the system should satisfy the user's needs as fully as possible without overloading the user with unnecessary information, which is achieved through a convenient mechanism for analyzing search results, visualization of the information provided, and interactive querying. At the same time, the information system should cover the entire database, and query processing time should not exceed 20 seconds (according to various estimates). For information systems with structured data (tables), searching and processing are already quite efficient and embedded in business processes; these are relational database management systems (RDBMS) such as Oracle, Microsoft, and MySQL. For unstructured data, however, good results are still a long way off.
Information search systems oriented toward unstructured data include NoSQL systems (Shashank Tiwari, Professional NoSQL, John Wiley & Sons Inc., 2011, 384 pages, ISBN 9780470942246, [1]), internet search engines such as Google, Yandex, and Rambler, and library search systems. The main disadvantages of these systems include:
Documents have low relevance to queries.
The number of query result documents is excessive.
It is difficult to filter out non-professional and unreliable information.
Existing search engines use an information search mechanism built on the inverted index (Manning C.D., Raghavan P., Schütze H., "Introduction to Information Retrieval", translated from English, Moscow, 2011, 528 pages, [2]). The mechanism creates a vocabulary of words from the information stored in the database (DB). At the developer's discretion, the vocabulary may contain any number of words sufficient to screen out the desired information. Application programs perform lexical, morphological, grammatical, and semantic analysis of the words. Many well-tested methods are used for word selection, including alphabetical selection, B-trees, topic segmentation, and others.
In constructing an inverted index (a data structure) for document searching, the following two main approaches are generally employed:
By document (US 2004030686A1 "Method and system of searching a database of records" (CARDNO ANDREW JOHN; MULGAN NICHOLAS JOHN, 12 th 2004), [3]; US2009030892 "System of effectively searching text for keyword, and method thereof" (IBM, 29 th 2009), [4]). In these references [3, 4], each index word corresponds to a list of the documents in which the word appears. Full-text retrieval is then performed over the documents to find those containing all the keywords.
By word (US 5696963 "System, method and computer program product for searching through an individual document and a group of documents" (SMARTPATENTS INC, 12/9/1997), [5]; US2004177064 "Selecting effective keywords for database searches" (IBM, 9/2004), [6]). In these references [5, 6], each index word corresponds to a list of the documents in which the word appears, together with the positions of the word in each document.
In both approaches, the data structure may carry various additional parameters (word and document frequencies, semantic features, and so on). Lists of documents and word positions are created and analyzed to find their intersections, forming a new list of documents in which all keywords appear.
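The document-level approach described above can be sketched as follows. This is only an illustration of a conventional inverted index with list intersection, not code from the cited references; the function names `build_index` and `search_all` and the sample documents are hypothetical.

```python
def build_index(docs):
    """Map each word to the sorted list of document numbers that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all(index, keywords):
    """Intersect the per-keyword document lists to find documents with all keywords."""
    lists = [set(index.get(k.lower(), ())) for k in keywords]
    return sorted(set.intersection(*lists)) if lists else []


# Hypothetical three-document collection.
docs = {
    1: "vitamin c and citric acid",
    2: "oak tree",
    3: "acid rain harms the oak tree",
}
index = build_index(docs)
both = search_all(index, ["acid", "tree"])
```

Every additional keyword adds another list to intersect, which is the processing cost the text attributes to this approach.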
As a prototype of the present invention, a method in units of documents [4] was employed.
The first approach has the advantage of a simple index of minimal size.
Its disadvantage is the large amount of processing required when many keywords are analyzed. It is also highly sensitive to the order in which words are processed: for example, after the first keyword combination is selected and the next additional word is specified, the search is performed only on the set of documents already found. The order of keyword processing therefore matters when requests are analyzed, corrected, and modified.
In the second approach, a list of the documents containing the word is first extracted for each word, as in the first approach, together with a list of positions accurate to the word's start and end positions (or length); the intersections of these lists are then analyzed.
The second approach is very convenient when the number of keywords is small. If the text is lost, it can be restored to some extent, though with a loss of quality.
Its disadvantage is the large amount of space needed to store the index, which may exceed the space occupied by the documents being processed, depending on the number of keywords. Storing and processing variable-length lists adds complexity. As the number of keywords to be analyzed increases and the database grows, the amount of processing becomes large.
As a prototype of the claimed device, a typical classical memory for reading and writing words was used (A. Tanenbaum, T. Austin, Computer Architecture, page 200, fig. 3.27, [7]). The operation of the memory is illustrated by an example with 4 memory cells (registers) of 3 bits each. In general, memory is limited neither in the number of register bits nor in the number of memory cells (registers). Existing chips may have 8, 16, 32, or 64 bits and 512, 1024, or more memory cells ([7], page 204, section 3.30). As the memory grows, its operating principle does not change: the scheme of fig. 3.27 [7] is replicated many times, with a corresponding increase in the number of register bits, inputs/outputs, and address lines. Each flip-flop stores one bit of information (the figure shows 4 rows of 3 flip-flops each), so the memory contains four 3-bit words. Each operation reads or writes a complete 3-bit word. The logic circuit has 8 input lines: 3 data inputs (I0, I1, and I2), 2 address inputs (A0 and A1), and 3 control inputs: CS (chip select, selects the memory element), RD (read, distinguishes reading from writing), and OE (output enable). It also has 3 data output lines: O0, O1, and O2. The state of the address inputs determines which memory word is written or read. The processing logic is to write the binary string <I0, I1, I2> (where Ii is 0 or 1, i = 0, 1, 2) into the memory cell (register) at the address <A0, A1> in accordance with the <CS, RD, OE> command. Likewise, according to the <CS, RD, OE> command, the contents of the register at address <A0, A1> can be read into the output binary string <O0, O1, O2>.
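The behavior of this classical word memory can be modeled in a few lines. The sketch below is a software analogy, not a description of the circuit in [7]: the class name `WordMemory` is hypothetical, the address <A0, A1> is assumed to be packed into a single integer, and the signal polarity (write when CS=1 and RD=0, read when CS=1, RD=1, and OE=1) is an assumption made for illustration.

```python
class WordMemory:
    """Toy model of a word-addressed memory: `cells` registers of `bits` bits each."""

    def __init__(self, cells=4, bits=3):
        self.bits = bits
        self.rows = [[0] * bits for _ in range(cells)]

    def access(self, address, cs, rd, oe, data=None):
        """Apply one <CS, RD, OE> command to the register at `address`.

        Write: CS=1, RD=0 stores `data`. Read: CS=1, RD=1, OE=1 returns the word.
        Any other combination leaves the memory untouched and outputs nothing.
        """
        if not cs:
            return None
        if not rd:
            if data is None or len(data) != self.bits:
                raise ValueError("a write needs exactly one full word")
            self.rows[address] = list(data)
            return None
        return list(self.rows[address]) if oe else None


mem = WordMemory()
mem.access(0b10, cs=1, rd=0, oe=0, data=[1, 0, 1])   # write <I0, I1, I2>
word = mem.access(0b10, cs=1, rd=1, oe=1)            # read <O0, O1, O2>
```

Every operation here touches one whole register; there is no way to write a string down a single bit position of all registers, which is the limitation the dual memory addresses.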
The logic circuit of fig. 3.27[7] is shown in generalized form as fig. 13, leaving only elements important to understanding: external pins (input, output, address, control signals) and memory flip-flops.
The prototype of fig. 3.27 [7] does not allow binary strings to be written by bit position across the memory cells (registers) (FIGS. 13 and 14), that is, by the columns C0, C1, C2 of FIG. 14.
The technical problem to be solved is to process the keywords of documents and to represent the results in an intuitive, compressed form that is accurate to the page of a document. This allows a quick assessment of how well a document matches the query without reading the document from the database. The user then makes the final decision independently within the pages found by the analysis, or interactively adds or modifies keywords and views the changed results on the screen.
The technical result provided by the claimed solution is a hardware implementation that lays out an ordered set of binary strings <Str1, Str2, ..., Strk> as a fixed-length table in which each string Strj is a column. The rows of the table map to registers <R1, R2, ..., Rn>, which are themselves binary strings and support row-by-row reading and analysis: the register Rj holds the j-th bit of every string in <Str1, Str2, ..., Strk>. Assuming that the i-th string Stri is associated with the i-th keyword and that the j-th bit of Stri reflects the document or page number j, this eliminates the intersection of the lists <Str1, Str2, ..., Strk> for different keywords and speeds up the screening of documents and document pages, because each whole row Rj relates to the same j-th document or page.
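A software analogy may help to show why the column layout removes the need for list intersection. In the sketch below, the function names `rows_from_columns` and `full_rows` and the sample bit patterns are hypothetical, and rows are assumed to be numbered from 1; the hardware performs the equivalent mapping directly when strings are written as columns.

```python
def rows_from_columns(columns):
    """Turn k equal-length bit columns <Str1..Strk> into n rows <R1..Rn>."""
    lengths = {len(col) for col in columns}
    if len(lengths) != 1:
        raise ValueError("all binary strings must have the same fixed length")
    return [tuple(col[j] for col in columns) for j in range(lengths.pop())]


def full_rows(rows):
    """Return the 1-based numbers of rows in which every keyword bit is set."""
    return [j for j, row in enumerate(rows, start=1) if all(row)]


# Three keyword strings over six documents (or pages), one bit per document.
columns = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 1],
]
rows = rows_from_columns(columns)
matches = full_rows(rows)
```

A row consisting entirely of ones identifies a document or page that contains every keyword, so the row number itself is the answer.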
Technical achievements include proposed data structures including keyword tables (index dictionary), document tables, binary list tables, and document binary string tables. The proposed data structure allows to search documents precisely according to a set of keywords up to the pages of the document without reading the documents from the database and to present the search results in the form of answer tables, simplifying the process of screening the documents according to given conditions.
Other technical achievements include:
when searching, the document itself does not need to be read from the database; only the pages that exactly meet the query need to be checked;
the user can work with the system interactively, cyclically refining or modifying the query by adding and removing keywords while analyzing pages, adding documents to or removing them from the list of processed documents, and adding words to the list of search keywords;
the document search process scales and parallelizes well;
the method allows priorities to be set in the keyword list and a weight to be assigned to each keyword, and the system can refine queries and make suggestions based on collected statistics, including personal statistics, in order to build a personal user model;
different sets (for example, a vitamin set, an acid set, a tree set) can be specified over the keyword set {Si} in order to refine the meaning of a document and find other answer options;
the binary string of document page numbers has a fixed length of 128 bytes, which makes it possible to dispense with sorting page lists and to implement a simple mechanism for visual analysis of document pages;
with a binary string of fixed length 128 KB or more, the DM memory can be used for large databases (tens of millions of documents or more), and sorting of document lists can be dispensed with. In this case the length of the binary string should correspond to the maximum number of documents in the database, because the bit numbers in the binary string correspond to the document numbers in the database. For example, a 128 KB binary string corresponds to 1 million documents;
multiple (tens or hundreds of) keyword sets can be processed in parallel;
the personalized importance of keywords is set by the user, who can view the keyword processing results in any order simply by adjusting the order of the columns in the answer table. This allows these priorities to be taken into account when screening pages and less important words to be dropped if necessary;
the answer table finally selected by the user is a convenient tool for constructing formal algorithms for applied document analysis, for building various metrics and document classification spaces, and for building a model of the user's personal information space;
the processing accuracy is limited to the page of a document, and the pages are presented to the user for the final relevance evaluation. If necessary, keyword processing can easily be refined within a single page.
The essence of the claimed method is as follows. The user constructs a query and specifies keywords and the logical operations relating them. The keywords are then extracted in the query keyword processing block, using the inverted index and the program that forms, develops, and maintains it; this program interacts with the index block, the index block interacts with the database, and lists are selected by keyword. The method is characterized in that:
the program generates auxiliary tables that supplement the inverted index: a keyword table, a document table, a binary string table, and a document binary string table, wherein:
the keyword table contains a list of keywords, each with references to a document binary string and to a row of the document table, together with other information: the number of documents that use the keyword and other keyword data such as terms, abbreviations, sets, or set elements;
the document table contains, for each keyword, a list of pairs, each pair consisting of the number of a document that uses the keyword and a reference into the binary string table;
the binary string table contains a list of fixed-length binary strings, each associated with a document number in the document table; each bit of a binary string corresponds to a text page of the document, the bit number in the string corresponds to the page number, and a 1 or 0 in that position indicates the presence or absence of the given keyword on that page;
the document binary string table consists of fixed-length binary strings in which the bit number corresponds to the document number, and a 1 in a bit indicates that the specified keyword appears in the document with that number;
after the keyword lists are selected from the inverted index, the document lists, expressed as binary strings, are processed in the result list processing block: the binary strings are loaded into the dual memory, in which the number of rows is equal to 2^n, where n is the number of bits of the memory register; if all specified keywords appear in a document, the document number is equal to the row number in the dual memory, so no list intersection operation is required; and furthermore
in the block that processes document pages by keyword, an answer table is generated for each document: binary strings are loaded from the binary string table into the dual memory, then ordered and analyzed. In the answer table, each column corresponds to a keyword, the columns are ordered by importance, and each keyword column corresponds to one binary string in the binary string table. The answer table shows the keywords of a document accurately by page without loading the document itself from the database.
The essence of the claimed device is a dual memory used to organize document search in an unstructured application database. It is a memory logic circuit capable of writing data to a memory cell at a given address, reading the data written to the memory cell at a given address, and outputting the read data to the output lines according to the control signals CS, RD, OE, and CM. The device is characterized by:
a second data input channel for writing a bit string into a specified bit (column number) of all memory cells, where the length of the bit string is equal to the number of memory cells, and the number of bits and the number of memory cells are limited only by microelectronics technology;
a CM input that switches the circuit from normal memory mode (input data is written into a memory cell) to dual memory mode (input data is written into the specified bit of every memory cell);
an address block for the bit number (column number) within a memory cell, which specifies, for all memory cells, the common bit position into which data is written;
a switch placed after the memory cell address block, which passes or blocks the memory cell address signal according to the control signal CM;
logic switches at the inputs of all memory flip-flops, which switch the channel from the memory cell address to the bit address (row) within the memory cell according to the control signal CM;
logic switches at the data inputs of all memory flip-flops, which switch the data receiving channel according to the control signal CM;
a connection of the outputs of the write/read control blocks CS, RD, OE to the outputs of the bit number selection block (column address).
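The two write modes selected by CM can be illustrated with a small software model. This is a hedged sketch rather than the claimed circuit: the class name `DualMemory`, the method signatures, and the convention that CM=0 means normal mode and CM=1 means dual mode are all assumptions made for the example.

```python
class DualMemory:
    """Toy model of the dual memory (DM): r cells of n bits, with a CM mode input."""

    def __init__(self, cells=4, bits=4):
        self.cells, self.bits = cells, bits
        self.rows = [[0] * bits for _ in range(cells)]

    def write(self, cm, index, data):
        """CM=0: write an n-bit word into cell `index` (normal mode).
        CM=1: write an r-bit string into bit position `index` of every cell."""
        if cm == 0:
            if len(data) != self.bits:
                raise ValueError("normal mode expects one word of n bits")
            self.rows[index] = list(data)
        else:
            if len(data) != self.cells:
                raise ValueError("dual mode expects one bit per memory cell")
            for row, bit in zip(self.rows, data):
                row[index] = bit

    def read(self, address):
        """Read the full word stored in the cell at `address`."""
        return list(self.rows[address])


dm = DualMemory(cells=4, bits=4)
dm.write(cm=1, index=0, data=[1, 0, 1, 1])   # keyword string written as column 0
dm.write(cm=1, index=1, data=[1, 1, 1, 0])   # column 1
dm.write(cm=1, index=2, data=[0, 1, 1, 1])   # column 2
dm.write(cm=1, index=3, data=[1, 1, 1, 1])   # column 3
hits = [a for a in range(dm.cells) if all(dm.read(a))]
```

After the four column writes, reading cell by cell in normal mode yields one word per document or page, and a word of all ones marks a match on every keyword.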
Brief description of the drawings.
FIG. 1: generalized scheme of the method for organizing document search in a large database;
FIG. 2: working table data structures;
FIG. 3: binary string;
FIG. 4: example of a document answer table, displaying by column the document binary strings ASDi;
FIG. 5: example of a page answer table, displaying by column the strings Stri corresponding to the keywords;
FIG. 6: example of a compressed and sorted answer table;
FIG. 7: example of an answer table, +=0.2;
FIG. 8: example of a text fragment to be analyzed;
FIG. 9: example 1, block diagram of the stages of processing the text to be analyzed;
FIG. 10: example 1, pad table;
FIG. 11: overview of a conditional answer table;
FIG. 12: representation of six binary columns 12 bits long;
FIG. 13: logic diagram of a 4x3 memory;
FIG. 14: generalized block diagram of the 4x3 memory of FIG. 13;
FIG. 15: generalized block diagram of a memory of r memory cells of n bits;
FIG. 16: generalized block diagram of a 4x4 DM memory;
FIG. 17: generalized block diagram of a memory of r memory cells of n bits;
FIG. 18: example logic diagram of a 4x4 DM memory.
List of symbols in FIGS. 13-17:
I0, I1, I2, I3 - input data written into a register;
A0, A1 - two inputs for addressing the memory cells;
CS - memory element select;
RD - read (distinguishes reading from writing);
OE - data output enable;
J0, J1, J2, J3 - input data for writing columns;
CL0, CL1 - two inputs for addressing columns;
t.0, t.1, t.2, t.3 - rows (bit numbers within a memory cell);
O0, O1, O2, O3 - output data read from the memory cells;
CS·RD·OE - data output enable.
Implementation of the method.
The method of organizing document search in a large database (FIG. 1) includes the following typical steps:
a user (4), as an interested consumer of information, formulates a query (5) in a language close to SQL, specifying keywords and the logical operations relating them; the query is mostly limited to a simple word list;
the keywords in the query are extracted in the query keyword processing block (6);
the programs (3) that generate, develop, and maintain the inverted index interact with the index block (2), which in turn interacts with the database (1); document lists (7) are selected by keyword, the lists are intersected to find the set of documents that contain all specified keywords, and the specified logical conditions are applied to that set to obtain the result list. This document list is already an intermediate search result and may be displayed to the user. Current search systems supplement it with a full-text search of all documents in the list, which allows all found keywords to be highlighted in color when the document text is displayed, making it easier for the user to review.
Blocks 1 through 7 of FIG. 1 cover the typical functions of building and using inverted indexes in existing information search systems.
In the block (8) that processes the resulting document list, the address of the binary string is read from the document table (12 in FIG. 2) by document number, and then the binary string of the document for each keyword is read from the binary string table (13 in FIG. 2).
The module (9) that processes document pages by keyword generates an answer table (FIG. 4) for each document: the binary strings are loaded into the dual memory (DM, FIG. 17), then ranked and analyzed, and the match between the answers and the query is relaxed or tightened according to the specified parameters. The user may interactively select and view document pages and change the query, the importance of keywords, and other page screening parameters.
The final filtered list of documents (10) is output as the answer to the submitted query.
In existing search systems, document lists are selected by keyword (block 7 of FIG. 1), and the program then intersects these lists to find the resulting list of documents that each contain all keywords.
In the proposed scheme, the intersection of document lists may be performed in the DM memory. To do this, the document list must be represented as a binary string (as shown in FIG. 3), handled in the same way as the binary page string Stri but with a larger capacity (millions of bits). This requires a larger DM memory and therefore increases cost. For example, processing a binary page list requires binary strings of only 128 bytes, while processing a binary document list requires strings of 128 KB or more in order to load a list of 1 million documents at once. In this case, the bit number in the binary string corresponds to the number of the document in which the keyword occurs. Processing a list of 14 million documents requires 14 successive loads of the DM memory. All operations performed in the DM memory are identical to those used for processing document page lists. As in existing systems, a document is selected from the document answer table (15) and inserted into the result document list in block 7 of FIG. 1 when all keywords are present in the document's row; in FIG. 4, for example, documents 8 and 11 are found in rows 8 and 11.
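The sizing figures above can be checked with simple arithmetic. The sketch below assumes one bit per document and a string of exactly 128 KB (131,072 bytes); the names `DM_STRING_BYTES`, `DOCS_PER_LOAD`, and `dm_loads` are hypothetical.

```python
import math

DM_STRING_BYTES = 128 * 1024          # one 128 KB binary string
DOCS_PER_LOAD = DM_STRING_BYTES * 8   # one bit per document


def dm_loads(total_documents, docs_per_load=DOCS_PER_LOAD):
    """Number of successive DM loads needed to cover all document numbers."""
    return math.ceil(total_documents / docs_per_load)


loads = dm_loads(14_000_000)
```

A 128 KB string holds 1,048,576 bits, which is why it covers roughly 1 million documents, and 14 million documents need 14 loads under this assumption.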
The database (1) is a repository of all accumulated documents. The format and structure of the data is determined by the standard data management system selected. All necessary software is provided: tools to record documents, manipulate data, read data, languages to interact with data (data description, data manipulation, queries).
The program (3) that generates, develops and supports the reverse index work performs recording, reading, sorting, modifying, generating data work structures and generating keyword work vectors (pointers to documents), dividing or merging indexes, containing or excluding words, analysis, and the like.
The proposed method is aimed mainly at finding factual information in printed publications held by specialized libraries (for example, biology, genetic engineering, pharmacological and pharmaceutical information, mechanisms of drug action on organisms, methods of treating diseases).
Let Ω = {Dj}, j = 1, ..., m, be the set of documents Dj, where m is the number of documents, for example 700,000 or more. A document may be a book, magazine, article, archive, research protocol, set of results, and so on, that is, any printed publication suitable for processing and display on a computer screen.
Unlike prototype [2], the proposed method comprises the following data structures for blocks 8, 9 and 10: a keyword table (11), a document table (12), a binary string table (13) and a binary document string table (14). The structure of the above table is shown in fig. 2.
The keyword table (11) contains a list of r rows. Each row comprises:
the keyword Si, i = 1, ..., r;
the number Cj, j = 1, ..., k, the number of documents containing the keyword;
the link address ASDi, i = 1, ..., n, pointing to a row in the document binary string table (14);
the link address ADi, i = 1, ..., n, pointing to a row in the document table (12).
Here, the number of words is greater than the number of links. For example, the word "house" and the words "cabin", "house" may share a link.
The keyword table (11) may be divided arbitrarily into several sub-tables. These parts relate to the semantics of the keywords: literary vocabulary, subject-domain vocabulary, abbreviations, synonyms, names, and so on. The keyword table (11) may also be extended with columns that reflect properties of the vocabulary and its groupings: synonyms, antonyms, categories (vocabulary sets), subject terms (mathematics, chemistry, biology, etc.), semantic categories, artificial metrics, and so on.
The document table (12) contains t rows {Spi}, i = 1, ..., t, where each row Spi consists of a set of pairs:
Spi = {<Nd1, Ast1>, <Nd2, Ast2>, <Nd3, Ast3>, ..., <Ndt, Astt>},
where Ndj, j = 1, ..., t, is the number of a document containing the keyword from the keyword table (11);
Astj is the link address pointing to a binary string in the binary string table (13).
In each row of the document table (12), the pairs are ordered by document number Ndj. The number Ci in the keyword table (11) gives the number of pairs in the document table (12) associated with the specified keyword. The document table (12) thus lists the numbers of all documents that contain the keywords of the keyword table (11).
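The links between the keyword table (11), the document table (12), and the binary string table (13) can be sketched with plain Python structures. This is one possible in-memory layout chosen for illustration, not the storage format of the claimed method; the class `KeywordRow`, the helper `page_string`, the sample words, and the use of list indexes as link addresses are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class KeywordRow:
    word: str        # Si
    doc_count: int   # Ci: number of pairs in the keyword's document-table row
    asd: int         # ASDi: row index in the document binary string table (14)
    ad: int          # ADi: row index in the document table (12)


# Document table (12): one row per keyword, pairs <Nd, Ast> sorted by document number.
document_table = [
    [(3, 0), (8, 1)],   # row for "vitamin"
    [(8, 2)],           # row for "acid"
]
# Binary string table (13): page bit strings, written as text here for readability.
binary_strings = ["0110", "1000", "1010"]
keyword_table = [
    KeywordRow("vitamin", 2, asd=0, ad=0),
    KeywordRow("acid", 1, asd=1, ad=1),
]


def page_string(word, doc_number):
    """Follow keyword -> document-table row -> page bit string for one document."""
    row = next((k for k in keyword_table if k.word == word), None)
    if row is None:
        return None
    for nd, ast in document_table[row.ad]:
        if nd == doc_number:
            return binary_strings[ast]
    return None
```

For example, `page_string("vitamin", 8)` follows the link ADi to the pair <8, 1> and returns the page string stored at address 1, without touching the document itself.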
This allows the amount of information for each keyword to be evaluated and the limits for eliminating redundant keywords to be determined or new keywords to be constructed. Words that appear in 95% of the documents, which are called stop words, cannot effectively distinguish the documents and extract the desired content from them. New keywords may be formed from a combination of existing keywords.
The binary string table (13) contains a list of m fixed-length binary strings {Stri}, i = 1, ..., m. The structure of a binary string Stri is shown in FIG. 3: each cell is a bit corresponding to a page of document text, the bit number in the string corresponds to the page number of the document, and a 1 or 0 indicates the presence or absence of the given word on that page.
Suppose that a document Dj has 2 million characters, roughly 1000 pages of text. It corresponds to a binary string of 1000 bits; rounding the length up to a power of 2 gives 1024 bits, or 128 bytes, which leaves spare bits. Books with more pages are assumed to be rare, and any that occur are divided into volumes (possibly artificially).
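A page string of this kind can be built as shown below. The sketch assumes 1-based page numbers, most-significant-bit-first packing within each byte, and a length rounded up to a power of two; these conventions and the function names `page_bitstring` and `has_keyword` are illustrative assumptions, since the source does not specify a bit order.

```python
def page_bitstring(pages_with_keyword, page_count):
    """Build a fixed-length bit string; bit p (1-based) is set if page p has the keyword."""
    length_bits = 1
    while length_bits < page_count:      # round up to a power of two
        length_bits *= 2
    length_bits = max(length_bits, 8)    # keep at least one whole byte
    buf = bytearray(length_bits // 8)
    for page in pages_with_keyword:
        if not 1 <= page <= page_count:
            raise ValueError(f"page {page} is outside the document")
        pos = page - 1
        buf[pos // 8] |= 0x80 >> (pos % 8)
    return bytes(buf)


def has_keyword(bitstring, page):
    """Test the bit for a 1-based page number."""
    pos = page - 1
    return bool(bitstring[pos // 8] & (0x80 >> (pos % 8)))


s = page_bitstring([1, 17, 1000], page_count=1000)
```

For a 1000-page document the string is 1024 bits (128 bytes) long, matching the figure given in the text.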
The proposed data structure is open: it can be extended at any time with parameters needed for processing that reflect lexical, document-frequency, semantic, logical, morphological, and other features. This takes the form of added table dimensions and/or additional tables with further features of documents and vocabulary. For example, table 11 may be supplemented by additional tables of two- and three-word keywords, abbreviations, and so on. The structure may be supplemented by a table of keyword sets/subsets, for example a vitamin set containing a list of vitamins or a tree set containing a list of tree species. Tables 12 and 13 may also include additional parameters, in particular a 24 byte reserved space in each binary string.
The steps for generating tables (11), (12) and (13) are shown in FIGS. 8,9 and 10.
In constructing the query (5) (fig. 1), the user specifies a list of keywords <S1, S2, …, Sk>, ordered by importance to the user. Queries in a classical query language using the logical operators AND, OR, NOT can also be constructed. Keywords are set by the user based on personal knowledge of the desired document content. The degree of importance is the user's individual perception of the specified keywords (a numerical value or simply the word order) and can be:
set through a linear or other function, with a personalized subject-area model built independently for each user;
set by the user when listing (entering) the words;
set automatically according to the frequency characteristics of the vocabulary.
The degree of importance allows less important words to be discarded and more attention to be paid to combinations of more important words. The number of keywords is limited only by the user's reasoning and level of knowledge; it may be set to 64 words, although some applications, such as processing keyword sets, may require more.
Based on the query, an answer table (16) is constructed for each document (FIG. 5), with each column corresponding to a keyword.
The columns are ordered by importance. Each keyword column corresponds to one binary string in the binary string table (13). Thus, in constructing the answer table, the program of module (9) reads and collects fixed-length binary strings from the binary string table (13) for each document and each keyword, and places them in the columns of the answer table (16) in the order of importance specified by the user (fig. 5). The words are recorded in the following order: by default, the further left a word stands in the answer table, the more important it is. The word Si+1 is more important than the word Si (fig. 5).
FIG. 5 shows an example of an answer table (16) with 16 keywords for a document of m pages. Each row i of the answer table (16) (FIG. 5) reflects all keywords encountered on page i of document Dj and is represented as a 16-bit string that can be read as an integer in the range 0 to 65535 (2^16 − 1).
These numbers (bit strings per page) can well explain and show which keywords are encountered on a given page (line). The maximum value of the number corresponds to a string that is completely filled with 1's, i.e., all keywords are encountered on a given page. This allows the construction of simple, straightforward and adaptive algorithms:
a) The table is reviewed and the page rows matching all keywords are selected: their binary number equals 2^n − 1, where n is the number of keywords (65535 for n = 16). In most user queries (up to 98%), users do not use logical search criteria but simply enumerate keywords;
b) The table may be compressed for visual analysis of the selected document. Zero rows, or rows below a screening level specified by the user, are excluded from the visualization. The remaining rows may be ordered by keyword importance and by the number of 1s. An example conversion of the answer table (16) in fig. 5 is shown in fig. 6. Any text processing algorithm may be included: highlighting words and phrases, handling the distance between words, interpretations, synonyms, etc.;
c) Keywords are chosen by excluding the smallest combination of the least important words such that the row for the remaining words is filled with 1s. For example, with the answer table (16) in fig. 5, if the keywords S1, S6, S8, S14 are excluded, a document is obtained on page 1 of which all remaining keywords match, with the minimum number of exclusions. The corresponding number will be equal to 2^12 − 1. Various semantic algorithms may be used, taking into account the semantic value of the word, the document topic area and the characteristics of the user's needs.
g) Personalized browsing settings. When a large number of keywords are included in a query (typically more than 20, depending on the knowledge domain and skill level of the user), the user may specify a cutoff threshold £, where
0 < £ ≤ 1,
and £ represents how closely the viewed material is required to conform to the user's query.
Pages whose number of keywords is below the specified level £ × n (rounded), where n is the number of keywords in the query, are not included in the final answer table (16). For example, for the answer table (16) in fig. 5, with 16 keywords, a cutoff threshold of £ = 0.2 (16 × 0.2 = 3.2, i.e. 3 keywords) may be set; the answer table (16) (fig. 5) will then be as shown in fig. 7.
d) The results can be viewed at any stage: new keywords can be added, existing keywords excluded, and a different order of keyword importance specified (the order determines the preference system and changes the calculated indicators, the document ranking and the amount of output information).
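The row-as-integer reading of the answer table, the full-match test and the cutoff threshold can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the first column is assumed to be the most significant bit, and the rounding uses Python's built-in `round`, which may differ from the rounding rule intended by the description for values ending in .5.

```python
def answer_rows(columns: list[list[int]]) -> list[int]:
    """Turn keyword columns (one bit per page) into one integer per page.

    columns[0] is treated as the most significant bit (an assumption).
    """
    n_pages = len(columns[0])
    rows = []
    for page in range(n_pages):
        value = 0
        for column in columns:
            value = (value << 1) | column[page]
        rows.append(value)
    return rows


def full_matches(rows: list[int], n_keywords: int) -> list[int]:
    """1-based page numbers whose row is all 1s, i.e. equals 2**n - 1."""
    all_ones = (1 << n_keywords) - 1
    return [i + 1 for i, value in enumerate(rows) if value == all_ones]


def apply_cutoff(rows: list[int], n_keywords: int, threshold: float) -> list[int]:
    """Keep pages holding at least round(threshold * n) keywords."""
    minimum = round(n_keywords * threshold)
    return [i + 1 for i, value in enumerate(rows)
            if bin(value).count("1") >= minimum]


# Four keywords (columns) over a four-page document.
columns = [
    [1, 0, 1, 0],  # S1
    [1, 0, 1, 1],  # S2
    [1, 0, 0, 0],  # S3
    [1, 1, 1, 0],  # S4
]
rows = answer_rows(columns)
print(rows, full_matches(rows, 4), apply_cutoff(rows, 4, 0.5))
```

With 16 keywords and a threshold of 0.2, `round(16 * 0.2)` gives 3, matching the 3-keyword minimum in the example of fig. 5.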
DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION
Example 1. Taking a book text segment as an example, the steps of generating the auxiliary tables 1, 2 and 3 are explained (fig. 8). A generalized exemplary scenario for processing documents and descriptions and examples thereof are shown in fig. 9.
In the stage of reading a document, all keywords S1, S2, S3.. Sn in the document are written into a temporary file W:
The file W = {S1 = "introduction", S2 = "data", S3 = "foreign", S4 = "document", S5 = "numerous", S6 = "study", S7 = "geninuowa", ...}. Lower-case and upper-case letters are treated as identical.
The words are processed to meet the requirements of building a dictionary.
The keywords in the temporary file W are written into the keyword table (11). If a word is new, a row is added to the keyword table (11). Next, information is appended to the document table (12) and the binary string table (13).
In one row of the keyword table (11) (fig. 9), {<introduction, 1, 1>}, the first 1 represents the number of documents containing the word; since the first document is being loaded, 1 is filled in. The second 1 represents the address link to the first list in the document table (12).
4) In the first list Sp1 of the document table (12), the document number (in the document table (12), this is the first document) and the address link of the binary character string in the binary character string table (13) are recorded:
the list Sp1 = {<1, 1>}. Here, the first 1 represents the document number, and the second 1 represents the first binary string.
In the first binary string of the binary string table (13), the first bit position is set to 1, indicating that the word from table 11 (introduction) appears in the first page. In the binary string, the number of each bit corresponds to the page number in the document (e.g., binary string <1,0,0,0,1,1,0,0,0, … > indicates that a given word is present on the first, fifth, and sixth pages).
The next word is read from the temporary file W list and step 2 is performed. At the same time, the length of the word list in the file W is controlled. After all the words in the file W have been processed, it goes to read a new document, i.e. step 6 is performed.
The end of the document list is checked. In the binary string table (13), the 1 in the first position of all the binary strings displayed in fig. 10 indicates that all keywords in the keyword table (11) are located on the first page of the document.
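The table-building steps of Example 1 can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function name `build_tables` is hypothetical, words are obtained by naive whitespace splitting, links are 0-based list indices rather than the 1-based links of the example, and the binary strings are not padded to a fixed length.

```python
def build_tables(documents: dict[int, list[str]]):
    """Build sketches of the keyword (11), document (12) and string (13) tables.

    documents maps a document number to the list of its page texts.
    """
    keyword_table = {}   # word -> [number of documents, link into document_table]
    document_table = []  # link -> list of (document number, link into string_table)
    string_table = []    # link -> list of bits, one bit per page

    for doc_no, pages in documents.items():
        seen = {}  # word -> link to this document's binary string
        for page_no, text in enumerate(pages, start=1):
            for word in text.lower().split():  # upper and lower case are identical
                if word not in seen:
                    string_table.append([0] * len(pages))
                    seen[word] = len(string_table) - 1
                    if word not in keyword_table:  # new word: add a row
                        document_table.append([])
                        keyword_table[word] = [0, len(document_table) - 1]
                    entry = keyword_table[word]
                    entry[0] += 1
                    document_table[entry[1]].append((doc_no, seen[word]))
                string_table[seen[word]][page_no - 1] = 1
    return keyword_table, document_table, string_table


kw, dt, st = build_tables({
    1: ["Introduction data", "data study"],
    2: ["study"],
})
print(kw["study"], dt[kw["study"][1]], st[1])
```

Here the word "study" occurs in two documents, so its row in the keyword table holds a count of 2 and a link to a list with one (document, binary string) pair per document.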
Example 2. The answer table (16) (fig. 5) may be formed column by column in a dual memory (DM) microchip. Each column corresponds to a keyword, and the register address corresponds to the document page number. Thus, one loading of the DM can process one document from the database (1). The keyword set for a given topic area can be optimized by setting a keyword cutoff boundary in the universal dictionary of the keyword table (11). For example, a frequency boundary may be set specifying that a keyword must occur in no more than 20% of the documents (of the whole database or of a part of it). Keywords that occur more frequently are excluded from the keyword table (11).
Boundaries for document filtering and page display may be set. For example, only those pages are displayed on which the share of keywords found exceeds the set boundary, e.g. 80% (where keyword weights are defined, the cutoff boundary may be a number). To display only pages on which all keywords appear, the boundary may be set to 100%, i.e. only pages containing all 16 keywords (for example, fig. 5) are selected.
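As a rough illustration of the frequency boundary, the following Python sketch drops keywords whose document share exceeds a limit. The function name `prune_keywords` and the sample counts are hypothetical; the document counts are assumed to come from the keyword table (11).

```python
def prune_keywords(doc_counts: dict[str, int], total_documents: int,
                   max_share: float = 0.20) -> dict[str, int]:
    """Keep only keywords found in at most max_share of the documents."""
    return {word: count for word, count in doc_counts.items()
            if count / total_documents <= max_share}


# Hypothetical counts for a database of 100 documents.
counts = {"the": 95, "oxygen": 12, "vitamin": 20}
kept = prune_keywords(counts, total_documents=100)
print(sorted(kept))
```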
Any semantic rule may be set: the necessary and replaceable words are set, the necessary existence of an indivisible word combination (for example, "oxygen concentration", "carbon dioxide permeation increase") is determined, and the distance (number of characters) between the words is set.
All classical logical operations on keywords are implemented, such as AND, OR, NAND and NOR.
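Because each keyword is represented by a fixed-length binary string with one bit per page, these logical operations reduce to bitwise operations. The Python sketch below assumes each string is held as an integer and uses a hypothetical helper `page_ops`; the mask keeps the negated results within the page count.

```python
def page_ops(a: int, b: int, n_pages: int) -> dict[str, int]:
    """Combine two page bit strings (as integers) with classical operators."""
    mask = (1 << n_pages) - 1  # limits NAND/NOR results to n_pages bits
    return {
        "AND": a & b,
        "OR": a | b,
        "NAND": ~(a & b) & mask,
        "NOR": ~(a | b) & mask,
    }


result = page_ops(0b1100, 0b1010, n_pages=4)
print({name: format(value, "04b") for name, value in result.items()})
```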
Example 3. The result of the document processing is presented in the form of answer table (16) fig. 6. Pages that do not encounter keywords are not listed here. All pages in the table are ordered according to the number of encountered keywords, and if the number of keywords is the same, the pages are displayed according to the arithmetic sequence of the importance of the keywords and the page numbers.
This allows for an intuitive view of the overall match between the document and the query. Hovering a cursor over:
on the column index (Si), the complete name of the keyword is displayed on the screen.
On the page number, the text of the whole page is displayed on the screen, and a user can check whether the text meets the requirement.
All keywords on the page displayed on the screen are highlighted in color, with the importance of the word represented in color. The color of a word depends on the color spectrum (blue corresponds to low importance, red corresponds to the most important word).
Additional personal access settings. The user may specify a cut-off level £ for the answer table (16), where
0 < £ ≤ 1.
The number £ reflects how closely the material viewed is required to match the query. Pages of the answer table (16) (fig. 5) whose number of keywords is below the given level £ × n (rounded), where n is the number of keywords in the query, are not included in the result. For example, for the table with 16 keywords in fig. 5, the user may specify a cut-off level of £ = 0.2 (16 × 0.2 = 3.2, i.e. 3 keywords); the answer table (16) (fig. 5) will then be as shown in fig. 7.
For additional analysis of the information on the found page, all conventional text processing algorithms (semantic analysis elements) can be used.
The selected document pages are displayed on a screen for evaluation by the user according to the screening criteria, and new keyword combinations are formulated taking into account possible refinements: including additional keywords or excluding specified ones.
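The page ordering used in Example 3 can be sketched as a sort key in Python. This is one possible reading, stated as an assumption: pages are ordered by the number of keywords found, ties are broken by the integer value of the row (with the most important keyword as the most significant bit), and remaining ties by page number. The function name `order_pages` is hypothetical.

```python
def order_pages(rows: dict[int, int]) -> list[int]:
    """Order page numbers for display; rows maps page number -> row integer.

    Pages without any keyword (row value 0) are not listed.
    """
    hits = {page: value for page, value in rows.items() if value != 0}
    return sorted(
        hits,
        key=lambda page: (-bin(hits[page]).count("1"), -hits[page], page),
    )


ordered = order_pages({1: 0b1111, 2: 0b0001, 3: 0b1101, 4: 0b0000, 5: 0b1011})
print(ordered)
```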
In practice, the answer table (16) is the basis for designing the DM memory logic scheme.
Device implementation.
Dual memory (DM) allows separate lists of binary strings Str1, Str2, ... to be processed. For clarity, the binary strings are represented in the form of a condition table (17) (fig. 11), where a column is a binary string and a row is a bit position within the binary strings. The intersection of the binary strings is determined at the bit level (one row of the table).
Here the binary string str1= < 01001000001..1 >. The string length is m bits. In the table shown, the number of rows is specified based on the number of bits of the longest string in the list of strings being processed, with shorter strings being zero-padded.
Consider the i-th row of the table. It shows which columns have a 1 and which have a 0 in the i-th bit position. For example, in row 11 all columns have a 1 in the 11th bit position (consider 1 from row 6 to row n-1). Reading each row as an integer (expressed in decimal) allows the entire set of binary strings to be encoded. Thus, for row 11, this number will be equal to 2^n − 1.
Fig. 12 shows an example of the same table (17) consisting of 6 columns, each 12 bits long. The same 11th row (in which all values are 1) will have an integer value equal to 2^6 − 1 = 63. The integer value of row 3 is equal to 4. Therefore, by reading the integer value of a row of the table, it is possible to know exactly which positions hold a 1. At the same time, the user sets the relative importance of the columns by specifying their order in the table: the column order is interpreted so that column Strj+1 is more important than column Strj. This allows semantic algorithms for the preliminary evaluation of pages to be built using a declarative approach.
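Reading a row integer back into column positions can be sketched in Python as follows. The bit weighting is an assumption chosen for illustration: column Strj is given weight 2^(j−1), so that a more important column Strj+1 has the larger weight. The function name `columns_with_one` is hypothetical.

```python
def columns_with_one(row_value: int, n_columns: int) -> list[int]:
    """Return the 1-based column numbers whose bit is set in a row integer.

    Assumes column j carries weight 2**(j - 1).
    """
    return [j for j in range(1, n_columns + 1)
            if (row_value >> (j - 1)) & 1]


print(columns_with_one(63, 6))  # a row with a 1 in every column
print(columns_with_one(4, 6))   # a row with a single 1
```

Under this weighting, a row value of 4 in a six-column table marks column 3 only; the actual column depends on the bit order used in the figure.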
A logical block diagram of the 4x3 memory is described in prototype [7] page 200 and is shown in FIG. 13. Fig. 14 shows a general diagram of the same classical memory consisting of 4 registers (memory cells), each consisting of 3 bits. The actual circuit construction is also similar, except that the number of bits of the memory cells (registers) may be 8, 16, 32 or 64 bits, and the number of memory cells may be from hundreds of thousands or more (circuit multiple extensions), but this is sufficient for understanding the logic of all example and generalization-based processes.
Next we present a generalized memory diagram. This makes it unnecessary to dwell on the implementation of the flip-flops and switching elements, which can be realized in a variety of ways depending on the developer's preferences and the technology used, while retaining the logic of all main memory functions: writing a word to, and reading it from, the memory cell (register) at a given address.
The general diagram of fig. 14 shows only the important elements: 12 storage flip-flops, each storing 1 bit of information, together with the input, output and control signals. Each flip-flop may be in one of two states, 1 or 0. The flip-flops are arranged in 4 rows (registers) of 3 flip-flops each. Input information in the form of binary strings is written into the memory cells (registers, i.e. rows). The register numbers are called the addresses of the memory cells (registers). The logic is as follows: the binary string <I0, I1, I2> (where each Ii equals 0 or 1, i = 0, 1, 2) is written to the register whose address is given by the binary string <A0, A1>, according to a command on the inputs <CS, RD, OE>. Likewise, according to a <CS, RD, OE> command, information is read from the register specified by the address <A0, A1> to the output binary string <O0, O1, O2>.
Fig. 15 shows a general classical memory logic diagram for r memory cells (registers), where each register consists of n bits, for recording the input signals I0, I1, …, In and reading the output signals O0, O1, …, On.
Known examples of random access memory microchips can be found in Fig. 3.30 on page 204 of [7].
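The behaviour of this classical 4x3 memory can be modelled in a few lines of Python. This is a behavioural sketch, not a gate-level model: the class name `ClassicMemory` is hypothetical, the address <A0, A1> is passed as an already decoded integer, and the signal convention (CS = 1 selects the memory, RD = 1 reads, RD = 0 writes, OE = 1 enables the output) is assumed.

```python
class ClassicMemory:
    """Behavioural model of a register-addressed memory (4 words x 3 bits)."""

    def __init__(self, registers: int = 4, bits: int = 3):
        self.bits = bits
        self.cells = [[0] * bits for _ in range(registers)]

    def access(self, cs: int, rd: int, oe: int, address: int, data=None):
        """Apply one <CS, RD, OE> command to the register at `address`."""
        if not cs:
            return None                      # chip not selected
        if rd == 0:                          # write <I0, I1, I2>
            if data is None or len(data) != self.bits:
                raise ValueError("write needs one bit per flip-flop")
            self.cells[address] = list(data)
            return None
        return list(self.cells[address]) if oe else None  # read <O0, O1, O2>


memory = ClassicMemory()
memory.access(cs=1, rd=0, oe=0, address=2, data=[1, 0, 1])
print(memory.access(cs=1, rd=1, oe=1, address=2))
```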
Fig. 16 shows a general block diagram of the proposed 4x4 DM memory example. Each horizontal line consists of 4 flip-flops, constituting a word. 4 memory locations (registers or 4 words) are shown.
Unlike prototype [7], the DM can either work like classical memory, reading and writing by register (by its address), or additionally allows binary strings Str1, Str2, …, Str4 to be input sequentially on the inputs <J0, J1, J2, J3>. Here the binary string Stri = <J0, J1, J2, J3>i, where i = 1, …, 4, is written to the column Ci, one of the columns C0, C1, C2, C3, whose address is set on the inputs <CL0, CL1>. The information is read in the standard manner, by register (word), to the outputs <O0, O1, O2, O3>. The input CM (Change Memory) switches the memory operating mode from normal to DM. In normal memory mode, it operates like the classical memory in fig. 13.
Fig. 17 shows an example of a general logic diagram of the DM memory, where each register has n bits and there are r registers. The additional data inputs J0, J1, …, Jm, where m ≤ r, are shown. For easier understanding of the logic diagram, reference may be made to the response table in fig. 11, where rows correspond to memory registers and columns correspond to binary strings. Of course, like conventional memory, the DM may use registers of any size, such as 16, 32, 64 bits or more, and any number of memory cells (registers), from 1024 (e.g., pages in the example) to millions (for handling large lists of documents).
Description of operation.
Fig. 18 shows an example implementation of the 4x4 DM memory that is shown in general form in fig. 17. It shows the CM input, which switches the circuit from normal memory mode to DM mode. In normal memory mode (CM = 1), the logic diagram is the same as that shown in fig. 13. In DM mode (CM = 0), the circuit switches to receiving data from the inputs J0, J1, J2, J3 (see figs. 17 and 18). The logic diagram of fig. 13 is supplemented as follows:
column address blocks CL0 and CL1 are used to set the current column address to which data J0, J1, J2, J3 is to be written.
In the register address block (A0, A1), switches controlled by the CM signal are installed on the outputs of the logical AND elements (1: pass the register address signal, 0: block the register address signal).
A multiplexer is installed on the C (synchronization) inputs of all storage flip-flops, i.e. the inputs that enable loading of data into the flip-flop. According to the CM control signal, the multiplexer switches the address input from the register address to the column address (points T0, T1, T2, T3).
A multiplexer is installed on all data inputs D of the storage flip-flop. The CM control signal switches the data reception channels from I0, I1, I2, I3 (when cm=1) to J0, J1, J2, J3 (when cm=0) on the D inputs of all flip-flops.
The junction T4 at the output of the element after the write/read control block (CS, RD, OE) is extended to the column address selection block.
When the CM input switches the circuit to DM memory mode (CM = 0), the column address is set at the top of the circuit by the CL0 and CL1 address lines, in the same way as a register address, and the input lines I0, I1, I2, I3 are blocked by the CM signal at the multiplexers in front of the D inputs of the flip-flops. To select the memory, external logic must set the CS signal to 1 and set the RD signal to 1 for reading or 0 for writing. The column address lines must indicate which four-bit column is to be written. During reading, the data input lines are not used. During writing, the bits on the data input lines J0, J1, J2, J3 are loaded into the selected memory column, and the output lines are not used.
The memory shown in fig. 18 operates as follows. The four AND gates in the upper part of the figure are used to select the columns and together form a decoder. The input inverters are arranged so that each AND gate is enabled by one particular column address. Each AND gate activates a column select line. When the chip needs to perform a write operation, the vertical line CS·RD takes the value 1, and one of the four write AND gates (points T0, T1, T2, T3) is activated. Which write AND gate is selected depends on which column select line has the value 1. The output signal of the write AND gate activates the C inputs of all flip-flops in the selected column, loading the input data into the flip-flops of that column. A write operation is performed only when CS equals 1 and RD equals 0, and only the column selected by the addresses CL0 and CL1 is written.
The read process is similar to the standard read process in fig. 13: CM switches to normal memory mode (CM = 1), the register address (A0, A1) is specified, and the CS·RD line takes the value 0, so all write AND gates are blocked and the flip-flops do not change. Instead, the word select line enables the AND gates connected to the Q outputs of the flip-flops of the selected word. Thus, the selected word transmits its data to the 4-input OR gates at the bottom of the circuit, while the other three words output 0. The output of each OR gate is therefore the same as the value stored in the selected word; the other three words do not affect the output data.
Thus, after the columns have been written into the DM memory, the information stored in the DM registers can be read by register address. The contents of the i-th register then show the i-th position of every binary string written by column, and the integer value of the register contains all of this information in compressed form. For the example of four keywords and four document pages, this corresponds exactly to the answer table (16) in fig. 5.
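The column decoder and write gating described above can be sketched as Boolean logic in Python. The function name `column_write_enables` is hypothetical, and the assignment of CL0 as the low-order address bit is an assumption; the sketch returns the values at points T0 to T3.

```python
def column_write_enables(cl0: int, cl1: int, cs: int, rd: int) -> list[int]:
    """Return the write-enable signals T0..T3 for the four columns.

    CL0 is assumed to be the low-order bit of the column address.
    """
    # Decoder: inverters on the inputs make each AND gate respond to one address.
    select = [
        (not cl0) and (not cl1),  # column 0
        cl0 and (not cl1),        # column 1
        (not cl0) and cl1,        # column 2
        cl0 and cl1,              # column 3
    ]
    write = cs and not rd         # a write needs CS = 1 and RD = 0
    return [int(bool(line and write)) for line in select]


print(column_write_enables(cl0=1, cl1=0, cs=1, rd=0))
```

Exactly one enable line is active during a write, and none is active during a read or when the memory is not selected.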
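The column-write, register-read behaviour of the DM can be modelled in Python as follows. This is a behavioural sketch under stated assumptions: the class and method names are hypothetical, the CM mode switch is represented simply by calling a different method, and column 0 is treated as the most significant bit when a register is read as an integer.

```python
class DualMemory:
    """Behavioural model of a dual memory: write by column, read by register."""

    def __init__(self, registers: int = 4, bits: int = 4):
        self.registers = registers
        self.bits = bits
        self.cells = [[0] * bits for _ in range(registers)]

    def write_word(self, address: int, data: list[int]) -> None:
        """Normal mode (CM = 1): write inputs I0..I3 to one register."""
        if len(data) != self.bits:
            raise ValueError("one bit per column expected")
        self.cells[address] = list(data)

    def write_column(self, column: int, data: list[int]) -> None:
        """DM mode (CM = 0): write inputs J0..J3 into one bit of every register."""
        if len(data) != self.registers:
            raise ValueError("one bit per register expected")
        for register, bit in enumerate(data):
            self.cells[register][column] = bit

    def read_register(self, address: int) -> int:
        """Read one register as an integer (column 0 is the top bit)."""
        value = 0
        for bit in self.cells[address]:
            value = (value << 1) | bit
        return value


dm = DualMemory()
keyword_strings = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0], [1, 1, 1, 0]]
for column, string in enumerate(keyword_strings):
    dm.write_column(column, string)   # one binary string per keyword
pages = [dm.read_register(r) for r in range(4)]
print(pages)
```

A register value of 2^4 − 1 = 15 marks a page on which all four keywords occur, so the intersection is obtained by reading registers rather than by intersecting lists.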
DM memories may be manufactured in a variety of forms. As a best option, the memory may contain 1024 registers (words) of 64 bits each (the number of columns, i.e. keywords). The memory can be a stand-alone device or be used as processor cache memory, which increases the speed of the algorithm.
Modification of logic circuits:
the same channel is used for writing input data to the registers (e.g. I0, I1, I2, I3) and to the columns (e.g. J0, J1, J2, J3), since these channels are not used simultaneously;
increasing the bit width of the registers to handle more keywords, e.g. 1024 registers of 128 bits;
for special applications-any number of registers (tens of millions level) to handle a large number of documents.
Multiple input ports (register sets) are constructed to enable parallel processing with multiple lists.
Therefore, the application proposes a page-accurate search method. In addition to the list of documents containing keywords, a binary list of document pages is generated (FIG. 3), and processing it replaces the stage of full-text analysis of the document. The document analysis is replaced with a page-by-page view of the response table in FIG. 4.
The technical problems are solved by the following modes:
the binary list of the document pages has a fixed length, so that the processing is convenient;
for a given keyword, all pages of a document are represented by a single binary string;
the hardware implementation of the DM memory can eliminate the ordering operation of different keyword pages;
DM memory allows multiple (dozens of) keywords (binary lists) to be processed simultaneously to find their intersection. The DM enables a set of binary lists for the specified keywords to be converted quickly into a page-by-page representation of the document (one memory register per document page), so that document pages containing the specified keywords can be viewed and screened quickly without list intersection operations.
The interpretation of the binary strings in applications is not limited in any way. The results may be read in integer form (e.g., rows in the response table of fig. 4); the value depends on the order in which the binary strings are written into the DM memory (the order of keywords in the query) and on the binary values located in the same position of the binary strings;
A retrieved page does not burden the user, because its size is limited;
in the search algorithm, there is no need to construct various complicated logic structures such as distances between keywords, order of mention of keywords, their application in the range of sentences, paragraphs, etc. All keywords on the page are highlighted with background, which is sufficient for the user to make a topic evaluation.

Claims (2)

1. A method of organizing document searches in an unstructured database of an application, comprising the steps of: the user generates a query, the user specifies keywords and logical operators, extracts the keywords in the query into keyword processing blocks, uses an inverted index with index building, maintenance, and expansion programs that interact with an index block that interacts with a database, and uses a program of keyword selection lists. The method is characterized in that:
-the program generates an auxiliary table: keyword table, document table, binary string document number and binary string to supplement inverted index, wherein
The keyword list contains a list of keywords, each keyword having a binary string pointing to the document and a reference to the document list, and further other information, such as text, abbreviations, a plurality of or a single element of the keywords, is specified.
The document table contains a set of key-value pair lists associated with the keywords, wherein the key-value pairs include document numbers using the keywords and references to the binary string table;
the binary string table contains a fixed-length binary string list, each string is associated with a document number in the document table, wherein each bit of each binary string corresponds to a page text of the document, the bit number in the string corresponds to a page number in the document, and a 1 or 0 placed in the position indicates whether the page contains a given keyword;
the binary string table of a document consists of a plurality of fixed-length binary strings, where each bit number corresponds to a document number, and a 1 on that bit indicates that the document contains a given keyword.
After the keyword list is selected from the inverted index, the document list expressed as binary strings is processed in the processing result list block. In this process, the binary strings are loaded into the dual memory for processing; a string in the memory whose value equals 2^n − 1, where n is the number of bits in the memory register, indicates that all specified keywords appear in the document whose serial number is the same as that of the string in the memory, so that no list intersection operation is required.
In a block that processes document pages by keywords, an answer table is generated for each document. In this process, binary strings are loaded from a binary string table into a dual memory, and then sorted and analyzed. In the answer table, each column corresponds to a keyword, and the columns are ranked by their importance. Each keyword column corresponds to one binary string in the binary string table. The answer table allows the keywords contained in the document to be reflected with page accuracy without the need to upload the document itself from the database.
2. A dual memory for organizing document searches in unstructured databases according to claim 1, which is a logical memory structure allowing data to be written according to a memory cell address, while allowing written data to be read according to the memory cell address and output to an output line, according to control signals CS, RD, OE and SM, characterized in that:
the inclusion of a second data input channel allows a bit string to be written to designated bits of all memory cells simultaneously. In this process, the length of the bit string is equal to the number of memory cells, which are limited by the manufacturing process of microelectronics;
The SM input has the function of a switching circuit, can switch the working mode of the circuit from a common memory mode to a dual memory mode, and writes input data into the designated bit of each memory cell;
an addressing block for addressing the specified bit number in a memory cell is included, the block specifying a common bit number for specifying the write of data to all memory cells.
Setting a switch behind the address block, and allowing or prohibiting the address signal of the storage unit to pass through according to the SM control signal;
the memory device comprises logic switches, which are arranged at the input ends of all memory triggers and are used for switching from the addresses of the memory units to the addresses of designated bits in the memory units according to SM control signals;
the logic switch is arranged at the data input end of all the memory triggers and is used for switching from an input data channel to an input channel with a specified bit according to an SM control signal;
the outputs CS, RD, OE of the write/read control block are connected to the outputs of the bit select block.
CN202280008696.5A 2022-03-16 2022-09-28 Method for organizing document search in unstructured database of application program Pending CN117355827A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
RU2022106802A RU2792584C1 (en) 2022-03-16 Method for organizing the search for documents in applied unstructured data bases and a hardware version of dual memory for its implementation
RU2022106802 2022-03-16
PCT/RU2022/050305 WO2023177321A1 (en) 2022-03-16 2022-09-28 Method of organizing a document search in applied databases

Publications (1)

Publication Number Publication Date
CN117355827A true CN117355827A (en) 2024-01-05

Family

ID=88023677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280008696.5A Pending CN117355827A (en) 2022-03-16 2022-09-28 Method for organizing document search in unstructured database of application program

Country Status (2)

Country Link
CN (1) CN117355827A (en)
WO (1) WO2023177321A1 (en)

Also Published As

Publication number Publication date
WO2023177321A1 (en) 2023-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination