WO2023177321A1

WO2023177321A1 - Method of organizing a document search in applied databases

Info

Publication number: WO2023177321A1
Application number: PCT/RU2022/050305
Authority: WO
Inventors: Ануар Райханович КУЛМАГАМБЕТОВ
Original assignee: Ануар Райханович КУЛМАГАМБЕТОВ
Priority date: 2022-03-16
Filing date: 2022-09-28
Publication date: 2023-09-21
Also published as: CN117355827A

Abstract

The invention relates to the field of computing. In the proposed data structure, auxiliary tables of key words, documents, document number bit strings and bit strings are generated, which complement an inverted index. Lists of documents represented by bit strings in a result list processing unit are processed, wherein the bit strings are uploaded to a dual memory for processing. In a unit for processing document pages according to key words, an answer table is generated for each document. The dual memory is a logical memory schema and contains data input channels, a CM input for switching the working mode of the schema from conventional memory to dual memory, a unit for addressing bit numbers in the memory cells, a switch, and logic switches. An output of a write/read control unit (CS, RD, OE) is connected to an output of a bit number selection unit. The invention makes it possible to carry out keyword searches for documents with page-level precision without reading the documents in a database and to present the search results in the form of an answer table to simplify the document selection procedure.

Description

METHOD FOR ORGANIZING DOCUMENT SEARCH IN APPLICATION DATABASES

The claimed technical solution relates to the use of large application databases (hereinafter referred to as DBs) with unstructured or weakly structured data, for example, libraries. The invention is intended for searching and selecting documents from large databases of documents that best meet the user's needs. In this case, the documents are printed publications, such as books, magazines, articles, brochures, reports, etc.

The rapid growth in the volume of accumulated data, combined with the increasing information needs of people, requires constant improvement of related information systems in order to obtain documents (information) that best meet the needs of the consumer (pertinent information). The requirement to search for documents needed by users, with vague and unclearly formulated queries, requires the interactive participation of the system user. In this case, the documents displayed on the screen must:

- not overload the user with unnecessary information (hundreds of relevant documents found);

- meet the needs of the user as much as possible, which is ensured by a convenient mechanism for analyzing search results, the clarity of the information provided and the ability to interactively adjust queries.

At the same time, the information system must cover the entire database (DB) and must not take long to process a request (according to various estimates, no more than 20 seconds). For information systems with structured data (tables), searching and processing data already works quite effectively and is built into business processes; these are relational database management systems (DBMS), for example, Oracle, Microsoft, MySQL. However, for unstructured data, there is still a long way to go before high-quality results are obtained.

Information retrieval systems focused on working with unstructured NoSQL data (Shashank Tiwari. Professional NoSQL. - John Wiley & Sons Inc, 2011. - 384 p. - ISBN 9780470942246, [1]) stored on the Internet include, for example, Google, Yandex, Rambler, and library search engines. The main disadvantages of the document search method used by these systems include:

- low degree of relevance of documents to the request.

- a large number of documents issued upon request.

- difficult screening of unprofessional and unreliable information.

In existing search systems, the information search mechanism is based on the construction of an inverted index (Manning C.D., Raghavan P., Schütze H. “Introduction to Information Retrieval.” Translated from English. - M.: Williams LLC, 2011. - 528 p., [2]). A dictionary of keywords is created from a selection of words taken from the information stored in the database (DB). The dictionary can include any number of words that is sufficient, from the developer’s point of view, to select the necessary information. Application programs for lexical, morphological, syntactic, and semantic analysis are used for this purpose. Words are selected using a wide range of well-established methods based on alphabetical sampling, B-trees, thematic division, and others.

When constructing an inverted index (data structure) for searching documents, two main approaches are used:

1) accurate to the document (US2004030686A1 “Method and system of searching a database of records” (CARDNO ANDREW JOHN; MULGAN NICHOLAS JOHN, 02.12.2004), [3]; US2009030892 “SYSTEM OF EFFECTIVELY SEARCHING TEXT FOR KEYWORD, AND METHOD THEREOF” (IBM, 01/29/2009), [4]). In the indicated analogues [3, 4], each index word is associated with a list of documents in which this word occurs. Full-text processing of each document is then carried out using all keywords.

2) accurate to the word (US5696963 “System, method and computer program product for searching through an individual document and a group of documents” (SMARTPATENTS INC, 12/09/1997), [5]; US2004177064 “Selecting effective keywords for database searches” (IBM, 09.09.2004), [6]). In the indicated analogues [5, 6], each index word is associated with a list of documents where this word occurs, and the position of the word in the document is also indicated.

In both approaches, the proposed data structures can be supplemented with various parameters (frequency, semantic and other characteristics of words and documents). Lists of documents and word positions are compiled and these lists are analyzed for the intersection of the lists and the formation of a new list in which all keywords appear. As a prototype for the proposed method, an approach is adopted that is accurate to the document [4].
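The two indexing approaches described above can be illustrated with a minimal sketch. The toy corpus and the names `docs`, `build_doc_index`, `build_positional_index` and `intersect` are hypothetical and are not taken from the cited analogues; the sketch only shows the difference between an index accurate to the document and an index accurate to the word, and the list intersection both rely on.

```python
from collections import defaultdict

# Toy corpus (hypothetical): document number -> text.
docs = {
    1: "vitamin c in citrus fruit",
    2: "citrus trees and fruit yield",
    3: "vitamin d and sunlight",
}

def build_doc_index(docs):
    """Approach 1: each word maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def build_positional_index(docs):
    """Approach 2: each word maps to {document: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word][doc_id].append(pos)
    return index

def intersect(index, words):
    """Documents in which all the given words occur (list intersection)."""
    lists = [set(index[w]) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

doc_index = build_doc_index(docs)
pos_index = build_positional_index(docs)
print(intersect(doc_index, ["citrus", "fruit"]))  # [1, 2]
print(dict(pos_index["vitamin"]))                 # {1: [0], 3: [0]}
```

The positional index stores strictly more data per word, which is the source of the index-size disadvantage discussed for the second option.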

The advantage of the first option is its simplicity and minimal index size.

The disadvantage is the large amount of processing required as the number of analyzed keywords increases, together with high sensitivity to the order in which words are processed. For example, after documents have been selected using a first combination of keywords, additional words may need to be specified, but the search will then be carried out only over the set of documents already found. The order in which keywords are processed therefore matters when analyzing, adjusting and changing a query.

In the second option, for each word, lists of documents where this word occurs are immediately retrieved, as in the first option, and additionally lists of position numbers accurate to the position of the word, the beginning and end of words (or length), after which the operation of analyzing their intersection is performed.

The advantage of the second option is that it is convenient when processing a small number of keywords. It also makes it possible to restore the text, with some loss of quality, if the text is lost.

The disadvantage is the large volume of the stored index, which may exceed the amount of memory occupied by the processed documents; the volume depends on the number of keywords, which needs to be optimized. The complexity of storing and processing variable-length lists also increases, and the volume of processing grows as the number of analyzed keywords and the size of the database increase.

As a prototype for the claimed device, a typical classical memory for writing and reading words was adopted (A. Tanenbaum, T. Austin. Computer Architecture, p. 200, Fig. 3.27, [7]). The operation of RAM is considered using an example consisting of 4 memory cells (registers), each register consisting of 3 bits. In general, memory is not limited either by the width of the registers or by the number of memory cells (registers). There are 8-, 16-, 32- and 64-bit microcircuits with 512, 1024 or more memory cells ([7], p. 204, ref. 3.30). As the memory grows, its principle of operation does not change; the circuit shown in Fig. 3.27 [7] is simply replicated many times, with a corresponding increase in the register width and in the number of inputs/outputs and addresses. Here, each flip-flop stores one bit of information (the figure shows 4 rows of flip-flops, 3 in each row, i.e. in each register). The memory contains four 3-bit words. Each operation reads or writes an entire 3-bit word. The logic circuit has 8 input lines: 3 data inputs, I0, I1, I2; 2 address inputs, A0 and A1; and 3 control inputs, CS (Chip Select, selection of a memory element), RD (ReaD, this signal distinguishes reading from writing) and OE (Output Enable, permission to issue output signals). It also has 3 data output lines, O0, O1 and O2. The state of the address inputs determines which of the four memory words is allowed to input or output a value. The logic of the process is such that the binary string <I0, I1, I2> (where each Ii is 0 or 1, i = 0, 1, 2) is written to the memory cell (register) at the address given by the binary string <A0, A1> in accordance with the commands arriving at the inputs <CS, RD, OE>. Likewise, information can be read, in accordance with the commands <CS, RD, OE>, from the register specified by the address <A0, A1> into the output binary string <O0, O1, O2>.

Let us represent the logic diagram of Fig. 3.27 [7] in the generalized form of Fig. 13, leaving only the elements that are important for understanding: external pins (inputs, outputs, addresses, control signals) and memory flip-flops.

The prototype (Fig. 3.27 [7]) does not allow binary strings to be written by bit position number in the memory cells (registers) of Fig. 13 and 14 (in effect, by the columns St.0, St.1, St.2 of Fig. 14).

The technical problem being solved is the ability to process document keywords and present the processing results in a visual and concise form - accurate to the pages of documents. This allows you to quickly, without reading documents from the database, assess the document’s compliance with the formulated request. In this case, the user will make the final decision within the analysis of the found page independently or interactively add/change keywords and see the results of the changes on the screen.

The technical result provided by the claimed technical solutions is a hardware implementation of the process of representing an ordered set of lists of fixed-length binary strings <Str_1, Str_2, ..., Str_k> in tabular form, where each string Str_i is a table column. The ability to read and analyze the table row by row is provided by the register mapping <R_1, R_2, ..., R_n>, which is also a set of binary strings, where the number j of each register R_j corresponds to the j-th bit position in the lists <Str_1, Str_2, ..., Str_k>. This eliminates the operation of intersecting the lists <Str_1, Str_2, ..., Str_k> related to the various keywords: each i-th list Str_i is associated with the i-th keyword, and the j-th bit number in the string Str_i reflects the document/page number, so the entire register R_j refers to one j-th document/page, which speeds up the document/page selection process. The technical result of the method is the proposed data structure, including a table of keywords (index dictionary), a table of documents, a table of binary lists and a table of binary document strings. The proposed data structure makes it possible to search for documents using sets of keywords with page accuracy, without reading the documents themselves from the database, and to present the search result in the form of an Answer Table, which facilitates the procedure for selecting documents in accordance with specified criteria.

Also technical results are:

- When searching for documents, there is no need to read the documents themselves from the database; in this case, only pages that exactly meet the requirements of the request are selected for viewing without reading the documents;

- The ability to interact with the system interactively and to cyclically refine or change the request by adding and excluding keywords while analyzing pages; in the process, documents are added to or excluded from the list of processed documents, and words are added to or excluded from the list of keywords used for the document search;

- The document search process is highly scalable and parallelizable;

- The method allows you to determine the priority in the list of keywords and assign importance to each keyword, the system can clarify and offer options based on the collected statistics, including personal ones for building individual user models;

- Setting various sets on a set of keywords {Si} (for example, a set of vitamins, acids, trees, etc.) in order to clarify the meaning of documents and search for additional answer options;

- Fixing the length of binary strings for document page numbers at 128 bytes allows you to abandon sorting lists of pages and implement a simple mechanism for visual analysis of document pages;

- Fixing the length of binary strings at 128 kB or more for large databases (more than tens of millions of documents) when implementing DM memory makes it possible to avoid sorting lists of documents. In this case, the length of the binary string must correspond to the maximum number of documents in the database, because the bit number in the binary string must correspond to the document number in the database. For example, a 128 kB binary string corresponds to about 1 million documents (1,048,576 bits);

- Possibility of parallel processing of sets (tens and hundreds) of keywords;

- Free viewing of the results of processing keywords in any sequence - the sequence specifies the individual importance of keywords for the user and is determined by a simple rearrangement of columns in the response table, which allows you to take into account these priorities when selecting pages and, if necessary, discard less important words;

- The final answer tables selected by the user are a convenient tool for constructing applied formal algorithms for document analysis, constructing various metrics and document classification spaces, and constructing personal models of the user’s information space;

- The accuracy of processing is limited to the page of the document, while the page will be displayed to the user for his final assessment regarding its relevance; in addition, if necessary, you can easily supplement the processing of keywords in the amount of one page.

The essence of the claimed method is that the user forms a request and specifies keywords and logical operations with them. Then the keywords in the query are selected in the query keyword processing block, the inverted index is used with programs for the formation, development and maintenance of the inverted index, while the mentioned programs interact with the indexing block, which interacts with the database, programs for selecting lists by keywords, characterized in that:

- programs generate auxiliary tables: keywords, documents, binary strings of document numbers and binary strings that complement the inverted index, while

- the keyword table contains a list of keywords; each keyword has a link to a binary string of documents and a link to a row of the document table, and additional information is also indicated here: the number of documents where the keyword is used and other data about the keyword: term, abbreviation, set or element of a set;

- the document table contains lists of pairs associated with a keyword, the pair is the number of the document in which the keyword is used and a link to the binary string in the binary string tables;

- the table of binary strings contains a list of binary strings of a fixed length, each string is associated with a document number from the document table, with each bit of the binary string corresponding to one page of document text, the bit number in the string corresponds to the page number of this document, and a 1 or 0 at a given position indicates the presence or absence of the given keyword on the given page;

- a table of binary document strings consists of many binary strings of a fixed length, with each bit number in the binary string corresponding to a document number, and 1 in this bit means that the specified keyword occurs in the document with this number;

- after selecting lists by keywords from the inverted index, lists of documents represented as binary strings are processed in the resulting list processing block; the binary strings are loaded for processing into double memory, in which the numerical value of a row is 2^n - 1 (all n bits set), where n is the width of the memory register, if all the specified keywords occur in the document whose number equals the row number in double memory; this makes it possible not to resort to the operation of intersecting lists, while

- in the block for processing document pages based on keywords, a response table is formed for each document by loading binary strings from the table of binary strings into double memory, sorting and analyzing them; in the response table each column corresponds to its own keyword, the columns are ordered according to their importance, and each keyword column corresponds to a binary string of the binary string table; the response table makes it possible to reflect the keywords contained in the document with page accuracy without loading the document itself from the database.

The essence of the claimed device is that dual memory for organizing document searches in applied unstructured databases is a logical memory circuit that provides the ability to write data at a given memory cell address, as well as the ability to read the recorded data from memory cells at a given cell address and to output the read data to the output lines in accordance with the control signals CS, RD, OE and CM, characterized in that it:

- contains a second data input channel that ensures that bit strings are written to a given bit (column number) of a memory cell in all memory cells simultaneously, while the length of the bit string is equal to the number of memory cells, and the number and width of memory cells is limited by the technological capabilities of microelectronics;

- contains a CM input, configured to switch the operation of the circuit from a regular memory mode - writing input data to memory cells, to a dual memory mode - writing input data to a given bit of each memory cell;

- contains a block for addressing bit numbers in memory cells (column numbers), which specifies the bit number, common to all memory cells, into which data is written;

- after the cell addressing block, a switch is installed that passes or blocks the memory cell address signal depending on the control signal CM;

- contains logical switches installed at all inputs of memory flip-flops to switch the channel for receiving addresses from the addresses of memory cells to the addresses of bits in memory cells (columns), depending on the control signal CM;

- contains logical switches installed at all data inputs of memory flip-flops to switch data acquisition channels depending on the CM control signal;

- the output of the write/read control block CS, RD, OE is connected to the output of the bit number selection block (column address).
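The behaviour of the dual memory described above can be modelled in software. The following is a minimal sketch under stated assumptions, not the claimed circuit: the class `DualMemory` and its `write`/`read` methods are hypothetical names, the `cm` argument stands in for the CM control signal, and the CS, RD and OE signals are not modelled. It shows a 4 x 4 memory in which keyword bit strings are written as columns and then read back as rows.

```python
class DualMemory:
    """Toy model of a memory with `cells` cells of `width` bits each.

    cm=0: conventional mode, a whole word is written to one cell (row).
    cm=1: dual mode, a bit string is written into one bit position
          (column) of all cells at once.
    """

    def __init__(self, cells, width):
        self.cells = cells
        self.width = width
        self.rows = [[0] * width for _ in range(cells)]

    def write(self, cm, address, bits):
        if cm == 0:
            # Conventional write: `address` selects a cell, `bits` is one word.
            if len(bits) != self.width:
                raise ValueError("word length must equal the cell width")
            self.rows[address] = list(bits)
        else:
            # Dual write: `address` selects a column, `bits` has one bit per cell.
            if len(bits) != self.cells:
                raise ValueError("bit string length must equal the number of cells")
            for row, bit in zip(self.rows, bits):
                row[address] = bit

    def read(self, address):
        """Read one cell (row) as a list of bits; reading is always row-wise."""
        return list(self.rows[address])


dm = DualMemory(cells=4, width=4)
# Column j holds the bit string of keyword j; cell i describes page/document i.
dm.write(cm=1, address=0, bits=[1, 0, 1, 1])
dm.write(cm=1, address=1, bits=[1, 1, 1, 0])
dm.write(cm=1, address=2, bits=[0, 1, 1, 1])
dm.write(cm=1, address=3, bits=[1, 1, 1, 1])

# A row filled with ones means that all keywords occur in that page/document.
full = [i for i in range(dm.cells) if all(dm.read(i))]
print(full)  # [2]
```

In this model the list intersection is replaced by a row read: a cell whose bits are all 1 identifies a page or document containing every keyword.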

Brief description of the drawings.

Fig. 1 shows a generalized diagram of the method for organizing document searches in large databases;

Fig. 2 - data structure of the work tables;

Fig. 3 - binary string;

Fig. 4 - example of a document response table; the columns contain the binary strings of document numbers ASD_i;

Fig. 5 - example of a page response table; the columns contain the strings Str_i corresponding to the keywords;

Fig. 6 - example of a compacted and ordered response table;

Fig. 7 - example of a response table, £=0.2;

Fig. 8 - example of a fragment of the analyzed text;

Fig. 9 - block diagram of the stages of processing the analyzed text of example 1;

Fig. 10 - filling in the tables of example 1;

Fig. 11 - general view of the conditional response table;

Fig. 12 - example of a table of six binary columns, each 12 bits long;

Fig. 13 - logical block diagram of a 4 x 3 memory;

Fig. 14 - generalized block diagram of the 4 x 3 memory shown in Fig. 13;

Fig. 15 - generalized block diagram of a memory circuit for g memory cells with a width of n bits;

Fig. 16 - general block diagram of a 4 x 4 DM memory circuit;

Fig. 17 - generalized block diagram of a DM memory circuit for g memory cells with a width of n bits;

Fig. 18 - example of the implementation of a 4 x 4 DM memory logic circuit.

List of symbols of figs. 13 - 17:

I0, I1, I2, I3 - input data for writing to registers;

A0, A1 - two inputs for addressing memory cells;

CS - memory element selection;

RD - read (allows you to distinguish reading from writing);

OE - permission to issue data;

J0, J1, J2, J3 - input data for writing columns;

CL0, CL1 - two inputs for addressing columns;

St.0, St.1, St.2, St.3 - columns (bit numbers in the memory cell);

O0, O1, O2, O3 - output data for reading from memory cells.

Data output enable: CS, RD, OE.

Implementation of the method.

The method of organizing a search for documents in large databases (Fig. 1) includes the following typical steps:

- User (4), who is an interested consumer of information, generates a query (5) in a language close to SQL, while the user specifies keywords and logical operations with them, and the query is mainly limited to a simple listing of words;

- In the query keyword processing block (6), keywords in the query are selected.

- Programs for the formation, development and maintenance of the inverted index form an inverted index (3); these programs interact with the indexing block (2), which interacts with the database (1). Lists of documents are selected by keywords (7), and the keyword lists are intersected in order to find the set of documents in which the given keywords appear simultaneously and the given logical conditions over them are fulfilled; the result is the resulting list. This list of documents is already an intermediate search result and can be displayed to the user. Current search engines supplement the processing of this list with a full-text search of all documents in the list, which allows the matching text to be highlighted in color when the documents are displayed on the screen, making them easier for the user to view.

As noted, blocks one through seven of the block diagram in FIG. 1 provide the typical functions for constructing and using inverted indexes in existing information retrieval systems.

- In the block for processing the resulting list of documents (8), the address of the binary string is read from the resulting list by document number from the Document Table (12 Fig. 2), and then the binary document string itself from the Binary String Table (13 Fig. 2) for each key word.

- The block for processing document pages by keywords (9) generates a response table for each document (Fig. 4) by loading binary strings into double memory, hereinafter DM (Fig. 17), and sorting and analyzing them in accordance with the parameters specified to loosen or tighten the correspondence of responses to the request. The user can interactively select and view document pages and change the query, the keyword importance and other page selection parameters.

- A final list of selected documents (10) is displayed as a response to the submitted request.

In existing search systems, lists of documents are selected using keywords (block 7, Fig. 1). Next, a programmatic operation is carried out to intersect these lists to find the resulting list of documents in which all the keywords were present in each document.

In the proposed version, the operation of intersecting lists of documents can be carried out in DM memory. To do this, the lists of documents must be represented as binary lists (Fig. 3), in the same way as the binary lists of pages Str_i, only of a larger size (millions of bits). This requires a much larger DM memory, which is reflected in its cost. For example, a binary string length of 128 bytes is sufficient to process binary lists of pages, whereas processing binary lists of documents requires a string of about 128 KB or more in order to load a list of 1 million documents at once. In this case, the bit number in the binary string corresponds to the number of the document in which the keyword occurred. Processing a list of, for example, 14 million documents then takes 14 cyclic loads of DM memory. All operations performed in DM memory are completely identical to the operations with lists of document pages. As is currently done in existing systems, a document is selected from the Document Response Table (15) and added to the resulting list of documents in block 7 of FIG. 1 when all keywords are present in its row; in FIG. 4, for example, these are documents 8 and 11 in rows 8 and 11.
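The sizing arithmetic and the cyclic loading described above can be sketched in software. This is an illustration under assumptions, not the hardware procedure: a bitwise AND over Python integers stands in for reading all-ones rows from DM memory, document numbers are taken as 0-based bit numbers, and the names `BITS_PER_LOAD`, `loads_needed` and `documents_with_all_keywords` are hypothetical.

```python
import math

STRING_BYTES = 128 * 1024          # one DM load: 128 KB
BITS_PER_LOAD = STRING_BYTES * 8   # 1,048,576 document numbers per load

def loads_needed(total_documents, bits_per_load=BITS_PER_LOAD):
    """Number of cyclic DM loads needed to cover the whole database."""
    return math.ceil(total_documents / bits_per_load)

def documents_with_all_keywords(bitstrings, total_documents,
                                bits_per_load=BITS_PER_LOAD):
    """Yield document numbers (0-based) whose bit is 1 in every keyword string.

    Each element of `bitstrings` is a Python int used as a bit set:
    bit d is 1 if the keyword occurs in document d.
    """
    mask = (1 << bits_per_load) - 1
    for load in range(loads_needed(total_documents, bits_per_load)):
        offset = load * bits_per_load
        chunk = mask
        for bs in bitstrings:
            chunk &= bs >> offset    # AND of this load's slice of every string
        while chunk:
            low = chunk & -chunk     # lowest set bit
            yield offset + low.bit_length() - 1
            chunk ^= low

print(loads_needed(14_000_000))  # 14
```

With 1,048,576 bits per load, 14 million documents need 14 loads, which matches the figure given in the text.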

The database (1) is a repository of all accumulated documents. The format and structure of the data are determined by the selected standard data management system. The database is accompanied by all the necessary software: tools for writing documents, manipulating data and reading data, and languages for working with data (data description, data manipulation, query).

Programs for the formation, development and maintenance of the inverted index (3) carry out recording, reading, sorting, making changes, creating a working data structure and generating working vectors of keywords (pointers to documents), dividing or merging indexes, including or excluding words, analysis, etc. The proposed method is focused, first of all, on searching printed publications for factual information in professional libraries (for example, biology, genetic engineering, pharmacology - information on drugs, the mechanisms of their effect on the body, methods of treating diseases, etc.).

Let Q = {D_j} be the set of documents D_j, where j = 1, ..., m. Here m is the number of documents, for example, 700,000 documents or more. A document is a book, journal, article, archival document, protocol, research result, etc., i.e. any printed publication convenient for processing and display on a computer screen.

Unlike the prototype [2], the proposed method includes the following data structure, which is used in blocks 8, 9 and 10: tables of keywords (11), tables of documents (12), tables of binary strings (13) and tables of binary strings of documents (14). The structure of the mentioned tables is shown in Fig. 2.

The keyword table (11) contains a list of r rows. Each row contains:

- the keyword S_i, i = 1, ..., r,

- the number C_j, j = 1, ..., k, indicating the number of documents in which this keyword occurs,

- the link address ASD_i, i = 1, ..., n, to a row of the table of binary document strings (14),

- the link address Ad_i, i = 1, ..., n, to a row of the document table (12).

Moreover, there are more words than links. For example, the word “house” and its other word forms (“houses”, etc.) may share one common link.

The table of keywords (11) can be divided into an arbitrary number of sections - subtables. These sections are related to the semantics of keywords: literary words, words of the subject area, abbreviations, synonyms, names, etc. The table of keywords (11) can be supplemented with columns reflecting the properties of words and their various groupings: synonyms, antonyms, various classes (sets of words), thematic terms (mathematics, chemistry, biology, etc.), semantic classes, artificial metrics, etc.

The document table (12) contains t rows {Sp_i}, i = 1, ..., t, where each row Sp_i consists of a list of pairs:

Sp_i = { <Nd_1, Astr_1>, <Nd_2, Astr_2>, <Nd_3, Astr_3>, ..., <Nd_t, Astr_t> },

where Nd_j, j = 1, ..., t, is the number of the document in which the keyword from the keyword table (11) is used;

Astr_j is the address of the link to the binary string in the binary string table (13).

In each row of the document table (12), pairs are ordered in accordance with document numbers Ndj. The Ci number from the keyword table (11) indicates the number of pairs in the document table (12) associated with a given keyword. The document table (12) lists all document numbers that contain keywords from the keyword table (11).

This makes it possible to evaluate the information content of each keyword and to set a cutoff limit for unnecessary keywords or to build new ones. A word that appears in 95% of documents does not make it possible to distinguish effectively between documents and select the required one among them; such a word is classified as a stop word. New keywords can also be formed as combinations of existing keywords.

The binary string table (13) contains a list of n binary strings of fixed length {Str_i}, i = 1, ..., m. The structure of the binary string Str_i is shown in Fig. 3, where each cell is one bit corresponding to one page of document text, the bit number in the string corresponds to the page number of this document, and a 1 or 0 at a given position indicates the presence or absence of the given word on this page.
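The relationships between tables (11) to (14) can be sketched as a small in-memory structure. This is only an illustration under assumptions: the names `KeywordRow`, `Tables` and `add_keyword` are hypothetical, link addresses are modelled as list indexes, bit strings are modelled as Python integers rather than fixed-length strings, and bit number p-1 stands for page or document number p.

```python
from dataclasses import dataclass, field

@dataclass
class KeywordRow:      # one row of the keyword table (11)
    keyword: str
    doc_count: int     # C: number of documents containing the keyword
    asd: int           # link to a row of the binary document strings table (14)
    ad: int            # link to a row of the document table (12)

@dataclass
class Tables:
    keywords: dict = field(default_factory=dict)      # keyword -> KeywordRow
    documents: list = field(default_factory=list)     # table (12): rows of (Nd, Astr) pairs
    page_strings: list = field(default_factory=list)  # table (13): page bit strings
    doc_strings: list = field(default_factory=list)   # table (14): document bit strings

    def add_keyword(self, keyword, pages_by_doc):
        """pages_by_doc: {document number: iterable of 1-based page numbers}."""
        pairs, doc_bits = [], 0
        for nd in sorted(pages_by_doc):        # pairs ordered by document number
            bits = 0
            for page in pages_by_doc[nd]:
                bits |= 1 << (page - 1)        # bit number = page number
            self.page_strings.append(bits)
            pairs.append((nd, len(self.page_strings) - 1))
            doc_bits |= 1 << (nd - 1)          # bit number = document number
        self.documents.append(pairs)
        self.doc_strings.append(doc_bits)
        self.keywords[keyword] = KeywordRow(
            keyword, len(pairs), len(self.doc_strings) - 1, len(self.documents) - 1)

t = Tables()
t.add_keyword("vitamin", {3: [1, 4], 1: [2]})
row = t.keywords["vitamin"]
print(row.doc_count, t.documents[row.ad])  # 2 [(1, 0), (3, 1)]
```

Looking up a keyword then follows the links in the order used by blocks 8 and 9: keyword row, document table row, and finally the page bit string of each document.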

Let the document D_j have 2 million characters, which corresponds to approximately 1000 pages of text. It will then correspond to a binary string of 1000 bits, stored in 128 bytes. The number of bits in the stored binary string is larger (rounded up to a power of 2), so reserve bits appear. It is assumed that books with a larger number of pages are very rare, and those that are found are divided into volumes (an artificial division is possible).
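The sizing of a page bit string can be checked with a short sketch. The helper names `page_string_bytes` and `page_bitstring` are hypothetical, and rounding the byte length up to a power of two is an assumption made for the illustration; the bit order within a byte is likewise arbitrary.

```python
def page_string_bytes(pages):
    """Smallest power-of-two byte length whose bits cover `pages` pages."""
    size = 1
    while size * 8 < pages:
        size *= 2
    return size

def page_bitstring(pages_with_keyword, pages):
    """Fixed-length bit string: bit p-1 is set if the keyword occurs on page p."""
    buf = bytearray(page_string_bytes(pages))
    for p in pages_with_keyword:
        buf[(p - 1) // 8] |= 1 << ((p - 1) % 8)
    return bytes(buf)

s = page_bitstring([1, 9, 1000], pages=1000)
print(len(s), len(s) * 8 - 1000)  # 128 24
```

For a 1000-page document the string occupies 128 bytes (1024 bits), which leaves 24 reserve bits.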

The proposed data structure is open and can always be supplemented with the parameters necessary for processing, reflecting the frequency, semantic, logical, morphological and other characteristics of words and documents. This is reflected in an increase in the dimension of the tables and/or the inclusion of additional tables reflecting additional characteristics of documents and words. For example, table (11) could be supplemented with links to an additional table of double/triple keywords, abbreviations, etc. The structure can be supplemented with a table of sets/subsets of keywords, for example, a set of vitamins with a list of vitamins, a set of trees with a list of species, etc. Tables (12) and (13) may also include additional parameters, especially since each 128-byte binary string for a 1000-page document has a reserve of 24 bits.

The stages of generating tables (11), (12) and (13) are shown in Fig. 8, 9, 10.

When forming a request (5) (Fig. 1), the user specifies a list of keywords <S_1, S_2, ..., S_k>, ordered by their importance to the user. It is possible to build a classic query language using the logical symbols AND, OR, NOT. Keywords are set by the user himself based on his personal understanding of the required content of the document being sought. The level of importance, i.e. the user's individual perception of the keywords he sets (a numerical value or simply the word order), can be set:

- by a linear or other function that can be calculated by building individual models of the subject area for each user;

- by the user himself, through the order in which the words are listed (entered);

- automatically, based on the frequency characteristics of words.

The degree of importance makes it possible to discard the least important words and focus on combinations of the more important ones. The number of specified keywords is limited only by logic and by the level of user perception; one can conditionally stop at 64 words, although individual applications that process many keywords may require significantly more.

Based on the query for each document, a response table (16) is built (Fig. 5), in which each column has its own keyword.

The columns are ordered according to their importance. Each keyword column corresponds to a binary string in the binary string table (13). Therefore, when generating the response table, the program of block (9) reads, for each document and each keyword, the ready-made fixed-length binary strings from the table of binary strings (13), collects them into the response table (16) of FIG. 5 and arranges them in its columns in the user-specified order of importance. The order is such that, by default, the further to the left a word is located in the response table, the more important it is. The word S_(i+1) is more important than the word S_i (Fig. 5).

Fig. 5 shows an example of a response table (16) for one six-page document and sixteen keywords. Each i-th row of the response table (16) of Fig. 5 reflects all the keywords that were found on the i-th page of the document Dj under consideration and is a sixteen-bit string that is read as an integer in the range from 0 to 65535, i.e. 2¹⁶ − 1.
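As an illustration only, the assembly of response-table rows from per-keyword binary strings can be sketched in Python. The function name `build_response_rows`, the list-of-bits representation and the choice of giving the j-th keyword the bit weight 2^(j−1) are assumptions made for this sketch, not details taken from the described system. Under this convention a row in which every one of n keywords is present evaluates to 2ⁿ − 1 when read as an unsigned integer.

```python
from typing import List


def build_response_rows(columns: List[List[int]], pages: int) -> List[int]:
    """Turn per-keyword page bit strings into one integer per page.

    columns[j][p] is 1 when keyword j+1 occurs on page p+1 (assumed layout).
    Bit j of the returned integer for a page is set when keyword j+1 occurs there.
    """
    rows = []
    for p in range(pages):
        value = 0
        for j, column in enumerate(columns):
            bit = column[p] if p < len(column) else 0  # short strings are zero-padded
            value |= bit << j
        rows.append(value)
    return rows


# Three keywords, three pages (hypothetical data).
columns = [
    [1, 0, 1],  # S1 occurs on pages 1 and 3
    [1, 1, 0],  # S2 occurs on pages 1 and 2
    [1, 0, 0],  # S3 occurs on page 1
]
rows = build_response_rows(columns, pages=3)
print(rows)  # [7, 2, 1]: page 1 contains all three keywords (7 == 2**3 - 1)
```

Reading a row integer therefore answers "which keywords are on this page" without touching the document text.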

These numbers (one bit string per page) are easy to interpret and show which keywords were found on a given page (row). The maximum value corresponds to a row completely filled with ones, i.e. all keywords appeared on this page. This makes it possible to build simple, understandable and adaptive algorithms:

a) View the table and select the pages that match all keywords: the integer value of such a row is 2ⁿ − 1, where n is the number of keywords (65535 for n = 16). A document with more such pages corresponds better to the request, taking into account the given logical constructs. In most requests (up to 98%), users do not resort to logical search conditions but limit themselves to listing keywords.

b) Compress the table for visual analysis of the selected documents. Zero rows and rows below a user-specified cut-off level are excluded from the visualization. The remaining rows can be ordered by the number of ones and by the specified order of importance of the keywords. An example of the transformed response table (16) of Fig. 5 is shown in Fig. 6. Any text processing algorithms can be applied to the page: highlighting words at a given distance, phrases, processing of word permutations, pop-up explanations, synonyms, etc.

c) Select keywords by eliminating minimal combinations of the least important words in order to obtain rows completely filled with ones for the remaining words. For example, if the keywords S1, S6, S8 and S14 are excluded from the response table (16) of Fig. 5, page 1 of the document becomes a complete match for the remaining keywords with a minimal number of exclusions. The corresponding number is 2¹² − 1. Various semantic algorithms can be used that take into account the semantic meanings of words, the features of the subject area of the documents and the user's requirements.

d) Individual viewing settings. If the query contains a large number of keywords (what counts as large is individual, usually more than 20, and depends on the user's field of knowledge and qualifications), the user can specify the cut-off level ε of the pages shown in the response table (16) of Fig. 5. Here

0 < ε < 1, where the number ε reflects how strictly the viewed materials must comply with the user's request.

Pages on which the number of keywords is below the specified level ε × n (the result is rounded) are not included in the resulting response table (16). Here n is the number of keywords in the query. For example, for the response table (16) of Fig. 5 with 16 keywords, the user can specify the cut-off level ε = 0.2 (16 × 0.2 = 3.2, i.e. 3 keywords); the response table (16) of Fig. 5 then looks as shown in Fig. 7.

e) At any stage it is possible to view the results, add new keywords, exclude existing ones and define a different order of importance of the keywords (the order determines the system of preferences and changes the calculated metrics, the classification of documents and the amount of information displayed).
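The page cut-off described above can be sketched as follows. This is a minimal illustration, assuming that each page is represented by its row integer, that the cut-off level is passed as `eps`, and that "rounded" means rounding half up; the name `filter_pages` is hypothetical.

```python
from typing import Dict, List


def filter_pages(rows: List[int], n: int, eps: float) -> Dict[int, int]:
    """Keep pages whose keyword count reaches the rounded threshold eps * n.

    rows[p] is the row integer of page p+1; n is the number of query keywords.
    Returns {page number: row integer} for the pages that survive the cut-off.
    """
    threshold = int(eps * n + 0.5)  # round half up (assumption)
    return {
        page: value
        for page, value in enumerate(rows, start=1)
        if bin(value).count("1") >= threshold
    }


# 16 keywords, cut-off level 0.2 -> 16 * 0.2 = 3.2 -> threshold of 3 keywords.
rows = [0b1111111111111111, 0b0000000000000111, 0b0000000000000011, 0]
kept = filter_pages(rows, n=16, eps=0.2)
print(sorted(kept))  # [1, 2]: pages 3 and 4 hold fewer than 3 keywords
```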

Examples of specific implementation. Example 1. Stages of forming service tables 1, 2 and 3 using the example of a fragment of book text (Fig. 8). A generalized typical document processing scheme with its description and examples is shown in Fig. 9.

1) At the stage of reading the document, all keywords of this document S1, S2, S3, ..., Sn are written to a temporary file W:

File W = {S1 = “INTRODUCTION”, S2 = “Data”, S3 = “foreign”, S3 = “literature”, S4 = “numerous”, S5 = “research”, S6 = “ghrelin”, ...}. Lowercase and uppercase letters are treated as identical.

2) The word is processed according to the requirements for forming the dictionary.

3) Keywords from the temporary file W are written into rows of the keyword table (11). If the word is new, a row is added to the keyword table (11). Next, the information is added to the document table (12) and the binary string table (13).

Row in the keyword table (11) (Fig. 9): {<Introduction,1,1>}. The first 1 is the number of documents in which this word appeared. Since this is the first document loaded, it is set to one. The second 1 is the address of the link to the first list in the document table (12).

4) In the first list Sp1 of the document table (12), the document number (in the document table (12) this is the first document) and the address of the link to the binary string in the binary string table (13) are recorded:

List Sp1 = {<1,1>}. Here the first 1 is the document number, and the second 1 is the number of the first binary string.

5) In the first binary string of the binary string table (13), a 1 is placed in the first bit, which means that this word from the keyword table (11) (Introduction) was found on the first page. In a binary string, each bit number corresponds to a page number in the document (for example, the binary string <1,0,0,0,1,1,0,0,0,...> indicates that the given word occurred on the first, fifth and sixth pages).

6) The next word is read from the list in the temporary file W and processing returns to step 2. The length of the word list in file W is monitored. When all words from file W have been processed, reading of a new document begins, i.e. point 6 is fulfilled.

7) Completion of the list of documents to be processed is monitored. The ones in the first position of all binary strings of the binary string table (13) shown in Fig. 10 indicate that all keywords from the keyword table (11) occur on the first page of the document.

Example 2. The response table (16) of Fig. 5 can be formed column by column in a dual memory (DM) microchip. Each column corresponds to one keyword, and the register address corresponds to the page number of the document. Thus, one DM load makes it possible to process one document from the database (1). The set of keywords of a given subject area can be optimized by setting a keyword cut-off boundary in the general dictionary of the keyword table (11). For example, a limit can be set on the frequency with which a keyword occurs in documents (of the entire database or of a section of the database): no more than 20% of documents. Keywords that occur more frequently are excluded from the keyword table (11).
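The table-building steps 1) to 7) of Example 1 can be sketched in software roughly as follows. This is a simplified illustration under several assumptions: the class name `ServiceTables`, the whitespace tokenizer, the 8-bit string length and the 0-based link indices (the text uses 1-based links) are all choices made for the sketch, and the dictionary requirements of step 2 are reduced to lowercasing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PAGES = 8  # fixed binary-string length; kept small here for readability (assumption)


@dataclass
class ServiceTables:
    # keyword table (11): word -> [number of documents, index of its list in table (12)]
    keywords: Dict[str, List[int]] = field(default_factory=dict)
    # document table (12): one list per keyword of (document number, index into table (13))
    doc_lists: List[List[Tuple[int, int]]] = field(default_factory=list)
    # binary string table (13): bit p set means the word occurs on page p + 1
    bit_strings: List[List[int]] = field(default_factory=list)

    def add_document(self, doc_no: int, pages: List[str]) -> None:
        if len(pages) > PAGES:
            raise ValueError("document has more pages than the fixed string length")
        string_of: Dict[str, int] = {}  # word -> its bit string for this document
        for page_index, text in enumerate(pages):
            for word in text.lower().split():  # case is ignored
                if word not in string_of:
                    if word not in self.keywords:  # new word: new row in table (11)
                        self.doc_lists.append([])
                        self.keywords[word] = [0, len(self.doc_lists) - 1]
                    entry = self.keywords[word]
                    entry[0] += 1  # one more document contains this word
                    self.bit_strings.append([0] * PAGES)
                    string_of[word] = len(self.bit_strings) - 1
                    self.doc_lists[entry[1]].append((doc_no, string_of[word]))
                self.bit_strings[string_of[word]][page_index] = 1


tables = ServiceTables()
tables.add_document(1, ["Introduction data", "data ghrelin", "", "", "introduction"])
print(tables.keywords["introduction"])  # [1, 0]
print(tables.bit_strings[0])            # [1, 0, 0, 0, 1, 0, 0, 0]
```

After loading, a keyword lookup yields the documents that contain the word together with a ready-made page bit string for each of them, so no document text has to be re-read at query time.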

You can set boundaries for selecting documents and displaying pages on the screen. For example, only those pages of the document are displayed on which the share of keywords found exceeds an established limit, for example 80% (if keyword weights are defined, the cut-off limit can be expressed as a number). You can also require that only pages on which all keywords occur (100%) be displayed, i.e. pages with a number equal to 65535 for 16 keywords are selected (for example, Fig. 5).

You can set arbitrary semantic rules: set mandatory and replaceable words, determine the mandatory presence of non-separable combinations of words (for example, “oxygen concentration”, “increased penetration of carbon dioxide”), set the distance (number of characters) between words.

All classic logical operations on the keywords can be implemented: AND, OR, NAND, NOR.
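A minimal sketch of how such logical operations could act on page bit strings is given below, assuming each keyword's binary string is packed into a Python integer with page 1 in the lowest bit. The helper names `to_int` and `page_ops` are hypothetical.

```python
from typing import Dict, List


def to_int(bits: List[int]) -> int:
    """Pack a page bit string into an integer; page 1 is the lowest bit."""
    return sum(bit << i for i, bit in enumerate(bits))


def page_ops(a: int, b: int, width: int) -> Dict[str, int]:
    """Apply AND, OR, NAND and NOR page-wise to two keyword bit strings."""
    mask = (1 << width) - 1  # keep results within the fixed string length
    return {
        "AND": a & b,
        "OR": a | b,
        "NAND": ~(a & b) & mask,
        "NOR": ~(a | b) & mask,
    }


# Keyword A on pages 1 and 3, keyword B on pages 1 and 2, document of 4 pages.
a = to_int([1, 0, 1, 0])
b = to_int([1, 1, 0, 0])
result = page_ops(a, b, width=4)
print(result)  # {'AND': 1, 'OR': 7, 'NAND': 14, 'NOR': 8}
```

Here `AND` marks page 1 (both keywords present) and `NOR` marks page 4 (neither keyword present).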

Example 3. The result of document processing is displayed in the form of the response table (16) of Fig. 6. Pages on which no keywords were found are not listed. The pages in the table are ordered by the number of keywords found; pages with the same number of keywords are listed in order of the importance of the keywords, and then in ascending order of page numbers.
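The ordering rule of Example 3 can be expressed as a single sort key. The sketch below assumes that each page is given by its row integer and that more important keywords occupy higher bits, so that among pages with the same keyword count a larger integer means more important keywords; `order_pages` is a hypothetical name.

```python
from typing import List, Tuple


def order_pages(rows: List[int]) -> List[Tuple[int, int]]:
    """Return (page number, row integer) pairs in display order.

    Pages without keywords are dropped; the rest are sorted by keyword count
    (descending), then by row integer (descending), then by page number.
    """
    pages = [(page, value) for page, value in enumerate(rows, start=1) if value != 0]
    return sorted(pages, key=lambda pv: (-bin(pv[1]).count("1"), -pv[1], pv[0]))


rows = [0b0011, 0, 0b1100, 0b1111, 0b0011]
ordered = order_pages(rows)
print([page for page, _ in ordered])  # [4, 3, 1, 5]
```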

This gives a clear overall picture of how well the document matches the formulated request. When the cursor is placed on:

- a column index (Si), the full name of the keyword appears on the screen;

- a page number, the entire text of this page appears on the screen, and the user can view it to see whether it meets his needs.

All keywords on the page displayed on the screen are highlighted in color, and the color reflects the importance of the words. The color tone of a word is taken from the word frequency color spectrum (blue represents low importance and red represents the most important word).

Additional individual access settings. The user can specify the cut-off level ε of the pages shown in the response table (16) of Fig. 5. Here

0 < ε < 1

The number ε reflects how strictly the viewed materials must comply with the user's request. Pages on which the number of keywords is below the given level ε × n (the result is rounded) are not included in the resulting response table (16) of Fig. 5. Here n is the number of keywords in the query. For example, for the table of Fig. 5 with 16 keywords, the user can specify the cut-off level ε = 0.2 (16 × 0.2 = 3.2, i.e. 3 keywords); the response table (16) of Fig. 5 then looks as shown in Fig. 7.

For additional analysis of information on the found page, you can use all traditional text processing algorithms (elements of semantic analysis).

The found document page, selected in accordance with the selection criteria, is displayed on the screen so that the user can evaluate it and consider possible refinements: adding keywords, excluding specified ones, or formulating new variants of word combinations.

Essentially, the response table (16) is the basis for the design of the DM memory logic circuit.

Implementation of the device.

Double Memory (hereinafter referred to as DM) makes it possible to process independent lists of binary strings Str1, Str2, ..., Strn. For clarity, the binary strings are presented in the form of a conditional table (17) (Fig. 11), where the columns are binary strings and the rows are the bit numbers in the binary strings. An operation is performed to recognize the bit level (table row) of the intersection of the binary strings.

Here the binary string Stri = <01001000001...1>. The length of the string is m bits. In the presented table, the number of rows is determined by the number of bits in the longest string from the processed list of strings; shorter strings are padded with zero bits.

Consider the i-th row of this table. It shows which columns contain ones at the i-th bit position and which columns contain zeros. For example, in row 11 all columns contain ones in the 11th position (assuming that columns 6 to n-1 also contain ones). The integer (decimal) representation of a row makes it possible to "encode" the entire binary row. For the eleventh row, this number is equal to 2ⁿ − 1.

An example of the same table (17), formed from 6 columns 12 bits in length, is shown in Fig. 12. For the same eleventh row (with all ones in the row), the integer value is 2⁶ − 1 = 63. The integer value of the third row is 4.

Thus, by reading a table row as the resulting integer, it is possible to know exactly in which positions the ones are located. Moreover, by specifying a certain order of the columns in the table, the user sets a conditional degree of importance of the columns, which is indicated by the symbol. The column with number (j+1) is then interpreted as more important than the column with number j, i.e. the column Strj+1 is more important than the column Strj. This allows the use of semantic algorithms for preliminary evaluation of pages in the claimed method.
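Recovering the positions of the ones from a row integer is a simple bit test. The sketch below assumes that column j carries the bit weight 2^(j−1), so that higher-numbered (more important) columns occupy higher bits; the function name `columns_with_ones` is hypothetical.

```python
from typing import List


def columns_with_ones(value: int, width: int) -> List[int]:
    """Return the 1-based column numbers whose bit is set in a row integer."""
    return [j + 1 for j in range(width) if (value >> j) & 1]


print(columns_with_ones(4, width=6))   # [3]: only column 3 holds a one
print(columns_with_ones(63, width=6))  # [1, 2, 3, 4, 5, 6]: a row of all ones
```

Because more important columns sit in higher bits under this convention, comparing two row integers with the same number of ones also compares the importance of the keywords they contain.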

The logical block diagram of a 4 x 3 memory is known from the prototype [7] (p. 200) and is shown in Fig. 13. A generalized diagram of the same classical memory, consisting of 4 registers (memory cells) of 3 bits each, is shown in Fig. 14. Real circuits are built in the same way, the only difference being that the width of the memory cells (registers) can be 8, 16, 32 or 64 bits and the number of memory cells can be hundreds of thousands or more (the circuit is scaled up many times). For understanding the logic of all the processes, however, the examples and generalizations presented are sufficient.

Below we present a generalized memory diagram. It makes it possible not to focus on the element-level implementation of triggers and switching elements, which can be implemented in many ways depending on the preferences of the developer and the technology used, while retaining the logic of all basic memory functions: writing a string to, and reading it from, a memory cell (register) at a given memory cell address.

In the generalized diagram of Fig. 14, only the important elements are shown: 12 memory triggers (flip-flops), each storing 1 bit of information, the inputs/outputs and the control signals. Each trigger can be in one of two states, 1 or 0. The triggers are arranged in 4 rows (registers) with 3 triggers in each row. The input information, arriving in the form of binary strings, is written into memory cells (registers). The registers are numbered, and their numbers are called the addresses of the memory cells (registers). The logic of the process is as follows: the binary string <I0, I1, I2> (where each Ii is 0 or 1, i = 0, 1, 2) is written to the register whose address is given by the binary string <A0, A1>, in accordance with the commands received at the inputs <CS, RD, OE>. Likewise, information can be read, in accordance with the commands <CS, RD, OE>, from the register specified by the address <A0, A1> to the output binary string <O0, O1, O2>.

Fig. 15 shows a generalized logic diagram of a classical memory with r memory cells (registers), where each register consists of n bits, intended for recording the input signals I0, I1, ..., In and reading the output signals O0, O1, ..., On.

Examples of RAM chips are known from analogue [7] in Fig. 3.30 p. 204.

Fig. 16 shows the proposed generalized block diagram of DM memory for a 4 x 4 example. Each horizontal row consists of 4 triggers that make up one word. Four memory cells (registers, i.e. 4 words) are shown.

Unlike the prototype [7], DM can work like classical memory, writing and reading words by registers (by their addresses), and additionally makes it possible to write binary strings Str1, Str2, ..., Str4 supplied sequentially to the inputs <J0, J1, J2, J3>. Here the binary string Stri = <J0, J1, J2, J3>i, where i = 1, ..., 4, is written into the i-th column, whose address from the list of addresses St.0, St.1, St.2, St.3 is indicated at the column address inputs <CL0, CL1>. The information is read in the standard way, by registers (words), to the outputs <O0, O1, O2, O3>. The CM (Change Memory) input switches the memory operating mode from normal to DM. In conventional memory mode, the memory operates in accordance with the classical memory circuit of Fig. 13.

An example of a generalized logic circuit of DM memory with r registers of n bits each is shown in Fig. 17. An additional data input J0, J1, ..., Jm is shown, where m < r. To make the logic circuit easier to understand, consider the table of Fig. 11, where the rows correspond to memory registers and the columns correspond to binary strings. Of course, DM, just like regular memory, can be implemented with registers of any width (16, 32, 64 bits or more) and with any number of memory cells (registers), from 1024 (as in the example with pages) to hundreds of millions for processing large lists of documents.

Description of work.

An example of a 4 x 4 DM memory is shown in Fig. 18, and its generalized form is shown in Fig. 17. The CM input switches the operation of the circuit from regular memory mode to DM mode. In normal memory mode (1 at the CM output), the logic circuit operates in the same way as the circuit shown in Fig. 13. In the supplemented DM memory mode (0 at the CM output), the logic circuit switches to receiving data from the inputs J0, J1, J2, J3 (see Fig. 18 and Fig. 17). The logic diagram of Fig. 13 is supplemented with the following (see Fig. 18):

- a column addressing block CL0 and CL1, which specifies the current address of the column into which the data J0, J1, J2, J3 will be written;

- drivers installed at the outputs of the AND logic elements in the register addressing block A0 and A1, which, depending on the control signal CM, pass the register address signal (CM = 1) or block it (CM = 0);

- multiplexers installed at the C input (the synchronizing signal that opens the trigger data input) of all memory triggers. Depending on the control signal CM, the multiplexers switch the channel for receiving addresses from the register addresses to the column addresses (points T0, T1, T2, T3);

- multiplexers installed at the D data inputs of all memory triggers. The control signal CM switches the channels for receiving data from I0, I1, I2, I3 (at CM = 1) to J0, J1, J2, J3 (at CM = 0) at the D inputs of all memory triggers;

- the output of the AND element after the write/read control block (CS, RD, OE), which is extended via point T4 to the column address selection block.

When the CM input is switched to DM memory mode (0 at the output), the columns are addressed using the address lines CL0 and CL1 in the upper part of the circuit, similarly to the addressing of registers, and the input lines I0, I1, I2, I3 are blocked by the CM signal at the multiplexers installed in front of the D signal input of each trigger. To select a memory column, the external logic must set the CS signal to 1 and set the RD signal to 1 for reading or 0 for writing. The column address lines must indicate into which of the four 4-bit columns the information is to be written. When reading, the data input lines are not used. When writing, the bits present on the data input lines J0, J1, J2, J3 are loaded into the selected memory column; the output lines are not used.

The memory shown in Fig. 18 works as follows. The four AND gates for selecting columns at the top of the circuit form a decoder. The input inverters are arranged so that each gate is triggered by a specific address. Each gate drives a column selection line. When the microcircuit must write, the vertical line CS·¬RD takes the value 1, enabling one of the four write gates (points T0, T1, T2, T3). The choice of gate depends on which column selection line is 1. The output of the write gate drives all C signals (flip-flop inputs) of the selected column, loading the input data into that column's flip-flops. Writing is performed only if the CS signal is 1 and RD is 0, and only the column selected by the addresses CL0 and CL1 is written.

The reading process is the same as the standard reading process shown in Fig. 13: CM switches to normal memory mode (1 at the output), the register address (A0, A1) is specified, and the CS·¬RD line takes the value 0, so all write gates are blocked and none of the flip-flops changes. Instead, the word selection line enables the AND gates connected to the Q bits (flip-flop outputs) of the selected word. Thus, the selected word passes its data to the 4-input OR gates located at the bottom of the circuit, while the remaining three words output 0. The output of the OR gates is therefore identical to the value stored in the selected word; the remaining three words have no effect on the output.

Thus, after the columns have been written to the DM memory, the information stored in the DM registers can be read in the usual way, i.e. by reading the contents of the registers at the register addresses. As a result, the contents of the i-th register show the entire i-th position of the binary lists written in the columns, and the integer value of the register contains all this information in compressed form, which fully corresponds to the response table (16) of Fig. 5 for an example of four keywords and four document pages.
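The write-by-column, read-by-register behavior can be modeled in software as a rough behavioral sketch. It does not reproduce the gate-level circuit or the CS, RD, OE signalling; the class name `DualMemory`, its method names and the mapping of column j to bit j of a register are assumptions made for illustration.

```python
from typing import List


class DualMemory:
    """Behavioral model: registers are rows, bit positions are columns."""

    def __init__(self, registers: int, width: int) -> None:
        self.registers = registers
        self.width = width
        self.cells = [0] * registers

    def write_word(self, address: int, value: int) -> None:
        """Conventional mode (CM = 1): write a whole word to one register."""
        self.cells[address] = value & ((1 << self.width) - 1)

    def write_column(self, column: int, bits: List[int]) -> None:
        """DM mode (CM = 0): write one bit into the same position of every register."""
        if not 0 <= column < self.width:
            raise ValueError("column address out of range")
        if len(bits) != self.registers:
            raise ValueError("bit string length must equal the number of registers")
        for address, bit in enumerate(bits):
            if bit:
                self.cells[address] |= 1 << column
            else:
                self.cells[address] &= ~(1 << column)

    def read_word(self, address: int) -> int:
        """Read a register in the standard way."""
        return self.cells[address]


# Four keywords (columns) and four pages (registers); hypothetical data.
dm = DualMemory(registers=4, width=4)
keyword_strings = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1], [1, 0, 0, 0]]
for column, bits in enumerate(keyword_strings):
    dm.write_column(column, bits)
words = [dm.read_word(address) for address in range(4)]
print(words)  # [15, 2, 1, 4]: page 1 holds all four keywords
```

In this model the transposition from keyword strings to page rows happens as a side effect of the column writes, which mirrors why no separate list intersection step is needed afterwards.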

DM memory can be manufactured in various versions. As an optimal option, it can contain 1024 registers (words) by 64 bits (columns - the number of keywords). The memory can be installed stand-alone or as a cache memory with the processor, which will speed up the algorithm.

Logic circuit modifications:

- use of the same channels for the input data written to registers (for example, I0, I1, I2, I3) and for the input data written to columns (for example, J0, J1, J2, J3), because these channels are not used at the same time;

- increasing the width of the registers to work with a large number of keywords: 1024 registers of 128 bits each;

- for special applications, an arbitrary number of registers (tens of millions) for working with a large number of documents;

- construction of several input ports (register groups) for parallel work with several lists.

The application therefore proposes a page-accurate search approach. In addition to the lists of documents in which the keyword occurs, binary lists of document strings are formed (Fig. 3), and their processing replaces the stage of full-text analysis of the document. Document parsing has been replaced by paging through the Response Table of FIG. 4. The technical problem is solved due to the fact that:

- binary lists of document pages are lists of fixed length, which is convenient for processing;

- all pages of a document for a given keyword are presented in one binary string;

- hardware implementation of DM memory allows you to eliminate the operations of sorting pages of various keywords;

- DM memory allows you to simultaneously process many (dozens) of keywords (binary lists) for their intersection. DM provides fast conversion of multiple binary lists of given keywords into a page-by-page display of a document (each memory register into a document page) for quickly viewing and selecting document pages with given keywords without performing an operation to intersect the lists;

- the applied interpretation of binary strings is not limited in any way. The result can be read in the form of integers (line of the Answer Table in Fig. 4), depending on: binary values of numbers located in the same positions of binary strings, the sequence of writing binary strings into DM memory (the order of keywords in the request);

- viewing the found page does not burden the user, because of its limited volume;

- in the search algorithm there is no need to build various complex logical structures, such as the distance between keywords, the order in which keywords are mentioned, their use within a sentence, paragraph, etc. All keywords on the page are highlighted in the background and this is enough for the user to evaluate it in detail.

Claims

Claim

1. A method for organizing a search for documents in application databases of unstructured data includes the formation of a query by the user, while the user specifies keywords and logical operations with them, the selection of keywords in the query in the block for processing query keywords, the use of an inverted index with programs for the formation, development and maintenance operation of an inverted index, while the mentioned programs interact with the indexing unit, which interacts with the database, programs for selecting lists by keywords, characterized in that:

- the table of binary strings contains a list of binary strings of a fixed length, each line is associated with a document number from the document table, with each bit of the binary string corresponding to one page of document text, the bit number in the string corresponds to the page number of this document, and 1 or 0 standing at a certain position, indicate the presence or absence of a given keyword on a given page;

- after selecting lists by keywords from the inverted index, they process lists of documents presented in the form of binary strings in the resulting list processing block, while the binary strings are loaded for processing into double memory, in which the numerical value of the string is 2ⁿ − 1, where n is the capacity of the memory register, if all the specified keywords occur in a document with a number equal to the line number in double memory, this allows you not to resort to the operation of crossing lists, while

2. Double memory for organizing document searches in applied unstructured databases according to claim 1, which is a logical memory circuit that provides the ability to write data at a given memory cell address, as well as the ability to read written data from memory cells at a given cell address and output the read data to the output lines in accordance with the control signals CS, RD, OE and CM, characterized in that:

- contains a second data input channel that ensures that bit strings are written to a given bit of a memory cell in all memory cells simultaneously, while the length of the bit string is equal to the number of memory cells, and the number and width of memory cells is limited by the technological capabilities of microelectronics;

- contains a CM input, configured to switch the operation of the circuit from a regular memory mode, in which input data is written to memory cells, to a dual memory mode, in which input data is written to a given bit of each memory cell;

- contains a block for addressing bit numbers in memory cells, which specifies the common bit number for all memory cells into which data is written;

- contains logical switches installed at all inputs of memory flip-flops to switch the channel for receiving addresses from the addresses of memory cells to the addresses of bits in memory cells, depending on the control signal CM;

- contains logical switches installed at all data inputs of memory triggers to switch data acquisition channels depending on the CM control signal;

- the output of the write/read control unit CS, RD, OE is connected to the output of the bit number selection unit.