RU2386167C1

RU2386167C1 - Device of information processing for information searching

Info

Publication number: RU2386167C1
Application number: RU2008135503/09A
Authority: RU
Inventors: Денис Иванович Костиков (RU); Павел Михайлович Сулима (RU); Николай Иванович Тимофеев (RU); Николай Владимирович Тришин (RU)
Priority date: 2008-09-01
Filing date: 2008-09-01
Publication date: 2010-04-10

Abstract

FIELD: information technologies.

SUBSTANCE: the device comprises a source array memory (1), a text symbol processing unit (2), a patch board (3), a display unit (4), an input-output unit (5), an input-output buffer memory (6), a text structural analysis unit (8), a control unit (13), an archive card synthesis unit (14), an archive card memory (15), and a trunk bus (20). The technical result is achieved by introducing a long-term memory (7), a service information extraction unit (9), a stop dictionary storage and correction unit (10), a quasi-morphological analysis unit (11), an inverted word-stem index generation unit (12), a related word-stem generation unit (16), a unit (17) for generating sentence selection criteria for the archive card, a unit (18) for selecting sentences for the archive card, and a counter (19) of characters in the text annotation, with the corresponding connections.

EFFECT: wider field of application and extended functional capabilities of the information processing device, achieved by enabling the processing and retrieval of text information of varied subject matter and meaning and by automatically adapting the device to changes in the subject area of the processed information, with humans completely excluded from the analysis, reading, annotation, and cataloguing of texts.

8 dwg

Description

The invention relates to computer science and computing hardware and can be used for symbolic processing of text information and for preprocessing text data for information retrieval.

A known device for information retrieval (patent RU 2039376, IPC G06F 17/30) comprises a memory interface unit, a source array memory, an input-output unit, a patch board, a display unit, an input-output buffer memory, a control unit, a buffer memory, and a trunk bus that includes address, data, and control buses.

The disadvantages of the known device are that the source array of text information must be converted to a special form used in the device before processing; that the result of processing does not allow the content of the text to be assessed and cannot be used as an annotation to the processed text; and that the device does not support data selection.

The closest in essence to the claimed invention is an information processing device for information retrieval (patent RU 2096825, IPC G06F 17/00, 17/30), which comprises a memory interface unit, a source array memory, an input-output unit, a patch board, a display unit, an input-output buffer memory, a control unit, a buffer memory, a text fragment synthesis unit, a text fragment structural analysis unit, a text fragment symbol processing unit, a phrase dictionary correction unit, a primary dictionary synthesis unit, a primary dictionary filtering unit, a secondary dictionary synthesis unit, a secondary dictionary analysis unit, an archive card synthesis unit, a text fragment archiving unit, a system dictionary correction unit, an archive card memory, and a trunk bus that includes address, data, and control buses.

In the prototype device, information processing and archive card formation are divided into two stages.

In the first stage, the device operates under the control of an operator. The result of the first stage is the next text fragment, which has a trivial meaning: a separate chapter of a document, a separate paragraph, a separate article in a collection of articles, and so on.

In the second stage, the device operates automatically as follows.

The device operator enters a device activation command on the patch board. A signal to start automatic operation passes from the patch board to the input of the input-output unit. From the output of the input-output unit, the signal is transmitted over the trunk bus to the input of the control unit, where it is converted into a sequence of device control commands. By issuing control commands to their inputs, the control unit activates in sequence the units for text fragment structural analysis, text fragment symbol processing, phrase dictionary correction, primary dictionary synthesis, primary dictionary filtering, secondary dictionary synthesis, secondary dictionary analysis, archive card synthesis, text fragment archiving, and system dictionary correction. The signal carrying the result of the device's operation (the archive card of the text fragment) is fed to the input of the archive card memory, where it is recorded for later use.

All of the listed units operate on the same principle. Signals from the buffer memory information array formed by unit No. (k-1) arrive sequentially over the trunk bus at the input of unit No. k. From the output of unit No. k, the signals are written sequentially into the buffer memory information array and passed from there to unit No. (k+1).
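The hand-off between consecutive units can be illustrated in software. The following is only a minimal sketch, assuming that each unit can be modelled as a function from a list to a list and that the buffer memory is a single Python list; the names run_pipeline, to_lower and drop_empty are hypothetical and do not come from the patent.

```python
from typing import Callable, List

Block = Callable[[list], list]  # hypothetical model of one processing unit


def run_pipeline(blocks: List[Block], initial: list) -> list:
    """Run the units in order; each reads the buffer left by its predecessor."""
    buffer = list(initial)      # stands in for the buffer memory information array
    for block in blocks:        # unit k reads what unit k-1 wrote
        buffer = block(buffer)  # unit k's output becomes the input of unit k+1
    return buffer


# Two toy units, purely illustrative.
def to_lower(items: list) -> list:
    return [s.lower() for s in items]


def drop_empty(items: list) -> list:
    return [s for s in items if s]


result = run_pipeline([to_lower, drop_empty], ["Text", "", "DATA"])
print(result)
```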

The text fragment structural analysis unit extracts information signals indicating that the input data contains lines with a specified percentage of digit characters, as well as signals of a character information array of a specified length. The signal describing the structure of this output data is stored in the text fragment structural analysis unit when the device is configured. The device is configured before it starts operating.
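A software analogue of the digit-percentage test might look as follows. This is a sketch under assumptions: the 50% threshold, the decision to ignore whitespace, and the names digit_share and flag_numeric_lines are illustrative choices, since the patent only speaks of "a specified percentage" set at configuration time.

```python
def digit_share(line: str) -> float:
    """Fraction of non-whitespace characters in the line that are digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def flag_numeric_lines(lines, threshold=0.5):
    """Indices of lines whose digit share reaches the (assumed) threshold."""
    return [i for i, line in enumerate(lines) if digit_share(line) >= threshold]


lines = ["Table 1", "12 345 678", "plain text"]
flagged = flag_numeric_lines(lines)
print(flagged)
```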

In the text fragment symbol processing unit, the input signals are processed so as to decompose the source information into individual words and phrases (a phrase being a group of words enclosed in quotation marks in the source text). The output of the unit delivers to the buffer memory an information array signal whose structure makes it possible to read an individual word or phrase from the buffer memory and to distinguish words from phrases during a search; data signals giving the number of lines and words in the processed text fragment are also written to the buffer memory.
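The decomposition into words and quoted phrases can be sketched with a regular expression. The pattern, the use of straight double quotes as the only phrase delimiter, and the function name split_words_and_phrases are assumptions made for illustration, not details taken from the prototype.

```python
import re

# Either a run of text inside double quotes (a phrase) or a single word.
TOKEN = re.compile(r'"([^"]+)"|(\w+)')


def split_words_and_phrases(text: str):
    """Return (words, phrases); phrases are the quoted groups of words."""
    words, phrases = [], []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            phrases.append(match.group(1))
        else:
            words.append(match.group(2))
    return words, phrases


words, phrases = split_words_and_phrases('The "archive card" stores text')
print(words, phrases)
```

Keeping the two lists separate mirrors the requirement that words and phrases remain distinguishable during a search.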

In the phrase dictionary correction unit, the signal giving the number of characters in the information that has been read is compared with a reference signal. If the number of characters exceeds the reference value, the signal read from the buffer memory is processed in the phrase dictionary correction unit so as to decompose the phrase into individual words. The results of processing the information signals in the phrase dictionary correction unit are stored in its information array, whose structure makes it possible to search for a required word and to store, in addition to character information, numerical information tied to a specific word by data addressing.
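The length check described here can be sketched as follows. The reference value of 24 characters and the name correct_phrases are hypothetical; the patent does not state the reference value.

```python
def correct_phrases(phrases, max_chars=24):
    """Keep short phrases; split phrases longer than the reference into words."""
    kept, words = [], []
    for phrase in phrases:
        if len(phrase) > max_chars:      # exceeds the (assumed) reference value
            words.extend(phrase.split())
        else:
            kept.append(phrase)
    return kept, words


kept, extra_words = correct_phrases(
    ["archive card", "unit for selecting sentences for the archive card"]
)
print(kept, extra_words)
```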

The primary dictionary synthesis unit generates a signal of a new information array of the same structure that contains no repeated words or phrases. The numerical data signals attached to each word of this array indicate how many times that word occurs in the input information signal of the primary dictionary synthesis unit.
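In software terms, the primary dictionary corresponds to a frequency table. A minimal sketch, assuming the input is a flat list of tokens and using the hypothetical name build_primary_dictionary:

```python
from collections import Counter


def build_primary_dictionary(tokens):
    """Map each distinct token to the number of times it occurs in the input."""
    return dict(Counter(tokens))


primary = build_primary_dictionary(["bus", "memory", "bus", "card", "bus"])
print(primary)
```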

The primary dictionary filtering unit generates a signal of a new information array of the same structure. The signal corresponds to the result of removing from the input information array the entries for words that coincide with words in an information array of the same structure, which is written to the memory of the primary dictionary filtering unit when the device is configured. If the signal of a word from the input information satisfies the comparison conditions with the words in the unit's internal memory, the data for that word does not enter the output information array. The comparison is carried out in two stages. In the first stage, word signals are compared for an exact match; in the second stage, word signals from the input information are assessed for a match in the character structure of the words (for example, the character sequences "работали" ("worked") and "работ." have the same character structure).
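The two-stage comparison can be sketched as below. The interpretation of the second stage is an assumption: a stored entry ending in a full stop, such as "работ.", is treated here as a truncated form that matches any word beginning with the same characters. The function names are hypothetical.

```python
def matches_stop_entry(word: str, entry: str) -> bool:
    """Stage 1: exact match. Stage 2 (assumed): truncated entry as a prefix."""
    if word == entry:
        return True
    if len(entry) > 1 and entry.endswith("."):
        return word.startswith(entry[:-1])
    return False


def filter_dictionary(primary: dict, stop_entries) -> dict:
    """Drop every word that matches any entry stored at configuration time."""
    return {
        word: count
        for word, count in primary.items()
        if not any(matches_stop_entry(word, entry) for entry in stop_entries)
    }


filtered = filter_dictionary({"работали": 2, "память": 1, "и": 5}, ["и", "работ."])
print(filtered)
```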

In the secondary dictionary synthesis unit, the input information signal is processed so as to divide the data of the input information array into four independent information arrays of the same structure. The signal of each word from the input information undergoes character-by-character processing in the unit to determine the type of each character ("lowercase", "uppercase", "digit", "alphabetic", "special", "significant", "space"). Signals of words consisting only of digits and special characters are excluded from further processing. The signals of the remaining words (and the data associated with them) are added to one of the output information arrays: "proper names", "abbreviations", "quotation marks", "ordinary words".
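The patent names the four output arrays but not the rules for assigning a word to them. The sketch below therefore relies on assumed rules (quoted entries go to "quotation marks", all-uppercase words to "abbreviations", capitalized words to "proper names", everything else to "ordinary words"); the function classify and its quoted flag are hypothetical.

```python
def classify(word: str, quoted: bool = False):
    """Return the name of the assumed target array, or None if excluded."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return None                  # only digits and special characters
    if quoted:
        return "quotation marks"
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "abbreviations"
    if word[0].isupper():
        return "proper names"
    return "ordinary words"


for sample in ["2008", "RU", "Moscow", "memory"]:
    print(sample, "->", classify(sample))
```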

The secondary dictionary analysis unit analyzes the signal of the identification code of the information array read from the output of the buffer memory and, depending on the value of the code, processes the input information signal in one of three ways. The first reads the input signal and writes it to one of the output information arrays. The other two extract the word stem from the word signal by removing suffix and ending characters, and then assess whether the character structure of the resulting sequence matches the signals of the remaining words of the current information array.
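Stem extraction by stripping endings, followed by merging words that share a stem, can be sketched as follows. The ending list, the minimum stem length and the English sample words are assumptions chosen for illustration; the prototype's actual suffix tables are not given in the text.

```python
ENDINGS = ("ing", "ed", "es", "s")  # hypothetical ending list


def stem(word: str, endings=ENDINGS, min_stem: int = 3) -> str:
    """Strip the longest matching ending, keeping at least min_stem characters."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word


def merge_by_stem(dictionary: dict) -> dict:
    """Sum the counts of all words that reduce to the same stem."""
    merged = {}
    for word, count in dictionary.items():
        key = stem(word)
        merged[key] = merged.get(key, 0) + count
    return merged


merged = merge_by_stem({"index": 1, "indexes": 2, "indexed": 1, "bus": 4})
print(merged)
```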

The archive card synthesis unit generates the output signal of the address at which the new archive card is to be stored in the archive card memory, addresses the input information for information retrieval, and generates the output information signal of the new archive card.

The text fragment archiving unit processes the input signals by determining the address of the archive area of the buffer memory in which the fragment is to be stored, checking that the area exists and, if necessary, creating and identifying it, and compressing the source information.

The system dictionary correction unit determines the access address of the most recent archive card, and the address signal from its output is fed to the input of the control unit. The control unit generates an archive card read command and sends it, together with the card access address signal, to the input of the archive card memory. In parallel, the control unit supplies to the input of the buffer memory a control signal that prepares the system dictionary area for writing.

Signals of six information arrays, which were written to the archive card memory from the output of the archive card synthesis unit, are read from the output of the archive card memory. These information signals are fed to the input of the system dictionary correction unit.

The system dictionary correction unit processes the input information signals so as to convert the input information into a list of words without any additional numerical information, and the resulting information signal is fed from its output to the input of the buffer memory. After this operation, the system dictionary correction unit sends a device shutdown command to the input of the control unit.

The control unit generates a sequence of control commands that clear the buffer memory, except for the system dictionary memory areas and the archive areas that store the compressed text fragments, and shuts the device down once this sequence has been executed.

The disadvantages of the prototype are its schematic and approximate representation of the content of a text document and the involvement of a human operator in the preliminary processing of information.

The aim of the invention is to extend the field of application and the functional capabilities of the device by enabling it to process and search text information of varied subject matter and meaning, and to adapt automatically to changes in the subject area of the processed information, by completely excluding humans from the analysis, reading, annotation, and cataloguing of texts.

The aim is achieved as follows. The known information processing device for information retrieval comprises, connected in series, a patch board, a display unit, which is the output of the device as a whole, an input-output unit, and a trunk bus, to which the inputs and outputs of the control unit and of the archive card memory are connected, the second output of the archive card memory being connected through the input-output unit to the second input of the display unit, and the input of the archive card memory being connected to the output of the archive card synthesis unit; a source array memory, whose first output is connected through the text symbol processing unit to the input of the text structural analysis unit; and an input-output buffer memory, whose input is the input of the device as a whole. According to the invention, the following are introduced into this device: a long-term memory, a service information extraction unit, a stop dictionary storage and correction unit, a quasi-morphological analysis unit, an inverted word-stem index generation unit, a related word-stem generation unit, a unit for generating sentence selection criteria for the archive card, a unit for selecting sentences for the archive card, and a counter of characters in the text annotation. The output of the input-output buffer memory is connected, through the long-term memory, which is connected to the trunk bus, to the input of the source array memory. The second output of the source array memory is connected through the service information extraction unit to the first input of the archive card synthesis unit and, through the series-connected unit for selecting sentences for the archive card and counter of characters in the text annotation, to the second input of the archive card synthesis unit. The series-connected stop dictionary storage and correction unit, quasi-morphological analysis unit, and inverted word-stem index generation unit are connected to the third input of the archive card synthesis unit. The output of the text structural analysis unit is connected, through the quasi-morphological analysis unit and the series-connected related word-stem generation unit and unit for generating sentence selection criteria for the archive card, to the second input of the unit for selecting sentences for the archive card.

A comparison of the technical solution with the device chosen as the prototype shows that the novelty of the technical solution lies in the introduction of new circuit elements into the claimed device: a long-term memory, a service information extraction unit, a stop dictionary storage and correction unit, a quasi-morphological analysis unit, an inverted word-stem index generation unit, a related word-stem generation unit, a unit for generating sentence selection criteria for the archive card, a unit for selecting sentences for the archive card, and a counter of characters in the text annotation, with the corresponding connections.

The claimed technical solution therefore meets the "novelty" criterion of an invention.

An analysis of known technical solutions in this and related fields shows that the introduced functional units are themselves known. However, introducing them into the known device with the specified connections gives the device new properties. The introduced functional units interact in a way that extends the field of application and functional capabilities of the processing device, adapts the device automatically to changes in the subject area of the processed information, and excludes humans from the analysis, reading, annotation, and cataloguing of texts.

The technical solution therefore meets the "inventive step" criterion, since it does not follow in an obvious way from the prior art for a person skilled in the art.

The invention can be used for symbolic processing of text information and for preprocessing text data for information retrieval.

The invention therefore meets the "industrial applicability" criterion.

Fig. 1 shows a block diagram of the information processing device for information retrieval; Fig. 2, a flowchart of the operating algorithm of the quasi-morphological analysis unit; Fig. 3, a flowchart of the operating algorithm of the unit for generating related pairs of word stems; Fig. 4, a flowchart of the operating algorithm of the unit for generating sentence selection criteria; Fig. 5, a flowchart of the operating algorithm of the sentence selection unit; Fig. 6, a flowchart of the operating algorithm of the inverted index generation unit; Fig. 7, a flowchart of the operating algorithm of the service information extraction unit; Fig. 8, a flowchart of the operating algorithm of the stop dictionary storage and correction unit.

Fig. 1:

1 - source array memory;

2 - symbolic text processing unit;

3 - patch board;

4 - display unit;

5 - input-output unit;

6 - input-output buffer memory;

7 - long-term memory;

8 - text structural analysis unit;

9 - service information extraction unit;

10 - stop-dictionary storage and correction unit;

11 - quasi-morphological analysis unit;

12 - unit for generating the inverted index of word stems;

13 - control unit;

14 - archive card synthesis unit;

15 - archive card memory;

16 - unit for forming linked word stems;

17 - unit for generating sentence selection criteria for the archive card;

18 - unit for selecting sentences for the archive card;

19 - counter of characters in the text annotation;

20 - main bus,

wherein the input of the input-output buffer memory 6 is the input of the device as a whole, and the display unit 4 is the output of the device as a whole; the output of the input-output buffer memory 6 is connected, through the long-term memory 7, which is connected to the main bus 20, to the input of the source array memory 1; the first output of the source array memory 1 is connected through the symbolic text processing unit 2 to the input of the text structural analysis unit 8, and its second output is connected through the service information extraction unit 9 to the first input of the archive card synthesis unit 14 and, through the series-connected unit 18 for selecting sentences for the archive card and counter 19 of characters in the text annotation, to the second input of the archive card synthesis unit 14; the third input of the archive card synthesis unit 14 is connected to the series-connected stop-dictionary storage and correction unit 10, quasi-morphological analysis unit 11 and unit 12 for generating the inverted index of word stems; the output of the text structural analysis unit 8 is connected, through the quasi-morphological analysis unit 11 and the series-connected unit 16 for forming linked word stems and unit 17 for generating sentence selection criteria for the archive card, to the second input of the unit 18 for selecting sentences for the archive card; the inputs and outputs of the control unit 13 and of the archive card memory 15 are connected to the main bus 20; the second output of the archive card memory 15 is connected through the input-output unit 5 to the second input of the display unit 4, and the input of the archive card memory 15 is connected to the output of the archive card synthesis unit 14.

The device operates as follows.

From the output of a device that converts source files to the .txt format, text files are written through the input-output buffer memory 6 into the long-term memory 7, and the attributes of these files are passed from the long-term memory 7 through the main bus 20 and the input-output unit 5 to the display unit 4; based on the displayed data, the operator uses the patch board 3 to select the file to be processed. Control signals from the patch board 3 pass through the display unit 4, the input-output unit 5 and the main bus 20 to the control unit 13, which signals the long-term memory 7 to read the corresponding file into the source array memory 1. From the source array memory 1 the text file is passed to the service information extraction unit 9, to the unit 18 for selecting sentences for the archive card and, through the symbolic text processing unit 2, to the text structural analysis unit 8.

The operation of the source array memory 1, the symbolic text processing unit 2, the patch board 3, the display unit 4, the input-output unit 5, the input-output buffer memory 6, the text structural analysis unit 8 and the control unit 13 is entirely analogous to that of the prototype.

The text structural analysis unit 8 passes to the quasi-morphological analysis unit 11 a text file from which all service characters have been removed and in which individual words have been isolated. The quasi-morphological analysis unit 11 operates according to the algorithm shown in Fig. 2: words containing fewer than three characters are removed from the text file; insignificant words listed in the stop-dictionary storage and correction unit 10 are removed; word stems are formed from the remaining words by cutting off their endings; and finally any resulting stems containing fewer than three characters are removed.

As an example, consider the transformation of the first sentence of the description of this device.

Source sentence: «Изобретение относится к техническим средствам информатики и вычислительной техники и может быть использовано для решения задач символьной обработки текстовой информации и предварительной обработки текстовых данных для информационного поиска» ("The invention relates to technical means of computer science and computer technology and can be used to solve problems of symbolic processing of text information and preliminary processing of text data for information retrieval").

The sentence after quasi-morphological processing (Russian word stems):

«изобретен относит техническ средств информатик вычислител техник использован решен задач символ обработк текстов информац предварител обработк текстов данн информационн поиск»
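The four steps of the quasi-morphological analysis described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not disclose the contents of the stop-dictionary or the list of endings, so `EXAMPLE_STOP_WORDS` and `EXAMPLE_ENDINGS` are hypothetical and cover just a fragment of the example sentence, and the function names are invented for this sketch.

```python
import re

# Hypothetical stop-dictionary and ending list, for illustration only;
# the patent does not disclose the actual contents of either.
EXAMPLE_STOP_WORDS = {"может", "быть"}
EXAMPLE_ENDINGS = ["ие", "ся", "им", "ам", "и", "о"]


def strip_ending(word, endings):
    """Cut the longest matching ending off a word, if there is one."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word


def quasi_morph(text, stop_words, endings, min_len=3):
    """Return the word stems of the text in their original order."""
    words = re.findall(r"\w+", text.lower())
    words = [w for w in words if len(w) >= min_len]    # drop short words
    words = [w for w in words if w not in stop_words]  # drop stop words
    stems = [strip_ending(w, endings) for w in words]  # cut off endings
    return [s for s in stems if len(s) >= min_len]     # drop short stems


fragment = ("Изобретение относится к техническим средствам информатики "
            "и может быть использовано")
stems = quasi_morph(fragment, EXAMPLE_STOP_WORDS, EXAMPLE_ENDINGS)
print(" ".join(stems))
```

With these hypothetical lists, the fragment yields the same first stems as the example above; a realistic ending list would need to be much larger.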

From the quasi-morphological analysis unit 11, the stems of significant words three or more characters long are passed to the unit 16 for forming linked word stems and to the unit 12 for generating the inverted index of word stems. The unit 16 for forming linked word stems operates according to the algorithm shown in Fig. 3, by which two-stem combinations are formed from the sequence of individual stems of significant words. An example of the result of this transformation is given in Table 1.

Table 1
изобретен относит
относит техническ
техническ средств
средств информатик
информатик вычислител
вычислител техник
техник использован
использован решен
решен задач
задач символ
символ обработк
обработк текстов
текстов информац
информац предварител
предварител обработк
обработк текстов
текстов данн
данн информационн
информационн поиск
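Table 1 is consistent with pairing each stem with the stem that immediately follows it. A minimal Python sketch of that reading is given below; the function name `stem_pairs` is hypothetical, and whether the device forms pairs across sentence boundaries is not stated in the description.

```python
def stem_pairs(stems):
    """Return the adjacent two-stem combinations in text order."""
    return list(zip(stems, stems[1:]))


# Stems of the example sentence after quasi-morphological processing.
stems = (
    "изобретен относит техническ средств информатик вычислител техник "
    "использован решен задач символ обработк текстов информац предварител "
    "обработк текстов данн информационн поиск"
).split()

pairs = stem_pairs(stems)
for first, second in pairs:
    print(first, second)
```

For the 20 stems of the example sentence this produces the 19 rows of Table 1, including the repeated pair «обработк текстов».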

From the unit 16 for forming linked word stems, the two-stem combinations, in a form like that shown in Table 1, are passed to the unit 17 for generating sentence selection criteria for the archive card. The unit 17 operates according to the algorithm shown in Fig. 4: it counts the occurrences of each two-stem combination from Table 1 and sorts the combinations in descending order of their number of occurrences in the text file. The two-stem combinations with the maximum number of occurrences are then taken as the criteria for selecting sentences for the annotation included in the archive card. In this example it is the two-stem combination «обработк текстов» (stems of the words for "processing" and "text"), which occurs twice in the text (Table 2).

Table 2
Two-stem combination: number of occurrences
обработк текстов: 2
вычислител техник: 1
данн информационн: 1
задач символ: 1
изобретен относит: 1
информатик вычислител: 1
информац предварител: 1
информационн поиск: 1
использован решен: 1
относит техническ: 1
предварител обработк: 1
решен задач: 1
символ обработк: 1
средств информатик: 1
текстов данн: 1
текстов информац: 1
техник использован: 1
техническ средств: 1
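The counting and ranking step that leads to Table 2 can be sketched as below. The alphabetical tie-break among equally frequent combinations is an assumption read off the row order of Table 2, and the name `selection_criteria` is hypothetical.

```python
from collections import Counter


def selection_criteria(pairs):
    """Rank two-stem combinations by count and return the most frequent ones."""
    counts = Counter(pairs)
    # Descending count; alphabetical order for ties (assumed from Table 2).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top = ranked[0][1] if ranked else 0
    criteria = [pair for pair, n in ranked if n == top]
    return ranked, criteria


stems = (
    "изобретен относит техническ средств информатик вычислител техник "
    "использован решен задач символ обработк текстов информац предварител "
    "обработк текстов данн информационн поиск"
).split()
pairs = list(zip(stems, stems[1:]))

ranked, criteria = selection_criteria(pairs)
print(criteria)
```

For the example sentence the only combination with the maximum count is «обработк текстов», which matches the selection criterion named in the description.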

The selection criteria from the unit 17 for generating sentence selection criteria for the archive card, in this case the two-stem combination «обработк текстов», are passed to the unit 18 for selecting sentences for the archive card, the second input of which receives the source text to be processed from the source array memory 1. The unit 18 operates according to the algorithm shown in Fig. 5, by which sentences containing the two-word combinations that serve as selection criteria are selected for the archive card. The selected sentences pass from the unit 18 through the counter 19 of characters in the annotation, which limits each selected sentence to its first six words and limits the total size of the archive card to 500 characters, in accordance with GOST 7.9-95 SIBID "Abstract and annotation. General requirements", and arrive at the archive card synthesis unit 14, the second input of which receives service information from the service information extraction unit 9 and keywords from the unit 12 for generating the inverted index of word stems.
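A rough Python sketch of the selection and truncation described above follows. Several details are assumptions, because the description does not specify them: a sentence is treated as matching when two consecutive words begin with the two stems of a criterion, sentences are split on terminal punctuation, and only the characters of the kept fragments count toward the 500-character limit. The function name and parameters are hypothetical.

```python
import re


def select_sentences(text, criteria, max_words=6, max_chars=500):
    """Pick sentences matching a criterion, cut to max_words, within max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    selected = []
    total = 0
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        # Assumed matching rule: consecutive words starting with both stems.
        hit = any(
            w1.startswith(a) and w2.startswith(b)
            for w1, w2 in zip(words, words[1:])
            for a, b in criteria
        )
        if not hit:
            continue
        short = " ".join(sentence.split()[:max_words])  # first six words
        if total + len(short) > max_chars:              # annotation size limit
            break
        selected.append(short)
        total += len(short)
    return selected


sample = ("The unit will process text files quickly and then store them. "
          "Nothing relevant here. We process textual data.")
print(select_sentences(sample, [("process", "text")]))
```

The English sample is used only to keep the demonstration short; the same function accepts the Russian stems from Table 2 as criteria.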

The keywords are formed as follows. From the quasi-morphological analysis unit 11, the stems of significant words three or more characters long are passed to the unit 12 for generating the inverted index of word stems, where the stems are ordered by decreasing frequency of occurrence in the analyzed text, which forms the inverted index of the analyzed text. Words that occur more often in a text are assumed, on the whole, to reflect its content more fully than words that occur rarely. The six word stems with the highest frequency of occurrence in the analyzed text are passed from the unit 12 to the third input of the archive card synthesis unit 14 as "keywords".
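The keyword step amounts to a frequency ranking of stems followed by taking the first six. A minimal Python sketch is shown below; the name `keyword_stems` is hypothetical, and the alphabetical tie-break is an assumption, since the description does not say how stems with equal frequency are ordered.

```python
from collections import Counter


def keyword_stems(stems, limit=6):
    """Return the most frequent stems, most frequent first."""
    counts = Counter(stems)
    # Descending frequency; alphabetical order for ties (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [stem for stem, _ in ranked[:limit]]


stems = (
    "изобретен относит техническ средств информатик вычислител техник "
    "использован решен задач символ обработк текстов информац предварител "
    "обработк текстов данн информационн поиск"
).split()

keywords = keyword_stems(stems)
print(keywords)
```

On a single sentence most stems occur once, so the tie-break dominates the result; on a full text the frequency ordering would carry more weight.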

The operation of the archive card synthesis unit 14 is entirely analogous to that of the prototype. The service information extraction unit 9 operates according to the algorithm shown in Fig. 7. The unit 12 for generating the inverted index of word stems operates according to the algorithm shown in Fig. 6.

The archive cards formed in the archive card synthesis unit 14 are accumulated in the archive card memory 15. The accumulated archive cards are used for compiling subject indexes of texts, for cataloguing texts, and for information retrieval by keywords and attributes.

Thus, the advantage over the prototype is a wider field of application and broader functional capabilities of the information processing device, which can handle information of varied subject matter and meaning, together with automatic adaptation of the device to changes in the subject area of the processed information, achieved by completely excluding a human from the process of analyzing, reading, annotating and cataloguing texts.

The information processing device for information retrieval is implemented using 74LCX16 245 input signal conditioners; a VirtexII FG676 reprogrammable logic integrated circuit (FPGA) with a 7C1380 RAM module (RAM1); an XC9572XL-VQ6 reprogrammable logic integrated circuit, a КРС3216 indicator line and a 50 MHz oscillator; a TMS320C6416 digital signal processor (DSP), a 93LC66B read-only memory and three 7C1380 RAM modules (RAM2-RAM4). The device is installed in the standard PCI bus of a personal computer, from which the working configurations of the reprogrammable logic circuits and the DSP are loaded. Text message signals arrive at the input of the device through the signal conditioners and are passed to the reprogrammable logic circuit.

Claims

An information processing device for information retrieval, comprising: a patch board, a display unit, which is the output of the device as a whole, an input-output unit and a main bus connected in series, the inputs and outputs of a control unit and of an archive card memory being connected to the main bus, the second output of the archive card memory being connected through the input-output unit to the second input of the display unit, and the input of the archive card memory being connected to the output of an archive card synthesis unit; a source array memory, the first output of which is connected through a symbolic text processing unit to the input of a text structural analysis unit; and an input-output buffer memory, the input of which is the input of the device as a whole; characterized in that it further includes a long-term memory, a service information extraction unit, a stop-dictionary storage and correction unit, a quasi-morphological analysis unit, a unit for generating an inverted index of word stems, a unit for forming linked word stems, a unit for generating sentence selection criteria for the archive card, a unit for selecting sentences for the archive card and a counter of characters in the text annotation, wherein the output of the input-output buffer memory is connected, through the long-term memory, which is connected to the main bus, to the input of the source array memory, the second output of which is connected through the service information extraction unit to the first input of the archive card synthesis unit and, through the series-connected unit for selecting sentences for the archive card and counter of characters in the text annotation, to the second input of the archive card synthesis unit, to the third input of which are connected the series-connected stop-dictionary storage and correction unit, quasi-morphological analysis unit and unit for generating the inverted index of word stems, and wherein the output of the text structural analysis unit is connected, through the quasi-morphological analysis unit and the series-connected unit for forming linked word stems and unit for generating sentence selection criteria for the archive card, to the second input of the unit for selecting sentences for the archive card.