EA041027B1

EA041027B1 - METHOD AND SYSTEM FOR RETRIEVING NAMED ENTITIES

Info

Publication number: EA041027B1
Application number: EA202092862
Authority: EA
Inventors: Anton Aleksandrovich Emelyanov
Original assignee: Public Joint-Stock Company "Sberbank of Russia" (PJSC Sberbank)
Priority date: 2020-08-31
Filing date: 2020-12-23
Publication date: 2022-08-30

Description

Technical field

The present invention relates generally to the field of computing, and in particular to a method and system for extracting named entities. The technical solution is applicable in many areas related to natural language text processing and information extraction from text.

Background art

At present, named entity extraction from text data is needed in a wide range of fields and applied tasks. In particular, when building dialog systems (chatbots, smart assistants), named entity extraction is necessary for correctly understanding the meaning of the text and the user's intent, and for forming a response to user requests.

In industrial settings where document classification is required, named entity extraction can be used to generate new text features that improve classification quality in real-world use.

Named entity extraction is a subtask in a number of areas, such as text document analysis, review classification, clustering of text collections, dialog systems, and others. Several approaches are currently used to extract named entities from Russian-language texts, each with its own advantages and disadvantages:

Named Entity Recognition (NER) by DeepPavlov. URL: http://docs.deeppavlov.ai/en/master/features/models/ner.html. The solution is based on fine-tuning the RuBERT language model for the named entity extraction task. The system has more than 100 million trainable parameters. Training this model requires a large amount of GPU memory (more than 3 GB);

Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition. URL: https://arxiv.org/pdf/1709.09686.pdf. The solution is based on concatenating a character-level Bi-LSTM layer and a token embedding layer to encode the input sequence, followed by a Bi-LSTM-CRF layer for entity prediction. This solution shows lower quality than modern models based on large language models, but it takes up several times less memory during training;

Deep-NER: named entity recognizer based on deep neural networks and transfer learning. The model is based on the ELMo (Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations. 2018. URL: https://arxiv.org/abs/1802.05365) or BERT language model, followed by a BiLSTM-CRF layer for predictions. The system has more than 100 million trainable parameters and takes up more than 3-4 GB of GPU memory during training. It performs worse than the DeepPavlov model;

Tomita parser. URL: https://yandex.ru/dev/tomita/. The Tomita parser is designed to extract structured data from natural language text. Facts are extracted using context-free grammars and keyword dictionaries. The parser lets users write their own grammars and add dictionaries for the required language, and it can be used to extract named entities. Its main drawback is the difficulty of writing rules: each new entity requires a new group of rules, which costs specialist time. On the other hand, the tool does not require training data.

The main drawbacks of the neural network models listed above are that they require large amounts of memory for training and prediction, as well as the corresponding hardware. Using the Tomita parser requires a highly qualified specialist to write the grammars.

Disclosure of the invention

The technical problem addressed by this technical solution is the creation of a new efficient, simple and reliable method for extracting named entities from Russian-language text. The solution should provide high-quality named entity extraction while requiring as little GPU memory as possible (less than 2 GB). The solution should also be able to solve the named entity classification task, either simultaneously (joint learning; Bing Liu, Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv:1609.01454) or separately.

The technical result is an increase in the accuracy of named entity prediction. This technical result is achieved by a method for extracting named entities from text information, performed by at least one computing device and comprising the steps of:

receiving text information;

splitting the text into words;

tokenizing the text to obtain a sequence of tokens;

generating, by means of a neural network, a set of vectors for the obtained sequence of tokens;

forming, on the basis of the obtained set of vectors, a vector representation of the sequence of tokens;

predicting named entities for the vector representation of the sequence of tokens by comparing the values of the obtained vector representation of the sequence of tokens with predetermined vector values obtained as a result of training the neural network;

recognizing the named entities obtained at the previous step by selecting a word label.
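The last step, selecting a label for each word from the labels predicted for its tokens, might look like the sketch below. The source does not state the selection rule, so the rule used here (a word takes the label of its first sub-token) is an assumption, as are the function name `word_labels`, the sample tokens and the label names.

```python
def word_labels(tokens, token_labels):
    """Collapse per-token labels into per-word labels.

    Assumption (not stated in the source): a word takes the label
    predicted for its first sub-token; '##' pieces continue a word.
    """
    words, labels = [], []
    for token, label in zip(tokens, token_labels):
        if token == "[CLS]":
            continue  # special token, not part of any word
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation piece onto the word
        else:
            words.append(token)
            labels.append(label)
    return list(zip(words, labels))


# Hypothetical tokens and predictions, for illustration only.
tokens = ["[CLS]", "карт", "##у", "С", "##бер", "##банк", "##а"]
predicted = ["O", "O", "O", "B-ORG", "B-ORG", "B-ORG", "B-ORG"]
print(word_labels(tokens, predicted))
```

Other rules, such as taking the most frequent label among a word's sub-tokens, would fit the same interface.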

In one particular embodiment of the method, the vector representation of the sequence of tokens is formed from the set of vectors by calculating a weighted sum or the mean of the values of the vectors contained in the set.
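As a minimal sketch of this combination step, assume the set of vectors is a list of layers, each holding one vector per token. The function name `combine_layers`, the toy numbers and the mixing weights are illustrative assumptions; in a trained model the weights would typically be learned.

```python
def combine_layers(layers, weights=None):
    """Combine per-layer token vectors into one vector per token.

    layers:  list of L layers, each a list of T token vectors of size D.
    weights: optional list of L mixing weights; None means a plain mean.
    """
    num_layers = len(layers)
    if weights is None:
        weights = [1.0 / num_layers] * num_layers
    num_tokens, dim = len(layers[0]), len(layers[0][0])
    combined = [[0.0] * dim for _ in range(num_tokens)]
    for layer, w in zip(layers, weights):
        for t in range(num_tokens):
            for d in range(dim):
                combined[t][d] += w * layer[t][d]
    return combined


# Two toy layers, two tokens, vectors of size 2.
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
print(combine_layers(layers))                # mean over layers
print(combine_layers(layers, [0.25, 0.75]))  # weighted sum
```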

In another particular embodiment of the method, an additional step is performed in which the dimension of the vectors contained in the obtained vector representation of the sequence of tokens is adjusted to obtain a vector representation of the sequence of tokens with a given dimension.
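Adjusting the dimension is commonly done with a linear layer applied to every token vector. The sketch below assumes that form; the function name `linear` and the weight and bias values are made up for illustration.

```python
def linear(vectors, weight, bias):
    """Apply y = W x + b to every vector; W has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Hypothetical weights that map vectors of size 3 to size 2.
weight = [[1.0, 0.0, 1.0],
          [0.0, 2.0, 0.0]]
bias = [0.5, -1.0]
sequence = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
print(linear(sequence, weight, bias))
```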

In another particular embodiment of the method, the following steps are additionally performed:

determining, by means of a recurrent layer of the neural network, dependencies between the values of the token vectors on the basis of the values of the vectors contained in the vector representation of the sequence of tokens and the values of the preceding vectors in said sequence;

adding information about the dependencies between the values of the token vectors to the representation of the sequence of tokens by forming a new vector representation of the sequence of tokens.

In another particular embodiment of the method, the following steps are additionally performed:

determining, for each vector in the vector representation of the sequence of tokens, the dependence of its values on the values of the other vectors;

generating a new vector representation of the sequence of tokens on the basis of the values of the vector representation of the sequence of tokens and the data on the dependencies between vector values determined at the previous step.

In another particular embodiment of the method, the final word label can be determined by voting among neural networks.
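The source does not describe the voting scheme. One plausible reading, shown here as an assumption, is a per-word majority vote over the labels produced by several independently trained networks; the function name `vote` and the sample labels are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Majority vote per word.

    predictions: list of label sequences, one per model, all the same length.
    Ties go to the label seen first (Counter keeps insertion order).
    """
    voted = []
    for labels_for_word in zip(*predictions):
        voted.append(Counter(labels_for_word).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three models for a three-word text.
model_a = ["B-PER", "O", "B-ORG"]
model_b = ["B-PER", "O", "O"]
model_c = ["O", "O", "B-ORG"]
print(vote([model_a, model_b, model_c]))
```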

In another particular embodiment of the method, the following steps are additionally performed:

determining, on the basis of the values of the vector representation of the sequence of tokens, the maximum of each value across all vectors, and forming from them a vector containing the maximum values of the obtained vector representation of the sequence of tokens;

determining, on the basis of the values of the vector representation of the sequence of tokens, the mean of each value across all vectors, and forming from them a vector containing the mean values of the obtained vector representation of the sequence of tokens;

forming a resulting vector on the basis of the vector of maximum values, the vector of mean values, and the values of the last vector in the vector representation of the sequence of tokens;

classifying the resulting vector to determine the class of the vector representation of the sequence of tokens.

In another preferred embodiment of the claimed solution, a named entity extraction system is provided, comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, carry out the above method.
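The sequence classification steps above (maximum, mean and last vector, followed by classification) can be sketched as follows. The source only says the resulting vector is formed "on the basis of" the three vectors; concatenating them is an assumption, as are the function names and the classifier weights.

```python
import math


def pooled_features(sequence):
    """Concatenate max-pooled, mean-pooled and last token vectors."""
    dim = len(sequence[0])
    max_vec = [max(vec[d] for vec in sequence) for d in range(dim)]
    mean_vec = [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]
    return max_vec + mean_vec + list(sequence[-1])


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sequence, weight, bias):
    """Linear layer plus softmax over the pooled features."""
    features = pooled_features(sequence)
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Two token vectors of size 2; hypothetical weights for two classes.
sequence = [[1.0, 4.0], [3.0, 2.0]]
weight = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
print(pooled_features(sequence))
print(classify(sequence, weight, bias))
```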

Brief description of the drawings

The features and advantages of the present technical solution will become apparent from the following detailed description of the invention and the accompanying drawings, in which:

Fig. 1 shows a general diagram of the interaction of the elements of the named entity extraction system;

Fig. 2 shows an example of a general view of the named entity extraction system.

Embodiments of the invention

The concepts and terms needed to understand this technical solution are described below.

In this technical solution, a system means, among other things, a computer system, a computer (electronic computing machine), CNC (computer numerical control), a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a given, well-defined sequence of operations (actions, instructions).

A command processing device means an electronic unit, a computing device, or an integrated circuit (microprocessor) that executes machine instructions (programs).

The command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

Program - a sequence of instructions intended for execution by the control unit of a computer or by a command processing device.

Database (DB) - a collection of data organized according to a conceptual structure describing the characteristics of these data and the relationships among them, supporting one or more application areas (ISO/IEC 2382:2015, 2121423 database).

Entity - any distinguishable object (an object that can be told apart from another) about which information needs to be stored in the database. An entity instance is a specific representative of that entity. For example, a representative of the entity Employee may be Employee Ivanov. Entity instances must be distinguishable, i.e. entities must have some properties that are unique to each instance of that entity.

Named entity - an object of a certain type that has a name, title, or identifier. Which types the system distinguishes is determined by the specific task. For the news domain these are usually people (PER), places (LOC), organizations (ORG), and miscellaneous, a broad range of objects such as events, slogans, etc.

In accordance with the diagram shown in Fig. 1, the named entity extraction system 100 comprises the following interconnected modules: a text preprocessing and tokenization module 10; a vector generation module 20; a module 30 for forming a vector representation of a sequence of tokens; a module 40 for determining dependencies between the values of token vectors; a module 50 for determining the types of dependencies between tokens in the vector representation of the sequence of tokens; a vector dimension adjustment module 60; a module 70 for predicting named entities for tokens; a module 80 for predicting the class of the vector representation of the sequence of tokens; and a named entity recognition module 90.

The text preprocessing and tokenization module 10 may be implemented on the basis of at least one computing device equipped with appropriate software, and may include a set of text tokenization models, for example a set of WordPiece models (see Yonghui Wu, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Mike Schuster, Zhifeng Chen. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144).

The vector generation module 20 may be implemented on the basis of the BERT language model. BERT is a neural network from Google that achieved state-of-the-art results by a wide margin on a range of tasks. BERT can be used to build AI programs for natural language processing: answering free-form questions, building chatbots and machine translators, analyzing text, and so on.

The module 30 for forming a vector representation of a sequence of tokens may be implemented on the basis of at least one computing device equipped with appropriate software for computing the weighted sum or the mean (depending on the configuration) of the vector values in order to form the vector representation of the sequence of tokens.

The module 40 for determining dependencies between the values of token vectors may be implemented on the basis of a recurrent neural network (RNN) configured to determine recurrent dependencies between vector values on the basis of the values of the vectors contained in the vector representation of the sequence of tokens. The computation graph of a recurrent neural network may also contain cycles, which reflect the dependence of a variable's value at the next time step on its current value.
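The cycle mentioned above is the hidden state that feeds back into the next step. The source does not name the cell type, so the sketch below assumes the simplest (Elman-style) recurrence h_t = tanh(W_in x_t + W_rec h_{t-1} + b); the function name `rnn_layer` and all weights are hypothetical.

```python
import math


def rnn_layer(sequence, w_in, w_rec, bias):
    """Run a simple recurrent layer over a sequence of token vectors.

    Each output depends on the current vector and, through the hidden
    state, on all preceding vectors in the sequence.
    """
    hidden = [0.0] * len(bias)
    outputs = []
    for vec in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[j], vec))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
        outputs.append(hidden)
    return outputs


# One-dimensional toy example: the second input is zero, yet its output
# is non-zero because the hidden state carries the first input forward.
print(rnn_layer([[0.5], [0.0]], w_in=[[1.0]], w_rec=[[1.0]], bias=[0.0]))
```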

The module 50 for determining the types of dependencies between tokens in the vector representation of the sequence of tokens may be implemented on the basis of a neural network containing a multi-head attention layer. Multi-head attention is a special layer that lets each input vector interact with the other words through an attention mechanism, instead of passing a hidden state as in an RNN or looking at neighboring words as in a CNN. Its input consists of Query vectors and several Key and Value pairs (in practice, Key and Value are always the same vector). Each of them is transformed by a trainable linear transformation; then the dot product of Q with each K is computed in turn, the results of these dot products are passed through softmax, and all the vectors are summed with the resulting weights into a single vector.

The module 60 for adjusting the dimension of the vector representation of the sequence of tokens may be implemented on the basis of linear layers of a neural network that process the vector values in order to adjust their dimensions. The weights of the linear layers may be set by the developer of module 60 at design time.

The module 70 for predicting named entities for tokens may be implemented on the basis of at least one neural network trained in advance on specific sets of named entities and the corresponding sets of vectors. In particular, a softmax layer, widely known in the art, may be used to predict named entities (see, for example, https://ru.wikipedia.org/wiki/Softmax).
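The attention computation described for module 50 (dot products of a query with every key, softmax, weighted sum) can be illustrated for a single head. The sketch omits the trainable linear projections and adds the usual division by the square root of the key dimension, which the source does not mention; both choices, and the function names, are assumptions. A multi-head layer would repeat this with separate projections per head and join the results.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))  # assumption: standard scaling factor
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Self-attention over two toy token vectors: each output is a mixture of
# both inputs, weighted towards the more similar one.
tokens = [[1.0, 0.0], [0.0, 1.0]]
print(attention(tokens, tokens, tokens))
```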

The module 80 for predicting the class of the vector representation of the sequence of tokens may be implemented on the basis of at least one neural network configured to perform max pooling and average pooling operations and containing at least one linear layer for adjusting the dimension of the vectors and a softmax layer.

The named entity recognition module 90 may be implemented on the basis of at least one computing device configured to convert data on the predicted named entities into words. Said conversion algorithm may be defined in advance by the developer of module 90 at design time.

At the first stage of operation of the named entity extraction system 100, text arrives at the text preprocessing and tokenization module 10, for example from data source 1. Data source 1 may be at least one database (DB) containing text information, or a user device such as a laptop or desktop computer, a mobile phone or smartphone, a tablet, etc. The text preprocessing and tokenization module 10 splits the text into words at the spaces between words and tokenizes it with the WordPiece model to obtain a sequence of tokens. For example, for the text "Хочу оформить дебетовую карту" ("I want to get a debit card"), the sequence of tokens is: 'Х', '##очу', 'о', '##фор', '##ми', '##ть', 'де', '##бет', '##овую', 'карт', '##у'. Module 10 also adds the special token [CLS] to the beginning of the sequence of tokens.
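The tokenization of the example sentence can be reproduced with a greedy longest-match-first WordPiece sketch. The vocabulary below is a toy one built only from the pieces listed in the example; a real WordPiece vocabulary is learned from data and is far larger. The function names and the `[UNK]` fallback are illustrative assumptions.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Split on whitespace, apply WordPiece, and prepend [CLS]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy vocabulary containing only the pieces from the example above.
vocab = {"Х", "##очу", "о", "##фор", "##ми", "##ть",
         "де", "##бет", "##овую", "карт", "##у"}
print(tokenize("Хочу оформить дебетовую карту", vocab))
```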

The generated token sequence is then passed by module 10 to the vector generation module 20, which, using methods known from the prior art (see, for example, Anton Emelyanov, Ekaterina Artemova. Gapping parsing using pretrained embeddings, attention mechanism and NCRF. 2019. Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference Dialogue (2019). Issue 18. Supplementary volume. Pages 21-30. URL: http://www.dialog-21.ru/media/4870/-dialog2019scopusvolplus.pdf), generates a set of vectors for the token sequence, for example, 12 vector representations of the token sequence. Module 20 then sends the resulting set of vectors to module 30 for generating a vector representation of the token sequence. Based on the received set of vectors, module 30 generates the vector representation of the token sequence, for example, by calculating a weighted sum or the mean of the values of the 12 vector representations.
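As a rough illustration of how module 30 might combine several vector representations into one, the sketch below computes either a plain average or a weighted sum across the representations. The function name `combine_layers`, the data layout, and the use of softmax to normalize the weights are assumptions made for this example; the description only states that a weighted sum or mean is calculated.

```python
import math


def softmax(xs):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layers, layer_scores=None):
    """Combine several representations of one token sequence.

    layers[l][t][d] is value d of token t in representation l.
    Returns a single vector per token.
    """
    n_layers = len(layers)
    if layer_scores is None:
        weights = [1.0 / n_layers] * n_layers  # plain average
    else:
        weights = softmax(layer_scores)  # assumed normalization for the weighted sum
    n_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layers[l][t][d] for l, w in enumerate(weights)) for d in range(dim)]
        for t in range(n_tokens)
    ]


# 12 representations of a 3-token sequence with 4 values per token (dummy data).
layers = [[[float(l)] * 4 for _ in range(3)] for l in range(12)]
averaged = combine_layers(layers)
weighted = combine_layers(layers, layer_scores=[0.1 * l for l in range(12)])
```

In a trained system the per-representation scores would typically be learned together with the rest of the network; here they are fixed numbers chosen only for demonstration.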

Next, the vector representation of the token sequence obtained by module 30 is sent to the dimension adjustment module 60, which adjusts the dimension of the vectors contained in it, for example, by multiplying the vector values by a generated random non-integer number, to obtain a vector representation of the token sequence with a specified dimension.
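The description does not spell out how the multiplication changes the dimension. One possible reading, offered here only as an assumption, is a linear projection whose weights start out as random non-integer numbers, in line with the linear layer that the description attributes to module 80. The names `make_projection` and `project` are hypothetical.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of random non-integer weights (assumed init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(sequence, weights):
    """Map every token vector from in_dim to out_dim values."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in sequence
    ]


sequence = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]  # 2 tokens, 4 values each
weights = make_projection(in_dim=4, out_dim=2)
projected = project(sequence, weights)  # 2 tokens, 2 values each
```

In a trained network such weights would be adjusted during training rather than left at their random initial values.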

In an alternative embodiment, module 30 may send the obtained vector representation of the token sequence to module 40 for determining dependencies of token vector values. Based on the values of the vectors contained in the vector representation of the token sequence and the values of the preceding vectors in the sequence, module 40 uses a recurrent layer to determine dependencies of the token vector values, which may be expressed as vectors. Module 40 adds the information about these dependencies to the vector representation of the token sequence by generating a new vector representation, which is then sent either to module 50 for determining the types of dependencies between tokens or to module 60 for adjusting the dimension of the vector representation, where it is processed as described above.

On receiving a vector representation of the token sequence, module 50 determines for each vector, using methods known from the prior art, the dependencies of its values on the values of the other vectors in the vector representation; these dependencies may also be expressed as vectors. Module 50 then adjusts the vector representation of the token sequence by generating a new vector representation from the values of the existing one and the data on the dependencies between vector values. The dependency data characterize the types of dependencies between the tokens contained in the vector representation and may indicate a semantic or morphological dependency between tokens. The vector representation generated by module 50 is then sent to module 60 for adjusting its dimension, where it is processed as described above.

After module 60 has generated the vector representation of the token sequence with the specified dimension, it is sent to module 70 for predicting named entities for tokens. Module 70 predicts named entities for the vector representation of the token sequence by comparing the values of the received vector representation with predetermined vector values obtained as a result of training the neural network. For example, for the word "дебетовую" ("debit"), the following sequence of named entity labels in IO markup may be determined: 'I_DEB_CARD', 'I_DEB_CARD', 'I_O'. The label O means that there is no entity (in this case, the model made a mistake on one token). The final named entity prediction for the word "дебетовую" will be DEB_CARD.

Next, module 70 sends the data about the named entities to the named entity recognition module 90, which, using methods known from the prior art, selects a word label for each named entity. The final word label may be determined by neural networks using voting: the label with the largest number of votes is chosen as the final label. For example, the label I_DEB_CARD may receive 2 votes and the label I_O 1 vote. Accordingly, the final named entity label prediction for the word will also be DEB_CARD.
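The voting step can be sketched as a simple majority count over the token-level labels of one word. The function name `vote_word_label` and the tie-breaking behaviour (the label seen first wins) are assumptions of this example; the description does not say how ties are resolved.

```python
from collections import Counter


def vote_word_label(token_labels):
    """Pick the most frequent token-level label and strip the IO prefix."""
    counts = Counter(token_labels)
    # most_common keeps first-seen order among equal counts (assumed tie-break).
    winner, _ = counts.most_common(1)[0]
    return winner[2:] if winner.startswith("I_") else winner


# Labels predicted for the three tokens of one word, as in the example above.
label = vote_word_label(["I_DEB_CARD", "I_DEB_CARD", "I_O"])
print(label)
```

Here I_DEB_CARD receives 2 votes and I_O receives 1, so the word-level label is DEB_CARD.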

In addition, the vector representation of the token sequence with the specified dimension may be processed by module 80 for predicting the class of the vector representation of the token sequence. Using the Max Pooling operation, module 80 determines the maximum values across all vectors of the vector representation received from module 60 and forms from them a vector of maximum values. Likewise, using the Average Pooling operation, module 80 determines the mean values across all vectors and forms from them a vector of mean values. Then, from the vector of maximum values, the vector of mean values, and the last vector in the vector representation of the token sequence, module 80 forms a resulting vector, for example, by concatenation. The resulting vector is then processed by the linear layer of module 80, which adjusts its dimension as described above, and by the softmax layer, which classifies the resulting vector; the classification result determines the class of the vector representation of the token sequence for which the resulting vector was formed.
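The processing performed by module 80 can be sketched as follows: max pooling and average pooling over the token vectors, concatenation with the last token vector, a linear layer, and softmax. All names, the hand-picked weights, and the two class names are hypothetical and serve only to make the example runnable; a real module would use trained weights.

```python
import math


def max_pool(sequence):
    """Per-position maximum over all token vectors."""
    return [max(column) for column in zip(*sequence)]


def mean_pool(sequence):
    """Per-position mean over all token vectors."""
    return [sum(column) / len(column) for column in zip(*sequence)]


def linear(vector, weights, bias):
    """Linear layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sequence, weights, bias, classes):
    # Concatenate max-pooled, mean-pooled, and last token vectors.
    features = max_pool(sequence) + mean_pool(sequence) + list(sequence[-1])
    probs = softmax(linear(features, weights, bias))
    best = max(range(len(classes)), key=probs.__getitem__)
    return classes[best], probs


sequence = [[1.0, -2.0], [3.0, 0.0], [0.5, 4.0]]  # 3 tokens, 2 values each
classes = ["call_operator", "want_card"]  # hypothetical class names
weights = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]  # hand-picked, not trained
bias = [0.0, 0.0]
predicted, probs = classify(sequence, weights, bias, classes)
```

Concatenating the three vectors triples the feature size, which is why the linear layer is needed to bring the result down to one value per class before softmax.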

Information about the class of the vector representation of the token sequence may indicate, for example, the presence or absence of named entities in the text information, or the type of user request (intent) contained in the text information for which the vector representation was generated. For example, suppose the text information is "на моей дебетовой карте с номером хххх хххх хххх хххх были операции, которых я не делал. Что делать?" ("there were transactions on my debit card number xxxx xxxx xxxx xxxx that I did not make. What should I do?"), which in labeled form is represented as O O B_DEB_CARD I_DEB_CARD O B_NUM_CARD I_NUM_CARD O O O O O O O O O O O. For the vector representation of the token sequence generated for this text, module 80 will determine the class позвать_оператора (call_operator), indicating the type of user request: calling an operator. If the text information is "Я хочу дебетовую карту" ("I want a debit card"), the class will be determined as хочу_карту (want_card), indicating the type of request: issuing a card. The type of user request may also indicate a wish to obtain a loan or another service. The word label data from the named entity recognition module 90 and the classification data for the vector representation of the token sequence from module 80 may then be sent to data storage device 2 for subsequent display to the user upon request.

Thus, because the vector representation of the token sequence is formed from a set of vectors obtained by a neural network from the token sequence data, and named entities are predicted for the vector representation by comparing its values with predetermined vector values obtained as a result of training the neural network, the accuracy of named entity prediction is increased. Prediction accuracy is further supported by taking into account, during prediction, information about the dependencies of token vector values on the other vectors contained in the vector representation of the token sequence.

In general (see Fig. 2), the named entity extraction system (200) comprises, connected by a common data exchange bus, one or more processors (201), memory means such as RAM (202) and ROM (203), I/O interfaces (204), I/O devices (205), and networking means (206).

The processor (201) (or multiple processors, a multi-core processor, etc.) can be selected from a range of devices currently in wide use, for example, from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. The processor, or one of the processors, used in the system (200) should also be understood to include a graphics processor, for example, an NVIDIA GPU with a CUDA-compatible programming model, or Graphcore, which is likewise suitable for full or partial execution of the method and can also be used for training and applying machine learning models in various information systems.

RAM (202) is random access memory designed to store machine-readable instructions executable by the processor (201) to perform the necessary logical data processing operations. The RAM (202) typically contains the executable instructions of the operating system and associated software components (applications, program modules, etc.). The available memory of a graphics card or graphics processor may also serve as the RAM (202).

ROM (203) is one or more persistent storage devices such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

Various types of I/O interfaces (204) are used to organize the operation of the components of the system (200) and of externally connected devices. The choice of interfaces depends on the particular design of the computing device; they may include, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

To enable user interaction with the computing system (200), various I/O means (205) are used, for example, a keyboard, a display (monitor), a touch screen, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification tools (retinal scanner, fingerprint scanner, voice recognition module), etc.

The networking means (206) provides data transmission over an internal or external computer network, for example, an intranet, the Internet, a LAN, etc. The one or more means (206) may include, but are not limited to: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, and others.

Additionally, satellite navigation tools can be used as part of the system (200), for example, GPS, GLONASS, BeiDou, Galileo. The specific choice of elements of the system (200) for implementing various software and hardware architectural solutions may vary, provided that the required functionality is preserved.

Modifications and improvements to the above-described embodiments of the present technical solution will be clear to those skilled in the art. The foregoing description is provided by way of example only and is not intended to be limiting in any way. Thus, the scope of the present technical solution is limited only by the scope of the appended claims.

CLAIM

1. A method for extracting named entities from text information, performed by at least one computing device, comprising the steps of:

receiving text information;

splitting the text into words;

tokenizing the text to obtain a sequence of tokens;

generating, by means of a neural network, a set of vectors for the obtained sequence of tokens;

forming, based on the obtained set of vectors, a vector representation of the sequence of tokens;

predicting named entities for the vector representation of the sequence of tokens by comparing the values of the obtained vector representation with predetermined vector values obtained as a result of training the neural network;

recognizing the named entities obtained at the previous step by selecting a word label.

2. The method according to claim 1, characterized in that a vector representation of a sequence of tokens based on a set of vectors is formed by calculating a weighted sum or average values
