RU2775358C1

RU2775358C1 - Method and system for obtaining vector representation of electronic text document for classification by categories of confidential information

Info

Publication number: RU2775358C1
Application number: RU2021128049A
Authority: RU
Inventors: Кирилл Евгеньевич Вышегородцев; Иван Александрович Оболенский; Максим Сергеевич Головня
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Filing date: 2021-09-24
Publication date: 2022-06-29

Abstract

FIELD: computer technology.

SUBSTANCE: the invention relates to computer technology. A computer-implemented method for obtaining a vector representation of an electronic text document, in order to determine the category of confidential information it contains, is performed by a processor and comprises the following steps. A model for placing m-skip-n-grams into clusters is formed by determining the list of m-skip-n-grams to be used, converting each m-skip-n-gram from the list into a vector representation, and clustering the m-skip-n-grams. The text document is then processed with the resulting model: the occurrences of m-skip-n-grams in the document are counted, the clusters of the document are determined from those occurrences, the occurrence counts of the m-skip-n-grams in each cluster are summed, and a vector representation of the document is formed. Finally, the category of confidential information in the text document is determined.

EFFECT: enabling the preservation of different semantics of words in a document by mapping words to multiple clusters.

10 cl, 6 dwg, 1 tbl

Description

FIELD OF TECHNOLOGY

[0001] The present invention relates to computing systems in the broad sense and, more specifically, to systems and methods for processing natural languages, artificial languages, and any sign systems. It can be used in information processing systems, databases, and electronic storage.

BACKGROUND OF THE INVENTION

[0002] Automatic processing, transmission, and storage of documents may include classifying source documents, clustering them, and other operations that compare the vector representation of a document with the vector representation of another document or of any set or group of documents. Embodiments of this invention may resemble the solutions described earlier in patents RU 2701995 C2, RU 2583716 C2, and RU 2254610 C2. Methods for obtaining a vector representation and for subsequent classification, such as BERT and GPT networks, recurrent neural networks (RNN), and the like, require substantial computing resources. At the same time, the vast majority of users have low-powered personal computers (desktop PCs, laptops) that lack specialized computing hardware such as multi-core processors and graphics processing units (GPUs). Methods that do not require much computing power (Tf-Idf) do not produce vector representations of documents that are effective enough for further processing, for example classification. Thus, at present, it is not possible to obtain a vector representation and/or classify a document efficiently on the user's side.

[0003] The main disadvantages of existing solutions are as follows:

- solutions with high classification quality require specialized computing hardware (multi-core processors, GPUs), which ordinary PCs lack;

- the entire pre-trained neural network model must be held in memory (RAM or GPU memory). The standard version of BERT occupies 680 megabytes (multilingual from Google: https://huggingface.co/bert-base-multilingual-cased; Russian-language from DeepPavlov: https://huggingface.co/DeepPavlov/rubert-base-cased-sentence), and extended versions occupy 1.8 GB (LaBSE: https://huggingface.co/sentence-transformers/LaBSE; from Sberbank: http://sberbank-ai/sbert_large_nlu_ru);

- processing is very slow for the user: several minutes for a text of about 10^3 words;

- speeding up processing by using only part of the document inevitably leads to loss of the document's semantic integrity and to incomplete information about it;

- running the processing on a third-party server requires transferring the document (often a large one) and operating a powerful computing center, which makes this approach very expensive, since the number of documents to be processed amounts to tens of millions per month.

SUMMARY OF THE INVENTION

[0004] The claimed invention is aimed at solving the technical problem of reducing the time needed to process text data on the user's computer (the user's office PC) and of enabling use on non-specialized computers, by converting the text data into vector form without directly using neural network models.

[0005] The technical result is an increase in the speed at which documents are processed on the user's computer in order to classify the confidential information they contain. This is achieved by using vector representations of m-skip-n-grams of words to count the frequencies of pre-trained semantic clusters in a text document and thereby convert the document into vector form.

[0006] An additional result of the claimed solution is that documents can be not only classified into the specified categories of confidential information but also clustered within each category, because the components of the resulting document vector have the meaning of frequencies of semantic characteristics.

[0007] The claimed technical result is achieved by a computer-implemented method for vector representation of an electronic text document in order to determine the category of confidential information it contains, performed by a processor and comprising the steps of:

- forming at least one model for placing m-skip-n-grams into clusters, where an m-skip-n-gram represents at least a single word, and where forming said model comprises:

determining the list of m-skip-n-grams to be used;

converting each m-skip-n-gram from the list into a vector representation;

clustering the m-skip-n-grams by their vector representations;

- processing at least one text document using the resulting m-skip-n-gram placement model, which comprises:

counting the occurrences of m-skip-n-grams in the text document;

determining the clusters of the text document based on the occurrences of m-skip-n-grams;

summing the occurrence counts of the m-skip-n-grams from each cluster;

forming a vector representation of the text document from the ordered sequence of the m-skip-n-gram sums; determining the category of confidential information in the text document based on the m-skip-n-gram placement model.
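The document-processing steps above can be illustrated with a minimal sketch. It assumes the simplest case: the m-skip-n-grams are single words and each one belongs to exactly one cluster. The dictionary `PLACEMENT_MODEL`, its contents, and the function name `document_vector` are hypothetical and are not taken from the patent; the final classification step is not shown.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical placement model: each single word maps to one cluster id.
# The words and cluster ids are illustrative only.
PLACEMENT_MODEL: Dict[str, int] = {
    "account": 0, "card": 0, "payment": 0,
    "passport": 1, "address": 1,
    "meeting": 2, "schedule": 2,
}
NUM_CLUSTERS = 3


def document_vector(words: List[str], model: Dict[str, int],
                    num_clusters: int) -> List[int]:
    """Count gram occurrences and sum them per cluster, in cluster order."""
    counts = Counter(words)            # occurrences of each gram in the document
    vector = [0] * num_clusters        # one position per cluster
    for gram, count in counts.items():
        cluster = model.get(gram)
        if cluster is not None:        # grams outside the model are skipped here
            vector[cluster] += count
    return vector


doc = "card payment card passport hello".split()
vec = document_vector(doc, PLACEMENT_MODEL, NUM_CLUSTERS)
print(vec)  # [3, 1, 0]
```

The resulting vector has one component per cluster, so its length does not depend on the document length; a downstream classifier would take it as input.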

[0008] In one particular embodiment of the method, a fuzzy partition of the list of m-skip-n-grams into clusters is used.

[0009] In another particular embodiment of the method, each m-skip-n-gram belongs to several clusters.

[0010] In another particular embodiment of the method, each m-skip-n-gram has a weight characterizing its proximity to a given cluster.

[0011] In another particular embodiment of the method, the clustering of m-skip-n-grams by their vector representations is performed more than once.

[0012] In another particular embodiment of the method, weights are additionally used for the clusters of m-skip-n-grams; these weights characterize the significance of the clusters for the vector representation of the document.

[0013] In another particular embodiment of the method, vector representations are additionally obtained for m-skip-n-grams that are not in the list, and these m-skip-n-grams are assigned to at least one cluster based on their vector representations.

[0014] In another particular embodiment of the method, the list of m-skip-n-grams to be used, or part of it, is formed based on the occurrence of m-skip-n-grams in text data obtained from external data sources.
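As an illustration of the fuzzy embodiments above (several clusters per m-skip-n-gram, a proximity weight per cluster, and a significance weight per cluster), the following is a minimal sketch. The membership values, the cluster weights, and the names `FUZZY_MODEL`, `CLUSTER_WEIGHTS`, and `fuzzy_document_vector` are hypothetical; the patent text here does not fix how the weights are combined, so multiplying counts by memberships and then by cluster weights is an assumption.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical fuzzy model: gram -> {cluster id: membership weight}.
# "bank" belongs to two clusters, so both of its senses contribute.
FUZZY_MODEL: Dict[str, Dict[int, float]] = {
    "bank": {0: 0.75, 1: 0.25},
    "card": {0: 1.0},
    "river": {1: 1.0},
}
# Hypothetical significance weight of each cluster for the document vector.
CLUSTER_WEIGHTS: List[float] = [1.0, 0.5]


def fuzzy_document_vector(words: List[str],
                          model: Dict[str, Dict[int, float]],
                          cluster_weights: List[float]) -> List[float]:
    """Spread each gram's count over its clusters, then apply cluster weights."""
    sums = [0.0] * len(cluster_weights)
    for gram, count in Counter(words).items():
        for cluster, membership in model.get(gram, {}).items():
            sums[cluster] += count * membership
    return [weight * total for weight, total in zip(cluster_weights, sums)]


fuzzy_vec = fuzzy_document_vector(
    ["bank", "bank", "card", "river"], FUZZY_MODEL, CLUSTER_WEIGHTS
)
print(fuzzy_vec)  # [2.5, 0.75]
```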

[0015] The claimed solution is also implemented by a system for obtaining a vector representation of an electronic document, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the above method.

[0016] The claimed technical result is also achieved by a computer-implemented method for determining the category of confidential information in a text document, performed by a processor and comprising the steps of:

- preliminarily forming at least one model for placing m-skip-n-grams into clusters for a given topic, where an m-skip-n-gram represents at least a single word, and where forming said model comprises:

determining the list of m-skip-n-grams to be used;

converting each m-skip-n-gram from the list into a vector representation;

clustering the m-skip-n-grams by their vector representations;

- receiving at least one electronic text document;

- processing the received text document using the resulting m-skip-n-gram placement model, which comprises:

determining the clusters of the text document based on the occurrences of m-skip-n-grams;

summing the occurrence counts of the m-skip-n-grams from each cluster;

forming a vector representation of the text document from the ordered sequence of the m-skip-n-gram sums;

- determining the category of confidential information in the electronic text document using the m-skip-n-gram placement model.

[0017] In this document, an "m-skip-n-gram of words" (or simply "m-skip-n-gram" // http://www.machinelearning.rU/wiki/images/7/78/2017 417 DrapakSN.pdf) is a sequence of n words obtained from a sequence of words in some text, preserving the order in which the words occur in the text, where m words are removed from the original sequence after each of the n words. For example, 0-skip-1-grams of words are simply the individual words of the text, 0-skip-2-grams are word bigrams (pairs of consecutive words in the text), and 0-skip-3-grams are word trigrams (triples of consecutive words in the text). To build a 1-skip-2-gram, a sequence of three words is taken and the second word is removed, leaving the first and third words. To build a 2-skip-4-gram, a sequence of 10 words is taken: the first word is kept, the next 2 words are removed, the next word is kept, the next 2 words are removed, and so on until a sequence of 4 words is obtained. For example, take the Russian sentence «в соответствии с одним или более вариантами реализации настоящего изобретения» ("in accordance with one or more embodiments of the present invention"). The words «соответствии с», «или более», and «реализации настоящего» are removed, so the 2-skip-4-gram for this sentence is «в одним вариантами изобретения». When constructing m-skip-n-grams, a "word" may be not only a word of the language but also any punctuation mark, preposition, conjunction, or any other independent unit of the language. From here on, without loss of generality and to simplify the presentation, "word" ("words" and so on) means any possible m-skip-n-gram of words or sequence of m-skip-n-grams of words, unless stated otherwise, for example by "individual words", "single words", and the like. The original term m-skip-n-gram is also used.
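The definition above can be checked with a short sketch. The function name `skip_ngrams` is hypothetical. Following the worked examples in the text, m words are dropped between consecutive kept words, so one gram spans (n - 1) * (m + 1) + 1 source words (3 words for a 1-skip-2-gram, 10 words for a 2-skip-4-gram).

```python
from typing import List, Tuple


def skip_ngrams(words: List[str], m: int, n: int) -> List[Tuple[str, ...]]:
    """Return all m-skip-n-grams of `words`, in order of their start position."""
    if m < 0 or n < 1:
        raise ValueError("m must be >= 0 and n must be >= 1")
    step = m + 1                    # distance between kept words
    span = (n - 1) * step + 1       # number of source words one gram covers
    return [
        tuple(words[start:start + span:step])
        for start in range(len(words) - span + 1)
    ]


# The example sentence from the text (10 words).
sentence = ("в соответствии с одним или более вариантами "
            "реализации настоящего изобретения").split()

print(skip_ngrams(sentence, 2, 4))
# [('в', 'одним', 'вариантами', 'изобретения')]
print(skip_ngrams(["a", "b", "c"], 1, 2))
# [('a', 'c')]
```

With m = 0 the function reduces to ordinary n-grams, which matches the statement that 0-skip-2-grams are bigrams.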

[0018] The "embedding" of a word, or the "vector representation" of a word, or simply the "vector" of a word, is a numeric vector that is obtained from words or other linguistic entities, is defined for the word, and has a fixed dimension for the method used to obtain it. In other words, the vector representation of a word is an ordered sequence of numbers, that is, a numeric vector of some size, where each word has its own specific numeric vector. In the simplest case, word embeddings can be obtained by numbering the words in some dictionary and placing a value of 1, at the position given by the word's number, in a vector whose dimension equals the number of words in that dictionary. All other positions hold the value 0. For example, for the Russian language one can use Dahl's Explanatory Dictionary and number all of its words from the first to the last. The word «абажур» ("lampshade") then has the value 1 at position 3, «абанат» has the value 1 at position 7, and so on. If the dictionary contains 200,000 words, the embedding has dimension 200,000.
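The dictionary-numbering scheme just described can be sketched as follows. The four-word `toy_dictionary` is a hypothetical stand-in for a real dictionary, and the function names are illustrative; positions are numbered from 1 to match the text.

```python
from typing import Dict, List


def build_index(dictionary: List[str]) -> Dict[str, int]:
    """Number the dictionary entries from 1, in dictionary order."""
    return {word: position for position, word in enumerate(dictionary, start=1)}


def one_hot(word: str, index: Dict[str, int]) -> List[int]:
    """Vector of dictionary size with 1 at the word's position and 0 elsewhere."""
    vector = [0] * len(index)
    vector[index[word] - 1] = 1     # positions are 1-based, list indices 0-based
    return vector


toy_dictionary = ["alpha", "beta", "gamma", "delta"]   # hypothetical entries
index = build_index(toy_dictionary)

print(index["gamma"])            # 3
print(one_hot("gamma", index))   # [0, 0, 1, 0]
```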

This method of constructing embeddings is called one-hot encoding. The description of the invention does not limit the method used to obtain word vectors. These vectors can be obtained, for example, by a neural network that implements a mathematical transformation from a space with one dimension per word into a vector space of higher dimension, or by other methods that associate each m-skip-n-gram with its own numeric vector of a given dimension. The vector representations of words can also be taken from existing sets of vectorized words (Word2Vec, Glove, FastText, and others), with or without modification. Clustering here means grouping a set of objects into subsets (clusters) such that, by a given criterion, objects in the same cluster are more similar to each other than to objects in other clusters. The "weight" of an element, for example the "cluster weight" or the "m-skip-n-gram weight", can be understood as a mathematical construct (coefficients or multipliers) used in summation, integration, averaging, and similar operations to give some elements greater significance in the resulting value than others. A "weight" can also be defined as an additional multiplier, coefficient, or number assigned to individual terms or other factors in the scalar product of elements of the vector space used.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] The present invention is illustrated by way of example, without any limitation; its essence will become clear from the detailed description below, considered together with the drawings, in which:

[0020] FIG. 1 schematically shows an example of a process for automatically obtaining a vector representation of a document.

[0021] FIG. 2 schematically shows an example of a process for obtaining multiple sets of clusters, with weights for each cluster in each set.

[0022] FIG. 3 schematically shows an example of a process for obtaining a set of m-skip-n-grams of words in which each m-skip-n-gram is fuzzily assigned to each cluster in each set of clusters.

[0023] FIG. 4 schematically shows an example of a process for extracting document features.

[0024] FIG. 5 schematically shows an example of mapping word clusters to positions of the document vector, for the case of two-dimensional word vectors and crisp assignment of each word to a single cluster.

[0025] FIG. 6 illustrates the general layout of a computing device.

IMPLEMENTATION OF THE INVENTION

[0026] This document describes methods for obtaining a vector representation of an electronic document for further processing, transmission, and storage. The invention can be applied to any natural languages, artificial languages, and sign systems; hereinafter, all of these are referred to simply as "language".

[0027] Automatic processing, transmission, and storage of documents may include classifying source documents, clustering them, and other operations that compare the vector representation of a document with the vector representation of another document or of any set or group of documents.

[0028] FIG. 1 shows a general flowchart of the claimed method for obtaining the vector form of a document. In the first step (101), a model for placing m-skip-n-grams into clusters is obtained. Such a model is a list of m-skip-n-grams, each of which is assigned to at least one cluster. An example of such a model, for the particular case in which the m-skip-n-grams are symbols and words of the Russian language and are divided into 1000 clusters, is given in the table below:

The process of obtaining the model for placing m-skip-n-grams into clusters.

[0029] To form the model, the list of m-skip-n-grams to be used is determined first. In a particular case this may be an explanatory dictionary of the language, a word-embedding vocabulary (Word2Vec, GloVe, FastText and others), or the list of all words from Wikipedia articles. In that case the word list must be large enough to cover a sufficient share of the words in the analyzed documents. In another particular case the list additionally undergoes word normalization, that is, reduction to lemmas (https://ru.wikipedia.org/wiki/Лемматизация), stems (https://ru.wikipedia.org/wiki/Стемминг), morphological roots, or morphological word bases. In a particular case uppercase and lowercase letters are not distinguished when m-skip-n-grams are formed. In another particular case uppercase letters are distinguished from lowercase ones, and their presence yields different m-skip-n-grams. Besides words of natural, formal and artificial languages, the list may include punctuation marks, digits, acronyms, common nouns, proper names, shortened forms, special characters (@";#№$%%^∧*()_}{|\ and others), mathematical symbols and any other printable character sequences. In a particular case, any sequence of printable characters of a sign system may be considered, provided a word-separator character has been introduced. The size of the list, the specific m-skip-n-grams and the use of normalization are chosen according to the task at hand, the available computing resources and other factors.
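The list-building step above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the token pattern, the `fold_case` switch and the `normalize` callback (a stand-in for a lemmatizer or stemmer) are assumptions introduced here.

```python
import re
from typing import Callable, Iterable, List

# Token pattern (an assumption): runs of word characters, or single
# non-space symbols such as punctuation and special characters.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def build_term_list(texts: Iterable[str],
                    fold_case: bool = True,
                    normalize: Callable[[str], str] = lambda t: t) -> List[str]:
    """Collect the distinct single-token terms found in `texts`."""
    seen = {}
    for text in texts:
        for token in TOKEN_RE.findall(text):
            if fold_case:
                token = token.lower()  # case-insensitive variant
            token = normalize(token)   # hypothetical lemmatizer/stemmer hook
            seen.setdefault(token, None)  # dict keeps first-seen order
    return list(seen)

terms = build_term_list(["The Minister spoke.", "the minister will speak @ noon"])
```

With `fold_case=False` the forms "The" and "the" would remain separate entries, which corresponds to the case-sensitive variant described above.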

[0030] Next, each m-skip-n-gram in the list is converted to a vector representation, with or without a natural language model. This description does not limit the method of obtaining, the dimension or the form of the vector representation. Methods based on the algorithms and natural language models Word2Vec, GloVe, FastText, Universal Sentence Encoder, augmentation with a transformation via singular value decomposition, and TF-IDF may be used. In one particular embodiment the vector of an m-skip-n-gram is the concatenation of several vectors of that m-skip-n-gram obtained by different methods.
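The concatenation variant can be illustrated with a short sketch. The lookup tables below are toy dictionaries standing in for real embedding models, and filling a missing term with zeros is an assumption; the description does not say how gaps are handled.

```python
from typing import Dict, List, Sequence

Vector = List[float]

def concat_embeddings(term: str, tables: Sequence[Dict[str, Vector]],
                      dims: Sequence[int]) -> Vector:
    """Concatenate the vectors of `term` taken from several embedding tables."""
    out: Vector = []
    for table, dim in zip(tables, dims):
        # A table that lacks the term contributes a zero block (assumption).
        out.extend(table.get(term, [0.0] * dim))
    return out

# Toy tables standing in for, e.g., two different embedding methods.
table_a = {"bank": [0.1, 0.2]}
table_b = {"bank": [0.7, 0.8, 0.9], "loan": [0.3, 0.3, 0.3]}
vec = concat_embeddings("bank", [table_a, table_b], dims=[2, 3])
```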

[0031] In one particular embodiment of the method, at least one of the following neural networks is used as the pretrained natural language model: ELMO, GPT, GPT-2, GPT-3, LM-FiT, BERT-BASE, BERT-LARGE, RuBERT, CamemBERT, SciBERT, ROBERTA, R-Net, Semi-supervised Sequence learning, GloVe, Word2Vec, FastText, InferSent, OpenAI Transformer, USE-Transformer (Universal Sentence Encoder), USE-DAN, T-NLG (Turing Natural Language Generation), or models derived from them, including neural networks compressed (https://blog.rasa.com/compressing-bert-for-faster-prediction-2/#overview-of-model-compression) by distillation (https://habr.com/ru/company/avito/blog/485290/), pruning (https://neurohive.io/ru/novosti/kak-sokratit-razmer-nejroseti-na-10-20-i-ne-proigrat-v-tochnosti/, https://openreview.net/pdf?id=rJl-b3RcF7) or quantization (https://www.researchgate.net/publication/342938169_KVANTIZACIA_VESOV_GLUBOKIH_NEJRONNYH_SETEJ_POSLE_OBUCENIA, https://www.hse.ru/data/2019/07/12/1477568723/Грачев_резюме.pdf).

[0032] In another particular embodiment of the method, the above models are used as the pretrained natural language model, followed by fine-tuning of the m-skip-n-gram embeddings on a set of texts from banking documents (data, materials) or on another set of text documents.

[0033] Next, the m-skip-n-gram vectors are clustered. This description does not limit the method used to cluster the m-skip-n-gram list. Clustering algorithms such as K-Means, Affinity propagation, Mean-shift, Spectral clustering, Ward hierarchical clustering, Agglomerative clustering, DBSCAN, OPTICS, Gaussian mixtures and Birch may be used. The end result of clustering is the assignment of each m-skip-n-gram to at least one cluster.
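As one illustration of this step, the sketch below implements plain K-Means (Lloyd's algorithm) in pure Python. It is only one of the listed options; the function names, the seeded random initialization and the toy vectors are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points: List[Vector], k: int, seed: int = 0,
           iters: int = 50) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's k-means; returns (label per point, cluster centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        # Update step: move each center to the mean of its points.
        for j in range(k):
            members = [p for p, l in zip(points, new_labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Toy 2-D "embeddings" of four terms forming two obvious groups.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(vectors, k=2)
```

Each term's label is its cluster in a crisp placement model; which group receives label 0 depends on the initialization.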

[0034] Once the m-skip-n-gram placement model has been obtained, text documents are processed with it at stage (102). Applying the model yields the vector form of the document (103). If unprocessed documents remain, processing moves on to the next document, and so on until every document has been processed.

[0035] FIG. 2 shows an example of the steps for forming the model for placing m-skip-n-grams into clusters (101). Here a plurality of cluster sets is obtained, with a weight for each cluster in each set. First the base list of m-skip-n-grams is chosen, then their vectors are obtained (1011). This description does not limit in any way the list, the form of the m-skip-n-grams or the values of the parameters m and n in them. A distinctive part of the invention is the very possibility of using not only separate, single words of the language but any m-skip-n-grams of words. The choice of m and n is not restricted, so they may be chosen such that the resulting m-skip-n-grams coincide, for example, with single words, word bigrams, word trigrams and so on.
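How the choice of m and n reproduces words and bigrams can be seen in a small sketch. It assumes the common reading of an m-skip-n-gram as n tokens taken in their original order with at most m tokens skipped in total; the function name is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

def skip_grams(tokens: List[str], n: int, m: int) -> List[Tuple[str, ...]]:
    """All n-token subsequences that skip at most m tokens in total."""
    grams = []
    for start in range(len(tokens) - n + 1):
        # Starting at `start`, the gram may span at most n + m tokens.
        window_end = min(len(tokens), start + n + m)
        rest = range(start + 1, window_end)
        for tail in combinations(rest, n - 1):
            grams.append(tuple(tokens[i] for i in (start,) + tail))
    return grams

tokens = ["the", "minister", "will", "speak"]
words = skip_grams(tokens, n=1, m=0)         # coincides with single words
bigrams = skip_grams(tokens, n=2, m=0)       # coincides with ordinary bigrams
skip_bigrams = skip_grams(tokens, n=2, m=1)  # bigrams with up to one skip
```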

[0036] This step may be implemented by using ready-made vocabularies (Word2Vec, GloVe, FastText and others), by determining all possible m-skip-n-grams, or only those of interest, solely within the class of text documents being analyzed (from an available database or training set), or in any other way. Next, the vector representation of the selected m-skip-n-grams is obtained by any method (the algorithms and models Word2Vec, GloVe, FastText, Universal Sentence Encoder, augmentation with a transformation via singular value decomposition, TF-IDF and others) (1011). At step 1012 the vector representations of the m-skip-n-grams of words are clustered repeatedly. This clustering is performed by any known object-clustering method in which the required number of clusters is fixed in advance. A distinctive feature of the invention is that the clustering may be performed several times. It may be performed with different numbers of clusters by the same clustering algorithm; with the same number of clusters by the same algorithm but with different initializations; with the same number of clusters by different clustering algorithms; or in any other way that yields different clusterings of the objects.
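The repeated clustering can be sketched as a loop over configurations. The clusterer below is a deliberately simple stand-in (sample k points as centers, then label by nearest center); it is an assumption made to keep the example short, and any of the algorithms named in the description could take its place.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]

def assign_to_sampled_centers(points: List[Vector], k: int, seed: int) -> List[int]:
    """Stand-in clusterer: sample k points as centers, label by nearest center."""
    centers = random.Random(seed).sample(points, k)
    return [min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points]

def run_clusterings(points: List[Vector],
                    configs: List[Tuple[int, int]]) -> List[List[int]]:
    """Run one clustering per (k, seed) pair; element i is partition k_i."""
    return [assign_to_sampled_centers(points, k, seed) for k, seed in configs]

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
# Same algorithm, different cluster counts and different initializations.
K = run_clusterings(points, configs=[(2, 0), (2, 1), (3, 0)])
```

Here `K` plays the role of the plurality of partitions: three label lists over the same six objects.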

[0037] In this way a plurality of cluster sets K = {k₁, k₂, …, k_{N-1}, k_N} is formed (step 1013). For each cluster C_ij, that is, the j-th cluster in the i-th partition into clusters, where i = 1, …, N and j = 1, …, M_i (M_i being the number of clusters in the i-th partition), a weight coefficient q_ij is set, which may characterize the significance of this cluster for the vector representation of the document (step 1014). The description of the invention does not limit the methods of obtaining the weight coefficients q_ij or their values. A distinctive part of the invention is the very possibility of using weight coefficients for clusters, which makes it possible to characterize the significance of each cluster for the vector representation of the document. One possible implementation is to identify "garbage" clusters and exclude them, for example by setting their weight to zero. A "garbage" cluster may be a cluster with a high share of stop words, common words, or words that carry no information for the specific text-analysis task being solved. The values of the weight coefficients may be obtained by automated computation, determined by expert assessment, or obtained in any other way. It is possible that all clusters are equally significant or, as noted above, that some clusters are excluded from use altogether. The result of this step is a plurality of cluster sets with a weight for each cluster in each set (step 1015).
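The "garbage cluster" idea can be sketched for a single partition as follows. The binary 0/1 weights, the `max_stop_share` threshold and the toy stop-word list are assumptions chosen for illustration; the description leaves the weighting method open.

```python
from typing import Dict, List, Set

def cluster_weights(assignment: Dict[str, int], n_clusters: int,
                    stop_words: Set[str],
                    max_stop_share: float = 0.5) -> List[float]:
    """Weight per cluster: 0.0 for 'garbage' clusters, 1.0 otherwise."""
    sizes = [0] * n_clusters
    stops = [0] * n_clusters
    for term, cluster in assignment.items():
        sizes[cluster] += 1
        if term in stop_words:
            stops[cluster] += 1
    weights = []
    for size, stop in zip(sizes, stops):
        share = stop / size if size else 0.0
        # A cluster dominated by stop words is switched off.
        weights.append(0.0 if share > max_stop_share else 1.0)
    return weights

assignment = {"on": 0, "this": 0, "the": 0, "minister": 1, "loan": 2, "of": 2}
q = cluster_weights(assignment, n_clusters=3,
                    stop_words={"on", "this", "the", "of"})
```

In this toy run only cluster 0, which consists entirely of stop words, receives a zero weight.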

[0038] FIG. 3 shows an example of the steps for obtaining a fuzzy assignment of words from the list. All or some of the words in the dictionary are partitioned into clusters N times (step 1013) as follows: for each clustering a particular number of clusters may be chosen; a clustering method is selected that can cluster objects into a given or an unspecified number of clusters; and the clustering is performed. Each clustering is denoted k_i, where i = 1, …, N. The overall plurality of cluster sets is then denoted K = {k₁, k₂, …, k_{N-1}, k_N}. Each i-th clustering k_i is a set of M_i clusters. Each j-th cluster in the i-th partition is denoted C_ij, where j = 1, …, M_i.

Next, for this plurality of cluster sets K = {k₁, k₂, …, k_{N-1}, k_N}, the word-to-cluster assignment indicators are chosen (step 201). Such an indicator makes it possible to define a fuzzy distribution of words over clusters. The indicator may be the distance from an m-skip-n-gram to the center of a cluster. This distance may be normalized by the distances to the centers of all clusters, or by the distances to only a few clusters (for example, the nearest ones). Another possible embodiment is to find a certain number of nearest objects under crisp clustering and to compute the share of objects from each cluster among them. The description of the invention does not limit in any way the method, measure or indicator used to obtain the fuzzy assignment of an object to a cluster. A distinctive feature of the invention is the possibility of using a fuzzy assignment of m-skip-n-grams to clusters. Thus, for each s-th word used, an indicator w^(s)_ij is determined that characterizes the degree to which the s-th word belongs to cluster C_ij (step 202).
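Both indicator variants mentioned above can be sketched in a few lines. The inverse-distance form of the first function, the small `1e-12` guard and all names are assumptions; the text only says that distances may be normalized over all or over the nearest clusters.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]

def dist(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def membership_by_centers(v: Vector, centers: List[Vector],
                          top: int) -> Dict[int, float]:
    """Inverse-distance weights over the `top` nearest centers, summing to 1."""
    nearest = sorted(range(len(centers)), key=lambda j: dist(v, centers[j]))[:top]
    inv = {j: 1.0 / (dist(v, centers[j]) + 1e-12) for j in nearest}
    total = sum(inv.values())
    return {j: w / total for j, w in inv.items()}

def membership_by_neighbors(v: Vector, points: List[Vector], labels: List[int],
                            n_neighbors: int) -> Dict[int, float]:
    """Share of each crisp cluster among the nearest already-clustered objects."""
    order = sorted(range(len(points)), key=lambda i: dist(v, points[i]))
    chosen = order[:n_neighbors]
    counts = Counter(labels[i] for i in chosen)
    return {j: c / len(chosen) for j, c in counts.items()}

centers = [(0.0, 0.0), (4.0, 0.0), (100.0, 0.0)]
w = membership_by_centers((1.0, 0.0), centers, top=2)
s = membership_by_neighbors((1.2,), [(0.0,), (1.0,), (2.0,), (10.0,)],
                            [0, 0, 1, 1], n_neighbors=3)
```

In the first call the vector lies at distance 1 from cluster 0 and distance 3 from cluster 1, so the weights come out close to 0.75 and 0.25, and the far cluster 2 is ignored.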

[0039] Each word can be assigned to any number of clusters. Such a partition with fuzzy assignment can be obtained, for example, with the fuzzy C-means clustering method or by expert assessment. The indicators may also be chosen so that each word belongs to only one cluster, for example when the assignment indicator equals one for a single cluster (not necessarily the one closest to the word's vector) and zero for all others. In some cases the result of such a choice amounts to crisp clustering, but embodiments are possible in which an m-skip-n-gram is not always assigned to its closest cluster. That is, the result of such a choice of assignment indicators will not, in general, necessarily match the output of crisp word-clustering algorithms. The description of the invention does not limit the number of partitions into clusters, which obviously must be greater than zero (>0). Nor does it limit the metric or the mathematical spaces used to perform the clustering and to assign words to clusters.

[0040] Steps (201) and (202) share their preceding steps with steps (1014) and (1015), namely steps (1011), (1012) and (1013). The description of the invention does not limit the relative order in which these steps are performed or their combined use; it only explains them.

[0041] An example of the details of stage (102) is shown in FIG. 4. The input to stage (102) is a document with data in text form. In a particular embodiment of the method, the m-skip-n-grams that are present both in the document and in the m-skip-n-gram list of the placement model are determined first (step 1021).

[0042] In step (1021), the number of occurrences in the document of the m-skip-n-grams from the formed list is counted. The clusters to which these m-skip-n-grams belong are then looked up, and the number of m-skip-n-grams from the document within each cluster is counted.

For example, take the sentence «министр выступит на этой неделе» ("the minister will speak this week"). According to the example in Table 1, cluster 457 occurs once («министр»), cluster 537 once («выступит»), cluster 737 twice («на», «этой») and cluster 368 once («неделе»). The resulting document vector then has the value 1 at positions 457, 537 and 368, the value 2 at position 737, and the value 0 at all other positions. In another particular embodiment of the method this vector is subsequently normalized. In the example, m-skip-n-grams from the list occur 5 times, so the normalized vector has the value 0.2 at positions 457, 537 and 368, the value 0.4 at position 737, and the value 0 at all other positions.
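The worked example can be reproduced with a short sketch. The cluster numbers are those quoted in the example; whitespace tokenization and the choice to use the cluster number directly as the vector index (leaving index 0 unused) are assumptions.

```python
from collections import Counter
from typing import Dict, List

# Cluster numbers for the five words, as quoted in the example from Table 1.
placement: Dict[str, int] = {
    "министр": 457, "выступит": 537, "на": 737, "этой": 737, "неделе": 368,
}
N_CLUSTERS = 1000  # the example model uses 1000 clusters

def document_vector(text: str, normalize: bool = False) -> List[float]:
    """Count, per cluster, the occurrences of listed words in `text`."""
    counts = Counter(placement[w] for w in text.lower().split() if w in placement)
    # Position = cluster number; index 0 stays unused (indexing assumption).
    vec = [0.0] * (N_CLUSTERS + 1)
    for cluster, c in counts.items():
        vec[cluster] = float(c)
    if normalize:
        total = sum(vec)
        if total:
            vec = [x / total for x in vec]
    return vec

raw = document_vector("министр выступит на этой неделе")
norm = document_vector("министр выступит на этой неделе", normalize=True)
```

`raw` holds 1 at positions 457, 537 and 368 and 2 at position 737; `norm` holds 0.2 and 0.4 at the same positions.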

[0043] Next, the document vector is formed. Each position in this vector corresponds to a particular cluster. First, an m-skip-n-gram is taken from the document, and the cluster to which it belongs is determined from the m-skip-n-gram placement model. Then the value at the position of the document vector that corresponds to that cluster is increased by the number of occurrences of this m-skip-n-gram in the document (step 1022). In one particular embodiment, the value is increased not by the number of occurrences but by the product of that number and the weight of the m-skip-n-gram for this cluster (step 1022). At the next step (1023), the values at the positions of the document vector are multiplied by the corresponding cluster weights. The result is the vector representation of the document (step 1024).
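Steps 1022 and 1023 can be sketched together. The data layout (a list of `(cluster, weight)` pairs per term and a list of cluster weights) and all values are hypothetical and serve only to show the order of the two multiplications.

```python
from collections import Counter
from typing import Dict, List, Tuple

def weighted_document_vector(tokens: List[str],
                             placement: Dict[str, List[Tuple[int, float]]],
                             cluster_weight: List[float]) -> List[float]:
    """Accumulate count * term weight per cluster, then scale by cluster weight."""
    vec = [0.0] * len(cluster_weight)
    for term, count in Counter(tokens).items():
        for cluster, term_weight in placement.get(term, []):
            vec[cluster] += count * term_weight           # step 1022
    return [v * q for v, q in zip(vec, cluster_weight)]   # step 1023

placement = {"loan": [(0, 1.0)], "rate": [(0, 0.5), (1, 0.5)], "the": [(2, 1.0)]}
q = [1.0, 2.0, 0.0]  # cluster 2 is switched off by a zero weight
vec = weighted_document_vector(["the", "loan", "rate", "rate", "the"], placement, q)
```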

[0044] In accordance with one or more embodiments of the present invention, an example of a method for automatically obtaining the vector representation of an electronic document may include the following steps. A vector is chosen whose dimension equals the total number of clusters obtained across all clusterings. This vector is initialized with arbitrary initial values. The description of the invention does not limit the values used for the initial initialization of the vector.

[0045] Zero values, for example, may also be used. Each position of the vector corresponds strictly to a particular cluster (FIG. 5). Words are extracted from the document. The extraction may be performed, for example, by a sequential pass through the document, by using some representation of the document that already contains the extracted words and their counts, or in any other way. For each word, its word-to-cluster weights for each cluster in each set are used (step 202).

[0046] As an example, consider sequential extraction of words from the text, without limiting the embodiments of the invention. A word is taken from the text. Using the weights that relate the word to each cluster, the positions in the document vector whose values must be increased are found. The values at these positions are increased by some amount. This amount may be fixed or may vary depending on conditions during document processing. The amount may already incorporate the weight that relates the word to each cluster.
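The sequential pass over a vector that spans all clusterings can be sketched as follows. The layout (clusterings laid end to end, zero-based indices, a `memberships` table of `(i, j, w)` triples) is an assumption used to make every cluster C_ij map to exactly one position.

```python
from typing import Dict, List, Tuple

def cluster_offsets(sizes: List[int]) -> List[int]:
    """Start position of each clustering k_i in the concatenated vector."""
    offsets, total = [], 0
    for m_i in sizes:
        offsets.append(total)
        total += m_i
    return offsets

def vectorize(tokens: List[str], sizes: List[int],
              memberships: Dict[str, List[Tuple[int, int, float]]],
              step: float = 1.0) -> List[float]:
    """Sequential pass: for each word, bump every position it relates to.

    memberships[word] lists (i, j, w): weight w of the word for cluster C_ij.
    """
    offsets = cluster_offsets(sizes)
    vec = [0.0] * sum(sizes)  # zero initialization, one slot per C_ij
    for word in tokens:
        for i, j, w in memberships.get(word, []):
            vec[offsets[i] + j] += step * w
    return vec

sizes = [2, 3]  # 2 clusters in the first clustering, 3 in the second
memberships = {"loan": [(0, 0, 1.0), (1, 2, 0.25), (1, 1, 0.75)],
               "rate": [(0, 1, 1.0), (1, 1, 1.0)]}
vec = vectorize(["loan", "rate", "loan"], sizes, memberships)
```

The resulting vector has five positions: two for the first clustering followed by three for the second.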

[0047] An embodiment is also possible in which the values at positions of the vector are increased by a fixed amount that is then multiplied by the word-to-cluster weight. The description of the present solution does not limit the methods of changing the values at the vector positions associated with a given word. The fuzzy assignment of the word to different clusters may also be taken into account in the final change of the vector values. The amounts by which values are changed are not limited and, in general, may even be negative. To compute the amounts by which the values at vector positions are changed, various methods may be used that take any other characteristics of the words into account. One example is the "term frequency - inverse document frequency" method (TF-IDF), or simply word frequency. The cluster weight, which characterizes the significance of the cluster for the vector representation of the document, may also be taken into account when computing these amounts. Passing through the whole text in this way, the corresponding positions in the document vector are increased. The result of the pass may be the vector form of the document.
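A TF-IDF-based increment can be sketched as below. The smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is one common variant chosen here as an assumption; the description does not fix a particular formula, and the toy corpus and placement are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List

def idf_table(corpus: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf_cluster_vector(doc: List[str], placement: Dict[str, int],
                         idf: Dict[str, float], n_clusters: int) -> List[float]:
    """Add tf * idf of each listed term to the position of its cluster."""
    vec = [0.0] * n_clusters
    for term, tf in Counter(doc).items():
        if term in placement and term in idf:
            vec[placement[term]] += tf * idf[term]
    return vec

corpus = [["loan", "rate"], ["loan", "account"], ["loan", "secret", "secret"]]
idf = idf_table(corpus)
placement = {"loan": 0, "rate": 0, "account": 1, "secret": 2}
vec = tfidf_cluster_vector(corpus[2], placement, idf, n_clusters=3)
```

A term that occurs in every document ("loan") contributes only its raw count, while the rarer "secret" is boosted.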

[0048] The present description does not limit the approaches to using cluster weights, which characterize the significance of each cluster for the vector representation of the document, or the weights relating a word to each cluster in each set. The claimed solution makes it possible to use these weights when composing the vector representation of a document. The obtained values can be used as the vector representation of the document, or any set or subset of them can be used in data processing, on their own or in combination with other data, to obtain a new vector representation of the document. An example of such combined use is concatenation with a TF-IDF vector. In this case, one vector representation of the document is built by the TF-IDF method and another is built by performing the claimed method. The two vectors are then concatenated (joined end to end). The result of the concatenation is the document vector. Further algebraic transformations can additionally be applied to the resulting document vector.
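
By way of non-limiting illustration, the concatenation described in [0048] could be sketched as follows. The function names `tf_idf_vector` and `cluster_count_vector`, the toy corpus, the toy cluster dictionary and the smoothed IDF formula are assumptions made for this sketch and are not taken from the description.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, vocabulary):
    """TF-IDF over a fixed vocabulary, with a smoothed IDF (an assumed variant)."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(doc_tokens) if doc_tokens else 0.0
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vector.append(tf * idf)
    return vector

def cluster_count_vector(doc_tokens, gram_to_cluster, n_clusters):
    """Count how many tokens of the document fall into each cluster."""
    vector = [0.0] * n_clusters
    for token in doc_tokens:
        cluster = gram_to_cluster.get(token)
        if cluster is not None:
            vector[cluster] += 1.0
    return vector

# Toy data (hypothetical): two documents, five terms, two clusters.
corpus = [["loan", "contract", "client"], ["client", "passport", "number"]]
vocabulary = ["client", "contract", "loan", "number", "passport"]
gram_to_cluster = {"loan": 0, "contract": 0, "client": 1, "passport": 1, "number": 1}

doc = corpus[0]
tfidf_part = tf_idf_vector(doc, corpus, vocabulary)
cluster_part = cluster_count_vector(doc, gram_to_cluster, n_clusters=2)

# Concatenation: the TF-IDF vector followed by the cluster-count vector.
document_vector = tfidf_part + cluster_part
```

In this sketch the document vector has as many positions as the vocabulary plus the number of clusters; the order of the two parts is arbitrary as long as it is the same for every document.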

[0049] In one particular embodiment of the method, a fuzzy partitioning of the list of m-skip-n-grams into clusters is used. In this variant, each m-skip-n-gram is assigned to more than one cluster. The degree to which an m-skip-n-gram is assigned to a cluster depends on its distance to that cluster. Both the coordinates of the cluster center and the coordinates of m-skip-n-grams from this and other clusters can be used to calculate this distance. In a special case of this clustering method, each m-skip-n-gram is assigned to all clusters. An example of an algorithm implementing this clustering variant is C-Means.
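
As a hedged illustration of distance-dependent assignment, the sketch below computes membership degrees with the standard Fuzzy C-Means membership formula, u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)). The function name `fuzzy_memberships`, the fuzzifier value m = 2 and the two-dimensional toy centers are assumptions; the description itself only names C-Means as one possible algorithm.

```python
import math

def fuzzy_memberships(point, centers, fuzzifier=2.0):
    """Degree of membership of `point` in every cluster (values sum to 1)."""
    distances = [math.dist(point, center) for center in centers]
    # A point lying exactly on one or more centers belongs only to those clusters.
    on_center = [i for i, d in enumerate(distances) if d == 0.0]
    if on_center:
        share = 1.0 / len(on_center)
        return [share if i in on_center else 0.0 for i in range(len(centers))]
    power = 2.0 / (fuzzifier - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** power for d_k in distances)
        for d_j in distances
    ]

# Toy cluster centers in a 2-D embedding space (hypothetical values).
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
memberships = fuzzy_memberships((1.0, 0.0), centers)
```

Every m-skip-n-gram receives a nonzero degree for every cluster here, which corresponds to the special case mentioned above in which an m-skip-n-gram is assigned to all clusters.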

[0050] In one particular embodiment, each m-skip-n-gram belongs to several clusters. In this case, an m-skip-n-gram can belong to one or more clusters to the same degree.

[0051] In another particular embodiment, each m-skip-n-gram has a weight characterizing its assignment to a given cluster. Unlike the previous particular variants, the weight here can characterize more than the proximity of the m-skip-n-gram to the cluster. For example, the density of the clusters can be taken into account when calculating the weight, so that an m-skip-n-gram is assigned to a greater degree to a nearby cluster that is less dense. The weight is a multiplier that is applied for each m-skip-n-gram with each cluster. The weight can also be zero.

[0052] In yet another particular embodiment, the list of m-skip-n-grams is clustered by their vector representations more than once. In one such variant, the same clustering algorithm is run several times, each time with a different number of clusters. The resulting document vector is the concatenation of the vectors obtained for each clustering (see the example in Fig. 5). In another case, different clustering algorithms are used to partition the list into the same or a different number of clusters. In yet another case, the same number of clusters and the same algorithm are used, but with different initializations, provided that the algorithm can produce different results for different initializations (K-Means, C-Means, spectral clustering, Gaussian mixtures, and others). The number of clusterings, the number of clusters in each clustering and the clustering algorithm are selected depending on the problem being solved. In a particular case, if the list contains more than 10,000 m-skip-n-grams, partitions into 50, 100, 200, 300, 500, 700, 1000, 1500, 2000, 3000 and 5000 clusters can be used. In another particular case, separate lists are compiled, for example for 0-skip-1-grams (single words), 0-skip-2-grams (word bigrams) and 0-skip-3-grams (word trigrams). Each of these lists is clustered separately several times. The final document vector is the concatenation of the vectors for each clustering of each list.
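
The concatenation over several clusterings can be sketched as follows. This is a minimal illustration under stated assumptions: each clustering is represented as a pair of a dictionary and a cluster count, and the names `count_vector` and `multi_clustering_vector` as well as the toy dictionaries are hypothetical.

```python
def count_vector(grams, gram_to_cluster, n_clusters):
    """Occurrence counts per cluster for one clustering."""
    vector = [0] * n_clusters
    for gram in grams:
        cluster = gram_to_cluster.get(gram)
        if cluster is not None:  # grams missing from the dictionary are skipped
            vector[cluster] += 1
    return vector

def multi_clustering_vector(grams, clusterings):
    """Concatenate the count vectors obtained for every clustering."""
    result = []
    for gram_to_cluster, n_clusters in clusterings:
        result.extend(count_vector(grams, gram_to_cluster, n_clusters))
    return result

# Two clusterings of the same list: a coarse one (2 clusters) and a finer one (3).
coarse = ({"loan": 0, "credit": 0, "passport": 1, "visa": 1}, 2)
fine = ({"loan": 0, "credit": 1, "passport": 2, "visa": 2}, 3)

grams = ["loan", "credit", "loan", "passport", "unknown"]
vector = multi_clustering_vector(grams, [coarse, fine])
```

The length of the resulting vector is the sum of the cluster counts of all clusterings, here 2 + 3 = 5.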

[0053] In another particular embodiment, weights are used for each cluster of m-skip-n-grams in each clustering (if there are several). These weights characterize the significance of the clusters for the vector representation of the document. A weight can also be zero, in which case the occurrence count of the m-skip-n-grams in that cluster is zeroed. This approach is useful when clusters that contain stop words, i.e. words that carry no thematic meaning (in Russian, for example, и, к, у, о, при, на and others), need to be excluded from the vector. In particular cases, the cluster weights can be calculated by machine learning methods or set by expert judgment.
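
One possible way of zeroing stop-word clusters is sketched below. The rule used here (a cluster receives weight 0 if it contains any stop word, and weight 1 otherwise) is an illustrative assumption standing in for an expert estimate; the function names and the English toy stop words are likewise hypothetical.

```python
def cluster_weights_from_stop_words(gram_to_cluster, n_clusters, stop_words):
    """Weight 0.0 for every cluster that contains a stop word, 1.0 otherwise."""
    weights = [1.0] * n_clusters
    for gram, cluster in gram_to_cluster.items():
        if gram in stop_words:
            weights[cluster] = 0.0
    return weights

def apply_cluster_weights(counts, weights):
    """Multiply each cluster's occurrence count by the weight of that cluster."""
    return [count * weight for count, weight in zip(counts, weights)]

# Toy dictionary: cluster 2 gathers the stop words.
gram_to_cluster = {"loan": 0, "credit": 0, "passport": 1, "and": 2, "on": 2, "at": 2}
weights = cluster_weights_from_stop_words(gram_to_cluster, 3, {"and", "on", "at"})

counts = [12, 3, 40]  # occurrence counts per cluster for some document
weighted_counts = apply_cluster_weights(counts, weights)
```

The large count of the stop-word cluster no longer dominates the document vector after weighting.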

[0054] The list of m-skip-n-grams can also be extended with new m-skip-n-grams. In this variant, a method for calculating the vector of an m-skip-n-gram is selected. If an m-skip-n-gram that is not in the list occurs in a document, a vector representation (vector) is calculated for it. This vector is then related to the clusters obtained at the stage of clustering the list of m-skip-n-grams, so that the new m-skip-n-gram is assigned to at least one cluster. The further steps of the method are the same as when the m-skip-n-gram is already present in the model for placing m-skip-n-grams in clusters.

[0055] The list of m-skip-n-grams used can be formed from the occurrence of m-skip-n-grams in text data obtained from external data sources. Any external body of text data can serve as such a set of texts, for example a set of analyzed texts on various topics collected from news sites. Processing, classifying and clustering the collected information makes it possible to find semantic similarities and analogues, to rank search results, or to solve any other language processing problem. The claimed solution does not limit the source or nature of the set of text data used.

[0056] An example of such a variant is the problem of classifying text documents. In this case, the list of m-skip-n-grams is formed from the initial set of documents. For example, all single words and bigrams that occur in these texts are taken, or the 200,000 most frequent bigrams. Vector representations are calculated for the items of these lists. These vectors are then passed to clustering, which is carried out by the methods described above.

[0057] When implementing the present invention, all the described variants can be used in any possible combination. An example of such a particular variant is the case in which all single words, word bigrams and word trigrams that occur in the available data set are selected. Vectors of the word n-grams are calculated for each list. Each list of vectors is then submitted to multiple clusterings. In each clustering, a weight of assignment to a cluster is chosen for each n-gram. Each analyzed document is then taken in turn. From the occurrence of n-grams, a vector is obtained for each clustering. The values for the clusters are multiplied by the cluster weights. Each such vector is normalized. All the vectors are concatenated into a common vector. The description of the patent does not limit the combinations in which the various approaches are used.
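
The combined variant of [0057] can be sketched end to end as follows. This is a minimal illustration, not the claimed implementation: the helper names, the representation of a model as a pair of assignment weights and cluster weights, the choice of L2 normalization and all toy values are assumptions.

```python
import math

def word_ngrams(tokens, n):
    """Contiguous word n-grams (0-skip-n-grams) of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned as is."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def clustering_vector(grams, assignment_weights, cluster_weights):
    """assignment_weights maps an n-gram to {cluster index: weight in that cluster}."""
    vector = [0.0] * len(cluster_weights)
    for gram in grams:
        for cluster, weight in assignment_weights.get(gram, {}).items():
            vector[cluster] += weight
    weighted = [v * w for v, w in zip(vector, cluster_weights)]
    return l2_normalize(weighted)

tokens = "the client signed the loan agreement".split()

# One toy model per n: (n-gram assignment weights, cluster weights).
models = {
    1: ({"client": {0: 1.0}, "loan": {1: 0.7, 0: 0.3}, "the": {2: 1.0}}, [1.0, 1.0, 0.0]),
    2: ({"loan agreement": {0: 1.0}, "the client": {1: 1.0}}, [1.0, 1.0]),
    3: ({"signed the loan": {0: 1.0}}, [1.0]),
}

document_vector = []
for n, (assignment_weights, cluster_weights) in models.items():
    grams = word_ngrams(tokens, n)
    document_vector += clustering_vector(grams, assignment_weights, cluster_weights)
```

Normalizing each part before concatenation keeps a list with many matches (for example single words) from outweighing a list with few matches (for example trigrams).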

[0058] A distinctive feature of the invention is that it makes it possible to take previously unknown m-skip-n-grams into account. Without limiting the invention, the following approach to handling new m-skip-n-grams can be given. For example, when an m-skip-n-gram that is not in the dictionary in use is found while passing through the textual representation of the document, a vector representation of this m-skip-n-gram is formed. The m-skip-n-gram is then assigned to the cluster whose center is closest to it.
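
The nearest-center assignment of a previously unknown m-skip-n-gram could look as follows. The sketch assumes Euclidean distance and uses a hypothetical lookup table `toy_embeddings` in place of a real embedding model; neither choice is prescribed by the description.

```python
import math

def nearest_cluster(vector, centers):
    """Index of the cluster center closest to `vector` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))

def cluster_for(gram, gram_to_cluster, embed, centers):
    """Look the gram up; if it is unseen, embed it and attach it to the nearest center."""
    if gram not in gram_to_cluster:
        gram_to_cluster[gram] = nearest_cluster(embed(gram), centers)
    return gram_to_cluster[gram]

# Hypothetical stand-in for an embedding model.
toy_embeddings = {"mortgage": (0.9, 0.2), "visa": (0.1, 0.8)}
embed = toy_embeddings.__getitem__

centers = [(1.0, 0.0), (0.0, 1.0)]
gram_to_cluster = {"loan": 0, "passport": 1}

mortgage_cluster = cluster_for("mortgage", gram_to_cluster, embed, centers)
visa_cluster = cluster_for("visa", gram_to_cluster, embed, centers)
```

Storing the result back into the dictionary means that the embedding is computed only the first time a new m-skip-n-gram is encountered.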

[0059] The computation is accelerated because the claimed solution does not use a neural network (GPT, BERT, and others) directly. The text of the document does not have to be passed through a large number of neural network layers, which is what causes neural network running times of several minutes. The neural network itself is applied only at the training stage, to obtain the embeddings of the m-skip-n-grams. These embeddings carry the high quality of the phrase vector representations produced by the given neural network. The embeddings are then combined into various semantic groups by multiple clusterings. Thus, the trained model used in the operation of the present solution consists of simple dictionaries of m-skip-n-grams in which each m-skip-n-gram is mapped to a cluster number, and obtaining the vector representation of a document reduces to looking up m-skip-n-grams in the dictionary.

[0060] When the m-skip-n-gram dictionaries are stored, for example, as hash tables, such a lookup has O(1) complexity on average, which is the fastest lookup in the general case. Because the claimed solution uses not the whole text of the document, as neural networks do, but its decomposition into m-skip-n-grams, some drop in the accuracy of the analysis is possible. The drop is, however, very small, because important, semantically significant m-skip-n-grams are used and are combined into diverse semantic clusters whose frequencies form the vector representation of the document, which preserves most of the semantic relationships between the words of the document. In this way, the processing time for a text of about 10^4 words is reduced from several minutes (about 10^2 seconds) to tenths of a second (about 10^-1 to 1 second). The speed-up can reach a factor of 1000. The drop in classification accuracy is insignificant and acceptable given such an increase in processing speed. The small drop in accuracy distinguishes the invention from known classification methods based only on TF-IDF.

[0061] The claimed solution also reduces memory requirements. A neural network requires about 10^2 to 10^3 MB of memory (RAM or video memory) to operate. As noted earlier in the description, neural networks can occupy 680 MB (BERT) or 1.8 GB (LaBSE). The model of the invention consists of dictionaries of m-skip-n-grams with cluster numbers. For 100,000 words (0-skip-1-grams), such a dictionary occupies slightly more than 1 MB of memory; for 100,000 bigrams (m-skip-2-grams), less than 3 MB. Thus, even with 10 different clusterings each for words, bigrams and trigrams, the whole model occupies less than 100 MB, which is several times smaller than neural network models.

[0062] The claimed solution can be implemented on commonly available computers. Because the stages of obtaining the vector representation of a text document consist of table lookups and simple vector normalization operations, no dedicated graphics processing units (GPUs) are required. The invention does not require the large number of matrix computations that neural networks do and can run efficiently on the central processing unit (CPU) of a computer. Obtaining the vector representation and classifying an electronic document takes about 10^-1 seconds even on office personal computers (laptops) that are up to 5 years old.

[0063] The claimed solution also makes it possible to separate the stage of obtaining the vector representation of an electronic document from the stage of its classification. The vector representation can be obtained on the user's computer, and the resulting vector can be sent to a server for centralized classification. The size of such a vector is 1-10 KB. On the server, the document is classified by the classification model located there (the m-skip-n-gram placement model). This allows the classification model to be updated centrally. Such an approach cannot be implemented efficiently with neural networks, because the whole model, whose parts for obtaining the vector representation and for classification are inseparable, is computed on the user side. Extracting the classification layers from the neural network and moving them to the server makes no sense, because they are tightly bound, through the logic of operation, to the values of the entire neural network. These layers can be retrained effectively only by retraining the whole network. With neural networks, centralized operation of the classifier on the server is possible only if the entire text of the document is sent to the server, but such data can exceed 100 MB in size and requires high computing power on the server, including graphics processing power.
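
To illustrate the order of magnitude of the transmitted data, the sketch below serializes a document vector as 32-bit floats. The encoding, the helper names `pack_vector` and `unpack_vector`, and the vector length of 2000 positions are assumptions chosen for the example; the description states only that the vector occupies 1-10 KB.

```python
from array import array

def pack_vector(vector):
    """Serialize a document vector as a sequence of 32-bit floats."""
    return array("f", vector).tobytes()

def unpack_vector(payload):
    """Restore the vector on the receiving side (for example, the server)."""
    values = array("f")
    values.frombytes(payload)
    return list(values)

# A hypothetical document vector with 2000 positions, mostly zeros.
vector = [0.0] * 2000
vector[5] = 0.5

payload = pack_vector(vector)
size_in_bytes = len(payload)  # 4 bytes per position
restored = unpack_vector(payload)
```

With 4 bytes per position, 2000 positions give 8000 bytes, which falls within the 1-10 KB range stated above. A sparse encoding would reduce this further, but that is outside the scope of this sketch.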

[0064] The present solution makes it possible to process an entire document even when its content is about 10^5 words long. Methods based on neural networks take a very long time to process long documents: a document of 100,000 words can take tens of minutes. Certain neural networks require the semantic content of the document to be evenly distributed, because otherwise the content at the end of the document is treated as more significant. Their processing time grows non-linearly with the size of the document. This is due to the architecture of the inner layers of neural networks, their use of recurrent approaches, or the use of the attention mechanism (attention model) in the transformer architecture. To process such a document in an acceptable time, only a part of the document is taken, usually about 1000 words. The disadvantage of this approach is that the selected part may be unrepresentative and may not correspond to the meaning and content of the whole document. In the described invention, the processing time depends linearly on the size of the document. The invention can process the entire document, taking into account the semantics of each of its parts. This is achieved because every m-skip-n-gram used is taken into account, and the processing time depends only on their number. Thus, the processing time for a document of about 10^5 words is about 1 second, and in the resulting vector representation the content of the beginning of the document carries the same weight as the content of its middle or its end.

[0065] The claimed method (100) can be used, in particular, for searching for and selecting similar banking documents or for clustering. Once a document (textual information) has been classified as containing confidential information, the effect of the invention can be one or more of the following actions:

- blocking the computer and other access of the user if the user has no right to process or view confidential information;

- collecting from the user's computer, or from another electronic system, data about the user's accounts, IP address, network environment, geolocation, hardware and software in use, or other data that allows the user to be identified, and then forwarding the collected data, together with a signal that attention is required, to an authorized unit for verification and investigation;

- blocking the forwarding of an electronic message and/or documents if its content or attachment contains confidential information and the message is being sent to an external mailbox (outside the protected perimeter of the organization), or if the recipients include employees who, under the established access policies, have no access to confidential information.

[0066] Based on the result of classifying confidential information, the control action applied to the user's computer can also include blocking the computer directly and reporting the detected violation, for example by displaying a notification on the workstation screen and generating data packets or a signal for the computer of a security officer. Each violation of the established access to confidential information can be stored in a database for logging purposes. The examples given are only particular cases of the resulting action when the claimed solution is implemented and do not limit other possible uses of the solution for preventing the leakage of sensitive data.

[0067] Fig. 6 shows a general view of a computing device (400) suitable for implementing the claimed solution. The device (400) can be, for example, a server or another type of computing device that can be used to implement the claimed technical solution, including a device that is part of a cloud computing platform.

[0068] In general, the computing device (400) contains, connected by a common data exchange bus, one or more processors (401), memory means such as RAM (402) and ROM (403), input/output interfaces (404), input/output devices (405), and a network communication device (406).

[0069] The processor (401) (or several processors, or a multi-core processor) can be selected from the range of devices in wide use today, for example from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™ and the like. A graphics processor, for example from Nvidia, AMD, Graphcore and others, can also be used as the processor (401).

[0070] The RAM (402) is random access memory intended for storing machine-readable instructions executable by the processor (401) to perform the necessary logical data processing operations. The RAM (402) typically contains executable instructions of the operating system and of the relevant software components (applications, program modules, etc.).

[0071] The ROM (403) is one or more persistent data storage devices, for example a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0072] Various types of I/O interfaces (404) are used to organize the operation of the components of the device (400) and of externally connected devices. The choice of interfaces depends on the particular design of the computing device; they may include, without limitation: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0073] Various I/O means (405) are used for user interaction with the computing device (400), for example a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality means, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification means (a retinal scanner, a fingerprint scanner, a voice recognition module), etc.

[0074] The networking means (406) enables the device (400) to transmit data over an internal or external computer network, for example an intranet, the Internet, a LAN, and the like. The one or more means (406) may be, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0075] Additionally, satellite navigation means may be used as part of the device (400), for example GPS, GLONASS, BeiDou, Galileo.

[0076] Thus, the claimed solution achieves the following advantages:

Increase the speed of processing the text data of a banking document.

Reduce the RAM consumption required for processing the text data of a banking document.

Process documents on the user's computer with an acceptable running time and high classification quality.

Process very large documents (on the order of 10^5 words) in a time on the order of one second.

Perform processing on commonly available personal computers with only a central processing unit (CPU), without using graphics processing units (GPUs).

Better preserve semantic meaning by clustering m-skip-n-grams of words rather than only individual words, as is done in known patents.

Improve the accuracy of representing text through latent topics by clustering the same m-skip-n-grams of words multiple times, unlike known patents, which nowhere state that multiple or repeated clustering may be performed. In the approach of this patent it is possible, for example, to cluster into 100, 200, 357, 500, 500 (500 again, but with a new initialization), 1000, 5000, and 10000 clusters.

Preserve the different senses of words by mapping words to several clusters, which retains the ambiguity of terms, in contrast to known patents, where each word is assigned to only one (the nearest) cluster.

Using weights (a priori or a posteriori) for clusters makes it possible to reduce the significance of clusters containing general words, or to exclude them altogether, and thereby to improve the quality of analysis, in particular of computing the similarity of two texts, in contrast to known patents, which have no cluster weights.

Increase the accuracy of analysis by handling new, previously unknown words.

Broaden the use of the resulting vector representation and allow the resulting vector to be used directly, without training classifiers, for tasks such as identifying semantic similarity, searching for analogues, ranking search results, and others.
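Several of the advantages listed above (mapping one m-skip-n-gram to several clusters, cluster weights, direct comparison of document vectors) can be illustrated with a minimal Python sketch. It is not the implementation of the application: the membership table `MEMBERSHIP`, the weights `CLUSTER_WEIGHT`, and all function names are hypothetical, and for brevity only contiguous 1-grams and 2-grams (the m = 0 case) are counted.

```python
from collections import Counter
from math import sqrt

# Hypothetical soft assignment: each n-gram belongs to several clusters
# with a membership weight (values invented for illustration).
MEMBERSHIP = {
    ("account",): {0: 0.7, 1: 0.3},
    ("balance",): {0: 1.0},
    ("bank", "account"): {0: 0.9, 2: 0.1},
    ("the",): {3: 1.0},
    ("river", "bank"): {2: 1.0},
}
NUM_CLUSTERS = 4
# Hypothetical cluster weights: cluster 3 holds general words and is excluded.
CLUSTER_WEIGHT = [1.0, 1.0, 1.0, 0.0]


def ngrams(tokens, n):
    """Return contiguous n-grams (the m = 0 case of m-skip-n-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_vector(text):
    """Count n-gram occurrences and sum them per cluster, scaled by weights."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    vector = [0.0] * NUM_CLUSTERS
    for gram, count in counts.items():
        # N-grams absent from the table contribute nothing in this sketch.
        for cluster, weight in MEMBERSHIP.get(gram, {}).items():
            vector[cluster] += count * weight * CLUSTER_WEIGHT[cluster]
    return vector


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


doc_a = document_vector("the bank account balance")
doc_b = document_vector("the river bank")
similarity = cosine(doc_a, doc_b)
```

Because the general-word cluster has weight zero, the shared word "the" does not raise the similarity of the two texts; only the small membership of "bank account" in the cluster shared with "river bank" does.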

[0077] The application materials presented disclose preferred embodiments of the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to those skilled in the relevant art.

Claims

1. A computer-implemented method of vector representation of an electronic text document for determining the category of confidential information contained therein, performed by a processor and comprising the steps of:

- forming at least one model for placing m-skip-n-grams in clusters, wherein an m-skip-n-gram is at least a single word, and wherein forming said model comprises:

determining the list of used m-skip-n-grams;

converting to a vector representation of each m-skip-n-gram from the list;

clustering m-skip-n-grams by their vector representations;

- processing at least one text document using the obtained m-skip-n-gram placement model, during which:

determining text document clusters based on the occurrence of m-skip-n-grams;

summing the number of occurrences of m-skip-n-grams from each cluster;

- determining the category of confidential information in the text document based on the m-skip-n-gram placement model.

2. The method according to claim 1, characterized in that fuzzy partitioning of the list of m-skip-n-grams into clusters is used.

3. The method according to claim 1, characterized in that each m-skip-n-gram belongs to several clusters.

4. The method according to claim 1, characterized in that each m-skip-n-gram has a weight characterizing its correlation to a given cluster.

5. The method according to claim 1, characterized in that the clustering of m-skip-n-grams according to their vector representations is performed more than once.

6. The method according to claim 1, characterized in that weights are additionally used for clusters of m-skip-n-grams, characterizing the significance of clusters for the vector representation of the document.

7. The method according to claim 1, characterized in that vector representations of m-skip-n-grams not included in the list are additionally obtained, and these m-skip-n-grams are correlated, according to their vector representations, to at least one cluster.

8. The method according to claim 1, characterized in that the list or part of the list of used m-skip-n-grams is formed based on the occurrence of m-skip-n-grams in text data obtained from external data sources.

9. A system for obtaining a vector representation of an electronic document, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the method according to any one of claims 1-8.

10. A computer-implemented method for determining the category of confidential information in a text document, performed by a processor and comprising the steps of:

- preliminarily forming at least one model for placing m-skip-n-grams in clusters on a given topic, wherein an m-skip-n-gram is at least a single word, and wherein forming said model comprises:

determining the list of used m-skip-n-grams;

converting to a vector representation of each m-skip-n-gram from the list;

clustering m-skip-n-grams by their vector representations;

- receiving at least one electronic text document;

- processing the received text document using the obtained m-skip-n-gram placement model, during which:

determining text document clusters based on the occurrence of m-skip-n-grams;

summing the number of occurrences of m-skip-n-grams from each cluster;

- determining the category of confidential information in the electronic text document using the m-skip-n-gram placement model.
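The model-formation steps recited in claims 1 and 10 (determining the list of m-skip-n-grams, converting them to vectors, clustering the vectors) can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: `skip_ngrams` uses one common reading of an m-skip-n-gram (n tokens taken in order while skipping at most m tokens in total), the two-dimensional `EMBEDDING` table is a hypothetical stand-in for a real vectorization step, and a small k-means routine is swapped in as the clustering method, which the claims do not name.

```python
from itertools import combinations


def skip_ngrams(tokens, n, m):
    """Enumerate n-token tuples that skip at most m tokens in total (assumed reading)."""
    grams = []
    for start in range(len(tokens)):
        # The remaining n - 1 tokens come from a window of n - 1 + m positions.
        window = range(start + 1, min(start + n + m, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[start],) + tuple(tokens[j] for j in rest))
    return grams


# Hypothetical 2-D embeddings; a real system would use a trained model.
EMBEDDING = {
    ("bank",): (1.0, 0.0),
    ("account",): (0.9, 0.1),
    ("river",): (0.0, 1.0),
    ("water",): (0.1, 0.9),
}


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_labels(vectors, k, iterations=10):
    """Plain k-means with deterministic initialization from the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


corpus = ["bank", "river", "account", "water"]
# Step 1: list of used m-skip-n-grams (here 1-grams that have an embedding).
used = [gram for gram in skip_ngrams(corpus, 1, 0) if gram in EMBEDDING]
# Step 2: vector representation of each m-skip-n-gram in the list.
vectors = [EMBEDDING[gram] for gram in used]
# Step 3: clustering by vector representation gives the placement model.
placement = dict(zip(used, kmeans_labels(vectors, k=2)))
```

The resulting `placement` dictionary plays the role of the m-skip-n-gram placement model in this sketch: it maps each listed m-skip-n-gram to a cluster index that later document processing can count against.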