RU2775351C1

RU2775351C1 - Method and system for obtaining a vector representation of an electronic document

Info

Publication number: RU2775351C1
Application number: RU2021115760A
Authority: RU
Inventors: Кирилл Евгеньевич Вышегородцев; Дмитрий Георгиевич Давидов; Дмитрий Юрьевич Рюпичев; Александр Викторович Балашов
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Filing date: 2021-06-01
Publication date: 2022-06-29

Abstract

FIELD: computing technology.

SUBSTANCE: computer-implemented method for obtaining a vector representation of an electronic document, executed by a processing unit and including the stages of: generating a model of the placement of m-skip-n-grams by clusters, wherein generating said model involves: determining the list of m-skip-n-grams to be used; converting each m-skip-n-gram from the list into a vector representation; clustering the m-skip-n-grams; processing a text document using the resulting model, which involves: counting the occurrences of m-skip-n-grams in the document; determining the clusters of the document based on the occurrences of m-skip-n-grams; summing the number of occurrences of m-skip-n-grams from each cluster; forming a vector representation of the document.

EFFECT: possibility of preserving different semantics of words in the document by matching words to multiple clusters.

10 cl, 6 dwg, 1 tbl

Description

FIELD OF TECHNOLOGY

[0001] The present invention relates to computing systems in a broad sense, and more specifically to systems and methods for processing natural languages, artificial languages, and any sign systems. It can be used in information processing systems, databases, and electronic storage.

BACKGROUND OF THE INVENTION

[0002] Automatic processing, transmission, and storage of documents may include classifying source documents, clustering them, and other actions performed by comparing the vector representation of a document with the vector representation of another document, or of any set or group of documents. Embodiments of this invention may be similar to the solutions described earlier in patents RU 2701995 C2, RU 2583716 C2, and RU 2254610 C2.

[0003] The main disadvantages of existing solutions are as follows:

- when words are clustered, the semantic connection between words in the text is lost;

- the clustering performed does not generalize well enough to describe a wide range of possible hidden topics of a text;

- a word can be associated with only one cluster, which loses the meaning the word has in different contexts, for example, the Russian word «ключ» as a spring of water and «ключ» as a key in cryptographic systems;

- the resulting clustering is inefficient for calculating distances, because all texts share many common words;

- new words that were not previously in the dictionary in use cannot be taken into account;

- limited applicability to classification tasks.

SUMMARY OF THE INVENTION

[0004] The claimed invention is aimed at solving the technical problem of improving the quality of analysis (classification, clustering) of text data by converting the data into vector form.

[0005] The technical result is an increase in the accuracy of representing text data in vector format, achieved by using vector representations of m-skip-n-grams of words and applying them to the subsequent clustering of a text document in order to convert it into vector form.

[0006] An additional result of the claimed solution is the preservation of the semantic meaning of the text when it is converted into a vector representation, achieved by clustering the m-skip-n-grams of words directly.

[0007] The claimed technical result is achieved by a computer-implemented method for obtaining a vector representation of an electronic document, which is performed by a processor and comprises the steps of:

- forming at least one model of the placement of m-skip-n-grams by clusters, wherein an m-skip-n-gram represents at least a single word, and forming said model involves:

• determining the list of m-skip-n-grams to be used;

• converting each m-skip-n-gram from the list into a vector representation;

• clustering the m-skip-n-grams according to their vector representations;

- processing at least one text document using the obtained m-skip-n-gram placement model, during which:

• the occurrences of m-skip-n-grams in the text document are counted;

• the clusters of the text document are determined based on the occurrences of m-skip-n-grams;

• the number of occurrences of m-skip-n-grams from each cluster is summed;

• a vector representation of the text document is formed from the ordered sequence of the m-skip-n-gram sums.

[0008] In one particular embodiment of the method, a fuzzy partitioning of the list of m-skip-n-grams into clusters is used.

[0009] In another particular embodiment of the method, each m-skip-n-gram belongs to several clusters.

[0010] In another particular embodiment of the method, each m-skip-n-gram has a weight characterizing its proximity to a given cluster.
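One possible way to obtain such proximity weights is sketched below. The text does not prescribe a formula, so the inverse-distance weighting, the Euclidean metric, and the name `membership_weights` are assumptions made only for illustration.

```python
import math
from typing import List, Sequence


def membership_weights(vector: Sequence[float],
                       centroids: Sequence[Sequence[float]]) -> List[float]:
    """Return one weight per cluster; the weights sum to 1.

    Assumption: a cluster is represented by its centroid, and the weight
    grows as the Euclidean distance to that centroid shrinks.
    """
    distances = [math.dist(vector, centroid) for centroid in centroids]
    # A vector that coincides with a centroid belongs entirely to it.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]


# Hypothetical two-dimensional centroids of two clusters.
toy_centroids = [(0.0, 0.0), (4.0, 0.0)]
toy_weights = membership_weights((1.0, 0.0), toy_centroids)
print(toy_weights)
```

With weights of this kind, a single m-skip-n-gram contributes to several clusters at once, which is how one word can keep more than one meaning.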

[0011] In another particular embodiment of the method, the clustering of m-skip-n-grams according to their vector representations is performed more than once.

[0012] In another particular embodiment of the method, weights are additionally used for the clusters of m-skip-n-grams, characterizing the significance of the clusters for the vector representation of the document.

[0013] In another particular embodiment of the method, vector representations are additionally obtained for m-skip-n-grams that are not in the list, and these m-skip-n-grams are assigned to at least one cluster according to their vector representations.

[0014] In another particular embodiment of the method, the list of m-skip-n-grams to be used, or part of it, is formed based on the occurrence of m-skip-n-grams in text data obtained from external data sources.
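The handling of m-skip-n-grams that are missing from the list ([0013]) could look roughly like the sketch below. It assumes that clusters are summarized by centroids and that the nearest centroid in Euclidean distance decides the assignment; neither choice is stated in the text, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def nearest_cluster(vector: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to `vector` (Euclidean distance)."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


def assign_unlisted(gram: str,
                    vector: Sequence[float],
                    centroids: Sequence[Sequence[float]],
                    placement: Dict[str, int]) -> int:
    """Add a previously unseen m-skip-n-gram to the placement model."""
    cluster = nearest_cluster(vector, centroids)
    placement[gram] = cluster
    return cluster


# Hypothetical model with two clusters and one listed word.
toy_centroids = [(0.0, 0.0), (10.0, 10.0)]
toy_placement = {"river": 0}
# The vector of the new word is assumed to come from some embedding method.
assign_unlisted("cipher", (9.0, 8.0), toy_centroids, toy_placement)
print(toy_placement)
```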

[0015] The claimed solution is also implemented by a system for obtaining a vector representation of an electronic document, which contains at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the above method.

[0016] The claimed technical result is also achieved by a computer-implemented method of searching for sources of information, which is performed by a processor and comprises the steps of:

- forming at least one model of the placement of m-skip-n-grams by clusters for a given topic, wherein an m-skip-n-gram represents at least a single word, and forming said model involves:

• converting each m-skip-n-gram from the list into a vector representation;

- obtaining at least one source of information in the form of an electronic text document;

- processing the obtained text document using the obtained m-skip-n-gram placement model, during which:

• the clusters of the text document are determined based on the occurrences of m-skip-n-grams;

• a vector representation of the text document is formed from the ordered sequence of the m-skip-n-gram sums;

- determining whether the vector representation of the document belongs to the given topic.

[0017] In this document, an "m-skip-n-gram of words" (or simply "m-skip-n-gram", see http://www.machinelearning.ru/wiki/images/7/78/2017_DrapakSN.pdf) means a sequence of n words obtained from a sequence of words in some text, preserving the order of the words in the text, where after each of the first n - 1 of these words m words of the original sequence are removed. For example, 0-skip-1-grams of words are simply the individual words of a text. 0-skip-2-grams are bigrams of words (pairs of consecutive words in the text). 0-skip-3-grams are trigrams of words (triples of consecutive words in the text). To build a 1-skip-2-gram, a sequence of three words is taken and its second word is removed, which leaves the first and third words. To build a 2-skip-4-gram, a sequence of 10 words is taken: the first word is kept, the next 2 words are removed, the following word is kept, the next 2 words are removed, and so on until a sequence of 4 words is obtained. For example, take the Russian sentence «в соответствии с одним или более вариантами реализации настоящего изобретения» ("in accordance with one or more embodiments of the present invention"), which consists of 10 words. The words «соответствии с», «или более», and «реализации настоящего» are removed from it. The 2-skip-4-gram for this sentence is then «в одним вариантами изобретения». When constructing m-skip-n-grams, a "word" may be not only a word of the language but also any punctuation mark, preposition, conjunction, or any independent unit of the language. Everywhere below, without loss of generality and to simplify the presentation, "word" ("words" and so on) means any possible m-skip-n-gram of words or sequence of m-skip-n-grams of words, unless stated otherwise, for example by "individual words", "single words" and the like. The original term m-skip-n-gram is also used.
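The construction described in [0017] can be sketched in a few lines. The function name `m_skip_n_grams` and the whitespace tokenization are assumptions for illustration; the window length n + (n - 1)·m follows from the examples in the text (3 words for a 1-skip-2-gram, 10 words for a 2-skip-4-gram).

```python
from typing import List, Sequence, Tuple


def m_skip_n_grams(tokens: Sequence[str], m: int, n: int) -> List[Tuple[str, ...]]:
    """Return every m-skip-n-gram of `tokens`, in order of appearance."""
    if m < 0 or n < 1:
        raise ValueError("m must be >= 0 and n must be >= 1")
    window = n + (n - 1) * m   # words spanned by one m-skip-n-gram
    step = m + 1               # keep one word, then skip m words
    return [
        tuple(tokens[start:start + window:step])
        for start in range(len(tokens) - window + 1)
    ]


# The 10-word sentence used as the example in the text.
sentence = "в соответствии с одним или более вариантами реализации настоящего изобретения"
words = sentence.split()
print(m_skip_n_grams(words, 2, 4))
# [('в', 'одним', 'вариантами', 'изобретения')]
```

Setting m = 0 gives ordinary n-grams, so `m_skip_n_grams(words, 0, 2)` yields the bigrams of the sentence.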

[0018] The "embedding" of a word, or the "vector representation" of a word, or simply the "vector" of a word, means a numeric vector that is obtained from words or other linguistic entities, is defined for the word, and has a fixed dimension for the method by which it is obtained. In other words, the vector representation of a word is an ordered sequence of numbers, a numeric vector of some size, where each word has its own specific numeric vector. In the simplest case, word embeddings can be obtained by numbering the words in some dictionary and placing the value 1 in a vector whose dimension equals the number of words in that dictionary. All other positions then hold the value 0. For example, for the Russian language one can use Dahl's Explanatory Dictionary and number all its words from the first to the last. The word «абажур» ("lampshade") would then have the value 1 at position 3, «абанат» the value 1 at position 7, and so on. If the dictionary contains 200,000 words, the embedding has dimension 200,000. This method of building embeddings is called one-hot encoding. The description of the invention does not limit the method for obtaining word vectors. These vectors can be obtained, for example, by a neural network that implements a mathematical transformation from a space with one dimension per word to some vector space with a higher dimension, or by other methods that associate each m-skip-n-gram with its own vector of numbers of a given dimension. Such vector representations of words can be taken from already known sets of vectorized words (Word2Vec, Glove, FastText and others), with or without modification. Clustering means grouping a set of objects into subsets (clusters) in such a way that objects from one cluster are more similar to each other than to objects from other clusters according to a given criterion. The "weight" of an element, such as a "cluster weight" or an "m-skip-n-gram weight", can be understood as a mathematical construct: coefficients or multipliers used in summation, integration, averaging and the like in order to give some elements greater significance in the resulting value than other elements. A "weight" can also be defined as an additional multiplier, coefficient, or number associated with individual terms or other factors in the scalar product of elements of the vector space in use.
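The one-hot encoding mentioned in [0018] can be illustrated as follows. The five-entry dictionary and the function names are hypothetical; a real dictionary would simply produce much longer vectors.

```python
from typing import Dict, List


def build_index(dictionary: List[str]) -> Dict[str, int]:
    """Number the dictionary entries starting from 1."""
    return {word: position for position, word in enumerate(dictionary, start=1)}


def one_hot(word: str, index: Dict[str, int]) -> List[int]:
    """Vector with 1 at the word's position and 0 everywhere else."""
    vector = [0] * len(index)
    vector[index[word] - 1] = 1   # positions are 1-based, list indices 0-based
    return vector


# Hypothetical five-entry dictionary, used only for illustration.
toy_dictionary = ["a", "abacus", "abandon", "abbey", "able"]
toy_index = build_index(toy_dictionary)
print(one_hot("abandon", toy_index))
# [0, 0, 1, 0, 0]
```

The vector length equals the dictionary size, which is why a 200,000-word dictionary gives 200,000-dimensional embeddings.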

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] The present invention is illustrated by examples, without any limitation; its essence becomes clear from the detailed description below considered together with the drawings, in which:

[0020] FIG. 1 schematically shows an example of a process for automatically obtaining a vector representation of a document.

[0021] FIG. 2 schematically shows an example of a process for obtaining multiple sets of clusters, with weights for each cluster in each set of clusters.

[0022] FIG. 3 schematically shows an example of a process for obtaining a set of m-skip-n-grams of words in which each m-skip-n-gram is fuzzily assigned to every cluster in every set of clusters.

[0023] FIG. 4 schematically shows an example of a process for extracting document features.

[0024] FIG. 5 schematically shows an example of mapping word clusters to positions of the document vector, for the case of two-dimensional word vectors and crisp assignment of each word to a single cluster.

[0025] FIG. 6 illustrates the general layout of a computing device.

IMPLEMENTATION OF THE INVENTION

[0026] This document describes methods for obtaining a vector representation of an electronic document for further processing, transmission, and storage. The invention can be applied to any natural languages, artificial languages, and sign systems; in the rest of this document all of these are referred to simply as "language".

[0027] Automatic processing, transmission, and storage of documents may include classifying source documents, clustering them, and other actions performed by comparing the vector representation of a document with the vector representation of another document, or of any set or group of documents.

[0028] FIG. 1 shows a general block diagram of the claimed method for obtaining the vector form of a document. At the first stage (101), a model of the placement of m-skip-n-grams by clusters is obtained. Such a model is a list of m-skip-n-grams, each of which is assigned to at least one of the clusters. An example of such a model for the particular case in which the m-skip-n-grams are symbols and words of the Russian language, partitioned into 1000 clusters, is shown in the table below:

Process of obtaining the model of m-skip-n-gram placement by clusters.

[0029] To form the model, the list of m-skip-n-grams to be used is determined first. In a particular case, this can be an explanatory dictionary of the language, dictionaries of word embeddings (Word2Vec, Glove, FastText and others), or a list of all words from the articles of the Wikipedia site. In that case the list of words must be large enough to cover a sufficient share of the words in the documents being analyzed. In another particular case, the list additionally undergoes word normalization, a process that produces lemmas (https://ru.wikipedia.org/wiki/Лемматизация), stems (https://ru.wikipedia.org/wiki/Стемминг), morphological roots, or morphological bases of words. In one particular case, uppercase and lowercase letters are not distinguished when forming m-skip-n-grams. In another particular case, uppercase letters are distinguished from lowercase ones, and their presence produces different m-skip-n-grams. In addition to words of natural, formal, and artificial languages, the list may include punctuation marks, digits, acronyms, common nouns, proper names, abbreviations, special characters (@";#№$%%^*()_}{⎢\ and others), mathematical symbols, and any other printable sequences of characters. In a particular case, any sequence of printable characters in a sign system can be considered once a word separator character has been introduced. The size of the list, the specific m-skip-n-grams, and the use of normalization are chosen based on the problem being solved, the available computing resources, and other factors.
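A minimal sketch of forming such a list from raw texts is shown below, restricted to single words and punctuation marks (0-skip-1-grams). The regular expression, the `fold_case` switch, and the `min_count` threshold are assumptions chosen for illustration; the text leaves all of these choices open.

```python
import re
from collections import Counter
from typing import Iterable, List

# A token is either a run of letters/digits/underscores or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str, fold_case: bool = True) -> List[str]:
    """Split text into words and punctuation marks."""
    if fold_case:
        text = text.lower()   # uppercase and lowercase are not distinguished
    return TOKEN_PATTERN.findall(text)


def build_gram_list(texts: Iterable[str],
                    min_count: int = 1,
                    fold_case: bool = True) -> List[str]:
    """Sorted list of tokens that occur at least `min_count` times."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text, fold_case))
    return sorted(gram for gram, count in counts.items() if count >= min_count)


print(tokenize("Key, key!"))                    # ['key', ',', 'key', '!']
print(tokenize("Key, key!", fold_case=False))   # ['Key', ',', 'key', '!']
print(build_gram_list(["Key, key!"], min_count=2))   # ['key']
```

With `fold_case=False`, "Key" and "key" stay separate entries, which matches the case-sensitive variant described in the text.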

[0030] Next, each m-skip-n-gram from the list is converted into a vector representation. The present description does not limit the method of obtaining the vector representation, its dimension, or its form. Methods based on algorithms and models such as Word2Vec, Glove, FastText, Universal Sentence Encoder, augmentation with a singular value decomposition transformation, and Tf-Idf can be used for this purpose. In one particular embodiment, the vector of an m-skip-n-gram is a concatenation of several vectors of that m-skip-n-gram obtained by different methods.
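The concatenation variant can be sketched as below. The two embedders are deliberately trivial stand-ins invented for this example; in practice they would be replaced by real models such as those named in the paragraph.

```python
from typing import Callable, List, Sequence

# An embedder maps an m-skip-n-gram (as a string) to a fixed-size vector.
Embedder = Callable[[str], Sequence[float]]


def concatenated_vector(gram: str, embedders: Sequence[Embedder]) -> List[float]:
    """Join the vectors produced by several methods into one vector."""
    combined: List[float] = []
    for embed in embedders:
        combined.extend(embed(gram))
    return combined


# Hypothetical stand-in embedders, not real language models.
def length_features(gram: str) -> List[float]:
    return [float(len(gram)), float(gram.count(" ") + 1)]


def vowel_features(gram: str) -> List[float]:
    return [float(sum(ch in "aeiou" for ch in gram))]


print(concatenated_vector("key word", [length_features, vowel_features]))
# [8.0, 2.0, 2.0]
```

The dimension of the result is the sum of the dimensions of the individual vectors.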

[0031] Next, the vectors of the m-skip-n-grams are clustered. The present description does not limit the method of clustering the list of m-skip-n-grams. Clustering algorithms such as K-Means, Affinity propagation, Mean-shift, Spectral clustering, Ward hierarchical clustering, Agglomerative clustering, DBSCAN, OPTICS, Gaussian mixtures, and Birch can be used. The end result of clustering is the assignment of each m-skip-n-gram to at least one cluster.
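As an illustration of this step, the sketch below runs a small pure-Python version of K-Means (Lloyd's algorithm) and turns the result into a placement model that maps each m-skip-n-gram to one cluster. Taking the first k vectors as initial centroids and the toy two-dimensional vectors are assumptions made so that the example stays deterministic; any of the listed algorithms could be used instead.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def k_means(vectors: Sequence[Vector], k: int,
            iterations: int = 20) -> Tuple[List[int], List[Vector]]:
    """Lloyd's algorithm; the first k vectors serve as initial centroids."""
    centroids = [tuple(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centroids.append(centroids[j])   # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional word vectors.
toy_vectors = {
    "river": (0.0, 0.1),
    "cipher": (5.0, 5.1),
    "stream": (0.2, 0.0),
    "password": (5.2, 4.9),
}
grams = list(toy_vectors)
labels, centroids = k_means([toy_vectors[g] for g in grams], k=2)
placement = dict(zip(grams, labels))   # the placement model: gram -> cluster
print(placement)
```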

[0032] Once the m-skip-n-gram placement model has been obtained, text documents are processed with it at stage (102). Applying the model yields the vector form of the document (103). If unprocessed documents remain, the next document is processed, and so on until every document has been processed.
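The processing of one document with a crisp placement model can be sketched as follows: occurrences are counted, summed per cluster, and the sums are placed in cluster order. Ignoring m-skip-n-grams that are absent from the model is an assumption of this sketch, and the placement dictionary is a hypothetical example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def document_vector(grams: Sequence[str],
                    placement: Dict[str, int],
                    n_clusters: int) -> List[int]:
    """Sum the occurrences of m-skip-n-grams per cluster, in cluster order."""
    counts = Counter(grams)            # occurrences of each gram in the document
    vector = [0] * n_clusters
    for gram, count in counts.items():
        cluster = placement.get(gram)
        if cluster is not None:        # assumption: unlisted grams are skipped
            vector[cluster] += count
    return vector


# Hypothetical placement model with two clusters.
toy_placement = {"river": 0, "stream": 0, "cipher": 1, "password": 1}
toy_document = ["river", "stream", "river", "cipher", "unknown"]
print(document_vector(toy_document, toy_placement, n_clusters=2))
# [3, 1]
```

Every document maps to a vector of the same length, equal to the number of clusters, regardless of the document's size.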

[0033] FIG. 2 shows an example of the steps for forming the model of placement of m-skip-n-grams by clusters (101). Here, a plurality of cluster sets is obtained, with a weight for each cluster in each set. First, a base list of m-skip-n-grams is selected, and then their vectors are obtained (1011). The present description does not limit the list, the form of the m-skip-n-grams, or the values of the parameters m and n in them. A distinctive feature of the invention is the very possibility of using not only individual words of the language but any m-skip-n-grams of words. Nor is the choice of n and m restricted: they may be chosen so that the resulting m-skip-n-grams coincide, for example, with individual words, word bigrams, word trigrams, and so on.
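To make the choice of m and n concrete, here is a minimal sketch of extracting m-skip-n-grams from a token list. It assumes one simple reading of the term, in which adjacent elements of the n-gram are separated by exactly m skipped tokens; the function name skip_ngrams and this fixed-gap reading are illustrative assumptions, and other definitions (for example, allowing up to m skips) are equally possible.

```python
from typing import List, Tuple

def skip_ngrams(tokens: List[str], n: int, m: int) -> List[Tuple[str, ...]]:
    """Return n-grams whose adjacent elements are separated by exactly m tokens.

    The fixed-gap reading is an assumption made for illustration only.
    """
    if n < 1 or m < 0:
        raise ValueError("n must be >= 1 and m must be >= 0")
    step = m + 1                 # distance between adjacent elements
    span = (n - 1) * step        # distance from the first to the last element
    return [
        tuple(tokens[i + k * step] for k in range(n))
        for i in range(len(tokens) - span)
    ]

tokens = "the minister will speak this week".split()
unigrams = skip_ngrams(tokens, n=1, m=0)      # individual words
bigrams = skip_ngrams(tokens, n=2, m=0)       # ordinary word bigrams
skip_bigrams = skip_ngrams(tokens, n=2, m=1)  # bigrams with one token skipped
```

With m = 0 the function reduces to ordinary n-grams, which matches the remark that suitable m and n reproduce single words, bigrams, and trigrams.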

[0034] This step can be implemented by using ready-made dictionaries (Word2Vec, GloVe, FastText, and others), by identifying all possible m-skip-n-grams, or only those of interest, exclusively within the class of text documents being analyzed (from an available database or training set), or in any other way. Next, a vector representation of the selected m-skip-n-grams is obtained by any method (algorithms and models such as Word2Vec, GloVe, FastText, Universal Sentence Encoder, augmentation with a singular value decomposition transformation, TF-IDF, and others) (1011). At step 1012, the vector representations of the m-skip-n-grams of words are clustered multiple times. This clustering is performed by any known method of clustering objects in which the required number of clusters is fixed in advance. A distinctive feature of the invention is that the clustering can be carried out several times. It can be performed for different numbers of clusters by the same clustering algorithm; for the same number of clusters by the same clustering algorithm but with different initializations; for the same number of clusters by different clustering algorithms; or in any other way that yields different sets of clusterings of the objects.
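The repeated clustering of step 1012 can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses a deliberately small hand-written K-Means, toy two-dimensional vectors, and hypothetical names (kmeans, configs). Each configuration varies either the number of clusters or the initialization seed.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]

def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points: List[Vector], k: int, seed: int, iters: int = 20) -> List[int]:
    """Very small K-Means; returns a cluster label for every point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for each point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-D vectors standing in for m-skip-n-gram embeddings (assumed data).
vectors = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0), (9.1, 0.2)]

# Each configuration is (number of clusters, initialization seed).
configs = [(2, 0), (3, 0), (3, 1)]
clusterings = [kmeans(vectors, k, seed) for k, seed in configs]
```

Each element of clusterings is one partition of the same list, so the three results together form a small plurality of cluster sets.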

[0035] Thus, a plurality of cluster sets K = {k₁, k₂, …, k_(N−1), k_N} is formed (step 1013). Each cluster C_ij, that is, the j-th cluster in the i-th partition into clusters, where i = 1, …, N and j = 1, …, M_i (M_i being the number of clusters in the i-th partition), is assigned its own weight coefficient q_ij, which can characterize the significance of this cluster for the vector representation of the document (step 1014). The description of the invention does not limit the methods for obtaining the weight coefficients q_ij or their values. A distinctive feature of the invention is the very possibility of using weight coefficients for clusters, which makes it possible to characterize the significance of each cluster for the vector representation of the document. One possible implementation is to identify "garbage" clusters and exclude each such cluster, for example by setting its weight to zero. A "garbage" cluster may be a cluster with a high proportion of stop words, common words, or words that carry no information for the specific text analysis task being solved. The values of the weight coefficients can be obtained by automated calculation, determined by expert assessment, or obtained in any other way. It is possible that all clusters are equally significant or, as noted above, that some clusters are excluded from use altogether. The result of this step is a plurality of cluster sets with a weight for each cluster in each set (step 1015).
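One way to zero out "garbage" clusters can be sketched as below. The threshold max_stop_share, the binary 0/1 weights, and the function name cluster_weights are assumptions chosen for illustration; the description leaves the method of obtaining the weights open.

```python
from typing import Dict, List, Set

def cluster_weights(clusters: Dict[int, List[str]], stop_words: Set[str],
                    max_stop_share: float = 0.5) -> Dict[int, float]:
    """Give weight 0.0 to clusters dominated by stop words and 1.0 to the rest."""
    weights: Dict[int, float] = {}
    for cluster_id, members in clusters.items():
        # Share of the cluster's members that are stop words.
        share = sum(w in stop_words for w in members) / len(members) if members else 0.0
        weights[cluster_id] = 0.0 if share > max_stop_share else 1.0
    return weights

# Toy clusters and stop-word list (assumed data).
clusters = {0: ["minister", "president", "deputy"], 1: ["on", "this", "at", "week"]}
stop_words = {"on", "this", "at", "the"}
weights = cluster_weights(clusters, stop_words)
```

Here three of the four members of cluster 1 are stop words, so its weight becomes zero and it no longer contributes to the document vector.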

[0036] FIG. 3 shows an example of the steps for obtaining a fuzzy assignment of words from the list to clusters. For all or some of the words in the dictionary, clustering is performed N times (step 1013) as follows: for each clustering, a particular number of clusters may be chosen; a clustering method is selected that can cluster objects into a specified or unspecified number of clusters; and the clustering is performed. Each clustering is denoted k_i, where i = 1, …, N. The overall plurality of cluster sets is then denoted K = {k₁, k₂, …, k_(N−1), k_N}. Each i-th clustering k_i is a set of M_i clusters. Each j-th cluster in the i-th partition into clusters is denoted C_ij, where i = 1, …, N and j = 1, …, M_i. Next, for this plurality of cluster sets K, indicators of word-to-cluster assignment are selected (step 201). Such an indicator makes it possible to define a fuzzy distribution of words over clusters. One such indicator is the distance from the m-skip-n-gram to the center of the cluster. This distance can be normalized by the distances to the centers of all clusters, or by the distances to only a few clusters (for example, the nearest ones). Another possible embodiment is to find a fixed number of nearest objects under a crisp (hard) clustering and to calculate the share of objects from each cluster among them. The description of the invention does not limit in any way the method, measure, or indicator used to obtain a fuzzy assignment of an object to a cluster. A distinctive feature of the invention is the ability to use a fuzzy assignment of m-skip-n-grams to clusters. Thus, for each s-th word used, an indicator w^(s)_ij is determined that characterizes the degree to which the s-th word belongs to cluster C_ij (step 202).
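The two indicators mentioned above can be sketched as follows. Both functions are illustrative assumptions: the first turns distances to cluster centers into weights by taking inverse distances and normalizing them over all clusters (the inverse is one possible choice, not stated in the description), and the second computes the share of each cluster among the k nearest hard-clustered objects.

```python
import math
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]

def membership_by_center_distance(vec: Vector, centers: List[Vector],
                                  eps: float = 1e-9) -> List[float]:
    """Inverse distance to every cluster center, normalized to sum to 1."""
    inverse = [1.0 / (math.dist(vec, c) + eps) for c in centers]
    total = sum(inverse)
    return [v / total for v in inverse]

def membership_by_neighbors(vec: Vector, points: List[Vector], labels: List[int],
                            n_clusters: int, k: int) -> List[float]:
    """Share of each cluster among the k nearest hard-clustered objects."""
    order = sorted(range(len(points)), key=lambda i: math.dist(vec, points[i]))
    nearest = order[:k]
    counts = Counter(labels[i] for i in nearest)
    return [counts[c] / len(nearest) for c in range(n_clusters)]

# Toy data (assumed): two cluster centers and four already clustered objects.
centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.0, 0.0), (0.5, 0.0), (4.0, 0.0), (4.5, 0.0)]
labels = [0, 0, 1, 1]
new_vec = (1.0, 0.0)

by_centers = membership_by_center_distance(new_vec, centers)
by_neighbors = membership_by_neighbors(new_vec, points, labels, n_clusters=2, k=3)
```

In both cases the resulting values play the role of the indicators w^(s)_ij for one clustering: they are non-negative and sum to one across the clusters of that clustering.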

[0037] Each word can be assigned to any number of clusters. Such a fuzzy partition into clusters can be obtained, for example, by the fuzzy "C-means" clustering method or by expert assessment. The indicators may also be chosen so that each word belongs to only one cluster, for example when the assignment indicator equals one for a single cluster (not necessarily the one closest to the word's vector) and zero for all others. In some cases the results of such a choice will amount to a crisp clustering, but embodiments of the invention are possible in which an m-skip-n-gram is not always assigned to its closest cluster. That is, the results of such a choice of assignment indicators will not, in general, necessarily match the results produced by crisp word clustering algorithms. The description of the invention does not limit the number of partitions into clusters, which obviously must be greater than zero. The description of the invention does not limit the metric or the mathematical spaces used for clustering and for assigning words to clusters.

[0038] Steps (201) and (202) share their preceding steps with steps (1014) and (1015), namely steps (1011), (1012), and (1013). The description of the invention does not limit how these steps are executed relative to one another or how they are used in combination; it only provides explanations of them.

[0039] An example of step (102) in detail is shown in FIG. 4. The input to step (102) is a document containing data in text form. In a particular embodiment of the method, the m-skip-n-grams that are present both in the document and in the m-skip-n-gram list of the placement model are determined first (step 1021).

[0040] At step (1021), the number of occurrences in the document of each m-skip-n-gram from the generated list is counted. The clusters to which these m-skip-n-grams belong are then determined, and the number of m-skip-n-grams from the document within each cluster is counted.

For example, take the sentence «министр выступит на этой неделе» ("the minister will speak this week"). According to the example in Table 1, cluster 457 occurs once («министр», "minister"), cluster 537 once («выступит», "will speak"), cluster 737 twice («на», "on", and «этой», "this"), and cluster 368 once («неделе», "week"). The resulting document vector then has the value 1 at positions 457, 537, and 368, the value 2 at position 737, and the value 0 at all other positions. In another particular embodiment of the method, this vector is subsequently normalized. In the example considered, m-skip-n-grams from the list occur 5 times. The normalized vector then has the value 0.2 at positions 457, 537, and 368, the value 0.4 at position 737, and the value 0 at all other positions.
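The arithmetic of this example can be reproduced with a short sketch. The word-to-cluster mapping is taken from the example itself; the sparse dictionary representation (cluster number mapped to value, with absent clusters implicitly zero) is an assumption made for brevity.

```python
from collections import Counter

# Cluster assignments as given in the example above.
word_to_cluster = {"министр": 457, "выступит": 537, "на": 737, "этой": 737, "неделе": 368}

sentence = "министр выступит на этой неделе"

# Count how many words of the sentence fall into each cluster.
counts = Counter(word_to_cluster[w] for w in sentence.split() if w in word_to_cluster)

# Normalize by the total number of listed words found in the sentence.
total = sum(counts.values())
normalized = {cluster: n / total for cluster, n in counts.items()}
```

Positions that do not appear in counts or normalized correspond to the zero entries of the full document vector.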

[0041] Next, the document vector is formed. Each position in this vector corresponds to a specific cluster. First, an m-skip-n-gram is taken from the document, and the cluster to which it belongs is determined from the generated m-skip-n-gram placement model. The value at the position of the document vector corresponding to that cluster is then increased by the number of occurrences of this m-skip-n-gram in the document (step 1022). In one particular embodiment of the method, the value at the vector position is increased not by the number of occurrences of the m-skip-n-gram but by the product of that number and the weight of the m-skip-n-gram for this cluster (step 1022). At the next step (1023), the values obtained at the positions of the document vector are multiplied by the corresponding cluster weights. The result is the vector representation of the document (step 1024).
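Steps 1022 and 1023 can be sketched as follows. This is a minimal illustration under assumed data structures: gram_cluster_weights maps each m-skip-n-gram to its per-cluster weights, and cluster_weights holds one weight per vector position. Both names and all values are hypothetical.

```python
from collections import Counter
from typing import Dict, List

def document_vector(grams: List[str],
                    gram_cluster_weights: Dict[str, Dict[int, float]],
                    cluster_weights: List[float]) -> List[float]:
    """Build a document vector with one position per cluster."""
    vector = [0.0] * len(cluster_weights)
    for gram, count in Counter(grams).items():
        # Step 1022: add count times the gram's weight for each of its clusters.
        for cluster, weight in gram_cluster_weights.get(gram, {}).items():
            vector[cluster] += count * weight
    # Step 1023: scale every position by the weight of its cluster.
    return [value * q for value, q in zip(vector, cluster_weights)]

# Toy model (assumed): "bank" is split between clusters 0 and 1,
# and cluster 2 is a stop-word cluster with zero weight.
gram_cluster_weights = {
    "bank": {0: 0.5, 1: 0.5},
    "river": {1: 1.0},
    "the": {2: 1.0},
}
cluster_weights = [1.0, 2.0, 0.0]
doc_vec = document_vector(["bank", "river", "bank", "the"],
                          gram_cluster_weights, cluster_weights)
```

Setting every per-gram weight to 1.0 for a single cluster and every cluster weight to 1.0 reduces this to the plain counting variant.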

[0042] In accordance with one or more embodiments of the present invention, an example of a method for automatically obtaining a vector representation of an electronic document may include the following steps. A vector is chosen whose dimension equals the total number of clusters obtained across all clusterings. This vector is initialized with arbitrary initial values. The description of the invention does not limit the values used to initialize the vector.

[0043] Zero values, for example, can also be used. Each position of the vector corresponds strictly to a particular cluster (FIG. 5). Words are extracted from the document. This can be done, for example, by a sequential pass through the document, by using some representation of the document that already contains the extracted words and their counts, or in any other way. For each word, the word-to-cluster assignment weights for each cluster in each set are used (step 202).

[0044] As an example, and without limiting the embodiments of the invention, consider sequential extraction of words from the text. A word is obtained from the text. Using the weights of the word's assignment to each cluster, the positions in the document vector whose values need to be increased are found. The values at these positions are increased by some amount. This amount can be fixed, or it can vary depending on conditions during document processing. The amount may already incorporate the weight of the word's assignment to each cluster.

[0045] In another variant of the invention, the value at a position in the vector is increased by a fixed amount multiplied by the word's assignment weight. The description of the present solution does not limit the methods for changing the values at the vector positions associated with a given word. The fuzzy assignment of the word to different clusters can also be taken into account in the final change to the values in the vector. The amounts by which values change are not limited and, in general, may even be negative. To calculate the amounts by which the values at the vector positions change, various methods can be used that also take any other characteristics of the words into account. One example is the Term Frequency - Inverse Document Frequency (TF-IDF) method, or simply word frequency. The cluster weight, which characterizes the significance of the cluster for the vector representation of the document, can also be taken into account when calculating these amounts. By passing through the entire text in this way, the corresponding positions in the document vector are increased. The result of the pass can be the vector form of the document.

[0046] The present description does not limit the ways of using the cluster weights, which characterize the significance of each cluster for the vector representation of the document, or the word assignment weights for each cluster in each set. The claimed solution makes it possible to use these weights when composing the vector form of a document. The values obtained can be used as the vector form of the document, or any set or subset of them can be used in data processing, on their own or together with other data, to obtain a new vector form of the document. An example of such combined use is concatenation with a TF-IDF vector. In this case, a vector representation of the document is built using the TF-IDF method, a document vector is built by the claimed method, and the two vectors are concatenated (joined end to end). The result of the concatenation is the document vector. Further algebraic transformations of the resulting document vector are also possible.
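The combined use with TF-IDF can be sketched as follows. The TF-IDF variant shown (relative term frequency times the logarithm of N over document frequency) is one common formulation chosen here as an assumption, and cluster_vec is a made-up stand-in for a vector produced by the cluster-based method.

```python
import math
from collections import Counter
from typing import List

def tfidf_vector(doc: List[str], corpus: List[List[str]], vocab: List[str]) -> List[float]:
    """Plain TF-IDF: relative term frequency times log(N / document frequency)."""
    n_docs = len(corpus)
    counts = Counter(doc)
    vec: List[float] = []
    for term in vocab:
        df = sum(term in d for d in corpus)          # documents containing the term
        tf = counts[term] / len(doc) if doc else 0.0
        vec.append(tf * math.log(n_docs / df) if df else 0.0)
    return vec

# Toy corpus and vocabulary (assumed data).
corpus = [["minister", "speaks"], ["minister", "week"], ["week", "ends"]]
vocab = ["minister", "speaks", "week", "ends"]

# Assumed output of the cluster-based method for corpus[0].
cluster_vec = [0.5, 0.5, 0.0]

# Concatenation of the TF-IDF vector and the cluster-based vector.
doc_vec = tfidf_vector(corpus[0], corpus, vocab) + cluster_vec
```

The concatenated vector keeps both views of the document side by side, so a downstream model can draw on exact term statistics as well as cluster-level information.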

[0047]

[0048] In one particular embodiment of the method, a fuzzy partition of the list of m-skip-n-grams into clusters is used. In this variant, each m-skip-n-gram is assigned to more than one cluster. Each m-skip-n-gram has its own degree of assignment to a cluster, depending on its distance to that cluster. Both the coordinates of the cluster center and the coordinates of m-skip-n-grams from this and other clusters can be used to calculate this distance. In a special case of this clustering method, an m-skip-n-gram is assigned to all clusters. An example of an algorithm that implements this particular clustering variant is C-Means.
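For reference, the membership step of the standard fuzzy C-means algorithm can be sketched as below. It computes only the degrees of membership for fixed cluster centers, not the full iterative algorithm; the fuzziness exponent of 2.0, the toy centers, and the function name are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]

def fcm_memberships(vec: Vector, centers: List[Vector],
                    fuzziness: float = 2.0) -> List[float]:
    """Fuzzy C-means membership of one vector in every cluster (fuzziness > 1)."""
    dists = [math.dist(vec, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The vector coincides with one or more centers: share membership among them.
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if hit else 0.0 for hit in hits]
    exponent = 2.0 / (fuzziness - 1.0)
    # u_j = 1 / sum_k (d_j / d_k) ** (2 / (fuzziness - 1))
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]

# Toy centers (assumed data): the vector lies at distance 1 and 3 from them.
centers = [(0.0, 0.0), (4.0, 0.0)]
memberships = fcm_memberships((1.0, 0.0), centers)
```

Every cluster receives a non-zero membership unless the vector sits exactly on a center, which corresponds to the special case in which an m-skip-n-gram is assigned to all clusters.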

[0049] In one particular embodiment, each m-skip-n-gram belongs to several clusters. An m-skip-n-gram may belong to one or more clusters to the same degree.

[0050] In another particular embodiment, each m-skip-n-gram has a weight characterizing its assignment to a given cluster. Unlike the previous particular variants, here the weight may characterize more than just the proximity of the m-skip-n-gram to the cluster. For example, cluster density can be taken into account when calculating this weight, so that the m-skip-n-gram is assigned to a greater degree to a nearby cluster that is less dense. The weight is a multiplier applied for each m-skip-n-gram with each cluster. The weight can also be zero.

[0051] In yet another particular embodiment, the list of m-skip-n-grams is clustered by their vector representations more than once. In this variant, clustering is performed several times for different numbers of clusters by the same clustering algorithm. The resulting document vector is a concatenation of the vectors for each of the clusterings (see the example in FIG. 5). In another case, different clustering algorithms are used to partition the list into the same number of clusters or into different numbers of clusters. In yet another case, the same number of clusters and the same algorithm are used but with different initializations, provided the algorithm can produce different results for different initializations (K-Means, C-Means, Spectral Clustering, Gaussian mixtures, and others). The number of clusterings, the number of clusters in each clustering, and the clustering algorithm are chosen depending on the problem being solved. In one particular case, if the list contains more than 10,000 m-skip-n-grams, partitions into the following numbers of clusters can be used: 50, 100, 200, 300, 500, 700, 1000, 1500, 2000, 3000, 5000. In another particular case, separate lists are compiled, for example for 0-skip-1-grams (single words), 0-skip-2-grams (word bigrams), and 0-skip-3-grams (word trigrams). Each of these lists is clustered separately several times. The final document vector is the concatenation of the vectors for each clustering of each list.
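The concatenation of per-clustering vectors can be sketched as follows, assuming hard assignments for brevity. The two clusterings, their sizes, and the function name concat_cluster_counts are hypothetical.

```python
from typing import Dict, List

def concat_cluster_counts(grams: List[str], clusterings: List[Dict[str, int]],
                          sizes: List[int]) -> List[int]:
    """Count cluster hits per clustering and concatenate the per-clustering vectors."""
    vector: List[int] = []
    for assignment, size in zip(clusterings, sizes):
        part = [0] * size                  # one position per cluster of this clustering
        for gram in grams:
            if gram in assignment:
                part[assignment[gram]] += 1
        vector.extend(part)
    return vector

# Two hypothetical clusterings of the same list: into 2 clusters and into 3 clusters.
clustering_a = {"minister": 0, "speaks": 1, "week": 1}
clustering_b = {"minister": 2, "speaks": 0, "week": 1}
doc = ["minister", "speaks", "week", "week"]
vec = concat_cluster_counts(doc, [clustering_a, clustering_b], sizes=[2, 3])
```

The length of the result equals the total number of clusters across all clusterings, which matches the vector dimension described for the document vector.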

[0052] In another particular embodiment, weights are used for each cluster of m-skip-n-grams in each clustering (if there are several). These weights characterize the significance of the clusters for the vector representation of the document. The weights can also be zero, in which case the occurrence count of m-skip-n-grams in that cluster is set to zero. This approach is useful when clusters containing stop words must be excluded from the vector, that is, words that carry no thematic meaning (such as the Russian words и, к, у, о, при, на, and others). In particular cases, the cluster weights can be calculated by machine learning methods or set by expert assessment.

[0053] The set of m-skip-n-grams can also be extended with new m-skip-n-grams. In this variant, a method for calculating the vector of an m-skip-n-gram is selected. If the document contains an m-skip-n-gram that is not in the list, a vector representation (vector) is calculated for it. This vector is then matched against the clusters obtained at the stage of clustering the list of m-skip-n-grams, and the new m-skip-n-gram is assigned to at least one cluster. The subsequent steps of the method are the same as when the m-skip-n-gram is already present in the model of placement of m-skip-n-grams by clusters.

[0054] The list of used m-skip-n-grams can be formed based on the occurrence of m-skip-n-grams in text data obtained from external data sources. Any external body of text data can serve as such a set of texts. This can be, for example, a set of analyzed texts on various topics collected, for example, from news sites. Processing, classification and clustering of the collected information make it possible to find semantic similarities and analogues, to rank search results, or to solve any other language processing problem. The claimed solution does not restrict the source or nature of the set of text data used.

[0055] An example of this variant is the task of classifying text documents. In that case, the list of m-skip-n-grams is formed from the initial set of documents. For example, all single words and bigrams that occur in these texts are taken, or the 200,000 most frequent bigrams. Vector representations are calculated for the items in these lists. These vectors are then passed to clustering, which is performed by the methods described above.
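Forming such a list from a document set might look like the sketch below. It assumes a naive lowercase whitespace tokenizer, which the description does not specify; the function name `build_ngram_list` and the sample documents are hypothetical.

```python
from collections import Counter

def build_ngram_list(documents, max_bigrams):
    """Collect all single words plus the most frequent word bigrams."""
    unigrams = set()
    bigram_counts = Counter()
    for text in documents:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        unigrams.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    top_bigrams = [bigram for bigram, _ in bigram_counts.most_common(max_bigrams)]
    return sorted(unigrams), top_bigrams

docs = ["new malware found", "new malware spreads", "patch released"]
words, bigrams = build_ngram_list(docs, max_bigrams=1)
print(words)
print(bigrams)
```

In the example from the text, `max_bigrams` would be 200000; the vectors for the resulting items would then be computed and clustered.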

[0056] When implementing the present invention, all the described variants can be used in any possible combination. An example of such a particular variant is the case in which all single words, word bigrams and word trigrams occurring in the available data set are selected. Vectors of the word n-grams are calculated for each list. Each list of vectors is then clustered multiple times. In each clustering, a weight is selected for each n-gram that characterizes its assignment to a cluster. Each analyzed document is then processed: from the occurrence of n-grams, a vector is obtained for each clustering. The values in the clusters are multiplied by the cluster weights. Each such vector is normalized. All vectors are concatenated into a single vector. The description of the patent does not limit the combinations in which the various approaches may be used.
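The combined variant can be sketched as follows. The sketch assumes Euclidean (L2) normalization, which the description does not specify, and uses hard-coded membership weights in place of a real fuzzy clustering; all names (`clustering_vector`, `memberships_a`, `memberships_b`) are hypothetical.

```python
import math
from collections import Counter

def clustering_vector(tokens, memberships, cluster_weights):
    """Build one normalized vector for one clustering.

    memberships: n-gram -> {cluster index: membership weight}
    cluster_weights: per-cluster significance weights
    """
    vector = [0.0] * len(cluster_weights)
    for ngram, freq in Counter(tokens).items():
        for cluster, membership in memberships.get(ngram, {}).items():
            vector[cluster] += freq * membership
    # Multiply the per-cluster values by the cluster weights.
    vector = [v * w for v, w in zip(vector, cluster_weights)]
    # L2 normalization (an assumption; other norms are possible).
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector

# Two toy clusterings of the same words; "bank" is split between two clusters
# in the first one to keep both of its meanings.
memberships_a = {"bank": {0: 0.5, 1: 0.5}, "loan": {0: 1.0}, "river": {1: 1.0}}
memberships_b = {"bank": {0: 1.0}, "loan": {0: 1.0}, "river": {1: 1.0}}

doc = ["bank", "bank", "loan", "river"]
part_a = clustering_vector(doc, memberships_a, [1.0, 1.0])
part_b = clustering_vector(doc, memberships_b, [1.0, 0.0])
final_vector = part_a + part_b  # concatenation into the common vector
print(final_vector)
```

Normalizing each part before concatenation keeps clusterings with many clusters from dominating those with few.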

[0057] A distinctive feature of the invention is the ability to take previously unknown m-skip-n-grams into account. Without limiting the invention, the following approach can be used for new m-skip-n-grams. For example, when traversing the text of the document, if an m-skip-n-gram is found that is not in the dictionary in use, a vector representation of this m-skip-n-gram is formed. The m-skip-n-gram is then assigned to the cluster whose center is closest to it.
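Assignment of a new m-skip-n-gram to the nearest cluster center can be sketched as below. The sketch assumes Euclidean distance, which the description does not name, and toy two-dimensional vectors; the function name `nearest_cluster` is hypothetical.

```python
import math

def nearest_cluster(vector, centroids):
    """Return the index of the centroid closest to the vector (Euclidean distance)."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))

# Toy cluster centers from a previously built model (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
new_ngram_vector = [0.9, 1.2]
cluster = nearest_cluster(new_ngram_vector, centroids)
print(cluster)
```

Once the cluster index is known, the new m-skip-n-gram is counted exactly like one that was in the list from the start.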

[0058] The claimed method (100) can be used, in particular, for searching for and selecting relevant information, for example, news related to a given topic. The method (100) can be used as part of an automated cybersecurity management platform. News collection is an important function for finding information potentially related to cybersecurity in various sources (thematic news sites, social networks, messenger groups, etc.). News is collected by various known mechanisms (RSS subscriptions, etc.).

[0059] Interaction between the TIP and the NLP model is carried out via an API, by sending HTTP GET and HTTP PATCH requests. The body of a news item is received from the TIP in JSON format, after which the news item is preprocessed. The next step is classification; if the news item is assigned to the "applicable" class, clustering follows. The result, that is, the class and cluster to which the news item belongs, is transmitted to the TIP by sending an HTTP PATCH request.
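The request flow can be sketched as follows. The description gives no endpoint or payload schema, so the URL `BASE_URL` and the JSON field names `class` and `cluster` are purely hypothetical; the sketch only builds the requests and does not send them.

```python
import json
import urllib.request

BASE_URL = "https://tip.example.invalid/api/news"  # hypothetical endpoint

def build_get_request(news_id):
    """Prepare an HTTP GET request for the JSON body of one news item."""
    return urllib.request.Request(f"{BASE_URL}/{news_id}", method="GET")

def build_patch_request(news_id, news_class, cluster_id):
    """Prepare an HTTP PATCH request reporting the class and cluster."""
    payload = {"class": news_class, "cluster": cluster_id}  # hypothetical fields
    return urllib.request.Request(
        f"{BASE_URL}/{news_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )

request = build_patch_request(42, "applicable", 3)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```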

[0060] The NLP model performs classification and clustering of news applicable to cybersecurity. The classification task is solved using the method of vector representation of texts disclosed in the description. News items of interest for cybersecurity purposes are then clustered into a limited set of groups (for example, news about new vulnerabilities, the release of security updates, the emergence of new malware, etc.). Clustering uses the vector representation of the news text disclosed in the description. News items are assigned to clusters by calculating a distance between their vector representations; if the resulting value is below a fixed or adaptive threshold (for example, 0.5), the news item is assigned to one or more clusters.
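The threshold rule can be sketched as follows. The description speaks only of "a distance", so cosine distance is an assumption here, as are the representative vector for each cluster and the names `cosine_distance` and `matching_clusters`.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def matching_clusters(news_vector, cluster_vectors, threshold=0.5):
    """Return every cluster whose representative lies closer than the threshold."""
    return [
        name
        for name, representative in cluster_vectors.items()
        if cosine_distance(news_vector, representative) < threshold
    ]

# Toy representatives for three news groups (hypothetical values).
cluster_vectors = {
    "vulnerabilities": [1.0, 0.0, 0.0],
    "security updates": [0.0, 1.0, 0.0],
    "malware": [0.0, 0.0, 1.0],
}
news = [1.0, 1.0, 0.0]
print(matching_clusters(news, cluster_vectors))
```

Because every cluster below the threshold is returned, one news item can belong to several clusters, as the paragraph allows.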

[0061] FIG. 6 shows a general view of a computing device (400) suitable for implementing the claimed solution. The device (400) may be, for example, a server or another type of computing device that can be used to implement the claimed technical solution, including as part of a cloud computing platform.

[0062] In general, the computing device (400) contains one or more processors (401), memory such as RAM (402) and ROM (403), input/output interfaces (404), input/output devices (405), and a networking device (406), all connected by a common data bus.

[0063] The processor (401) (or multiple processors, or a multi-core processor) may be selected from the range of devices currently in wide use, for example, from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, and the like. A graphics processor, for example, from Nvidia, AMD, Graphcore, etc., can also be used as the processor (401).

[0064] The RAM (402) is random access memory and is intended to store machine-readable instructions executed by the processor (401) to perform the necessary logical data processing operations. The RAM (402) typically contains executable instructions of the operating system and of the relevant software components (applications, program modules, etc.).

[0065] The ROM (403) is one or more persistent data storage devices, for example, a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0066] Various types of I/O interfaces (404) are used to support the operation of the components of the device (400) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they can be, without limitation: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0067] Various I/O means (405) are used for user interaction with the computing device (400), for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality devices, optical sensors, a tablet, indicator lights, a projector, a camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

[0068] The networking means (406) enables the device (400) to transmit data over an internal or external computer network, for example, an intranet, the Internet, a LAN, and the like. One or more of the following, without limitation, can be used as the means (406): an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0069] Additionally, satellite navigation means, for example, GPS, GLONASS, BeiDou or Galileo, can be used as part of the device (400).

[0070] Thus, the claimed solution achieves the following advantages:

• Better preservation of semantic meaning by clustering m-skip-n-grams of words rather than only individual words, as is done in known patents.

• Higher accuracy of text representation through latent topics by clustering the same m-skip-n-grams of words multiple times, in contrast to known patents, none of which indicates that multiple or repeated clustering can be performed. In the approach of the present patent, for example, clustering can be performed into 100, 200, 357, 500, 500 (again 500, but with a new initialization), 1000, 5000 and 10000 clusters.

• Preservation of the different meanings of words by assigning words to several clusters, which retains the ambiguity of terms, in contrast to known patents, where each word is assigned to only one (the nearest) cluster.

• The use of cluster weights (a priori or a posteriori) makes it possible to reduce the significance of clusters of common words or to exclude them entirely, thereby improving the quality of analysis, in particular of calculating the similarity of two texts, in contrast to known patents, which use no cluster weights.

• Higher accuracy of analysis through the processing of new, previously unknown words.

• Wider applicability of the resulting vector representation, including direct use of the resulting vector, without training classifiers, for tasks such as identifying semantic similarities, searching for analogues, ranking search results, and others.

[0071] The submitted application materials disclose preferred examples of implementing the technical solution and should not be construed as limiting other, particular examples of its implementation that do not go beyond the scope of the legal protection sought and that are obvious to specialists in the relevant field of technology.

Claims

1. A computer-implemented method for obtaining a vector representation of an electronic document, which is performed using a processor and contains the following steps:

- at least one model of placing m-skip-n-grams in clusters is formed, while the m-skip-n-gram represents at least a single word, and when forming the said model, the following is carried out:

- definition of the list of used m-skip-n-grams;

- conversion to vector representation of each m-skip-n-gram from the list;

- clustering of m-skip-n-grams according to their vector representations;

- processing at least one text document using the obtained m-skip-n-gram placement model, during which:

- carry out the calculation of the occurrence of m-skip-n-grams in a text document;

- determine the text document clusters based on the occurrence of m-skip-n-grams;

- summarize the number of occurrences of m-skip-n-grams from each cluster;

- form a vector representation of the text document based on the ordered sequence of sums of m-skip-n-grams.

2. The method according to claim 1, characterized in that fuzzy partitioning of the list of m-skip-n-grams into clusters is used.

3. The method according to claim 1, characterized in that each m-skip-n-gram belongs to several clusters.

4. The method according to claim 1, characterized in that each m-skip-n-gram has a weight characterizing its assignment to a given cluster.

5. The method according to claim 1, characterized in that the clustering of m-skip-n-grams according to their vector representations is performed more than once.

6. The method according to claim 1, characterized in that weights are additionally used for clusters of m-skip-n-grams, characterizing the significance of clusters for the vector representation of the document.

7. The method according to claim 1, characterized in that, additionally, vector representations of m-skip-n-grams not included in the list are obtained, and these m-skip-n-grams are assigned, according to their vector representations, to at least one cluster.

8. The method according to claim 1, characterized in that the list or part of the list of used m-skip-n-grams is formed based on the occurrence of m-skip-n-grams in text data obtained from external data sources.

9. A system for obtaining a vector representation of an electronic document, comprising at least one processor and at least one memory storing machine-readable instructions that, when executed by the processor, implement the method according to any one of claims 1-8.

10. A computer-implemented method for searching for sources of information, performed using a processor and containing the steps at which:

- forming at least one model for placing m-skip-n-grams in clusters on a given topic, while the m-skip-n-gram represents at least a single word, and when forming the said model, the following is carried out:

- definition of the list of used m-skip-n-grams;

- conversion to vector representation of each m-skip-n-gram from the list;

- clustering of m-skip-n-grams according to their vector representations;

- receive at least one source of information in the form of an electronic text document;

- processing the received text document using the obtained m-skip-n-gram placement model, during which:

- summarize the number of occurrences of m-skip-n-grams from each cluster;

- form a vector representation of a text document on the basis of an ordered sequence of sums of m-skip-n-grams;

- determine whether the vector representation of the document belongs to a given topic.