RU2702978C1 - Bayesian sparsification of recurrent neural networks

Info

Publication number: RU2702978C1
Application number: RU2018136250A
Authority: RU
Inventors: Екатерина Максимовна Лобачева; Надежда Александровна ЧИРКОВА; Дмитрий Петрович ВЕТРОВ
Original assignee: Самсунг Электроникс Ко., Лтд.
Priority date: 2018-10-15
Filing date: 2018-10-15
Publication date: 2019-10-14

Abstract

FIELD: data processing.

SUBSTANCE: the invention relates to the field of artificial intelligence and, in particular, to recurrent neural networks (RNNs). Disclosed is a new Bayesian sparsification method for gated recurrent architectures that takes into account their recurrent specifics and gate mechanism. The proposed method removes neurons from the associated model and makes gates constant, which provides not only compression of the network but also a significant speed-up of the forward pass. On discriminative tasks, the method provides extreme compression of LSTM models, so that only a small number of input and hidden neurons remain, with an insignificant reduction in quality. Such a small model is easy to interpret.

EFFECT: high compression ratio.

19 cl, 2 dwg, 2 tbl

Description

FIELD OF THE INVENTION

The present invention relates generally to the field of artificial intelligence and, in particular, to recurrent neural networks (RNNs), and more specifically to new Bayesian sparsification methods for gated recurrent architectures.

BACKGROUND OF THE INVENTION

1. Introduction

Recurrent neural networks (RNNs) are among the most powerful models for natural language processing, speech recognition, question-answering systems, and other tasks involving sequential data ([3], [1], [8], [27], [23]). For complex tasks such as machine translation or speech recognition [1], modern RNN architectures involve a huge number of parameters. To use these models on portable devices with limited memory, such as smartphones, it is desirable to compress the model. High compression levels can also speed up the RNN. In addition, compression regularizes the RNN and helps avoid overfitting.

Reducing the size of RNNs is an important and rapidly developing area of research. There are many RNN compression methods based on special representations of the weight matrices ([25], [16]) or on sparsification by pruning [20], in which RNN weights are cut off by some threshold. Narang et al. [20] choose such a threshold using several hyperparameters that control the frequency, rate, and duration of weight removal. Wen et al. [2] proposed pruning weights in long short-term memory (LSTM) RNN models in groups corresponding to each neuron, which made it possible to speed up the forward pass through the network.

The present invention focuses on compressing RNNs by means of sparsification. Most methods in this group are heuristic and require lengthy hyperparameter tuning.

In a recent work, Molchanov et al. [18] proposed a theoretically grounded method for sparsifying fully connected and convolutional networks based on variational dropout. In [18], a probabilistic model is proposed in which the parameters controlling sparsity are tuned automatically during training of the neural network. This model, called Sparse Variational Dropout (hereinafter SparseVD), makes it possible to obtain extremely sparse solutions without a significant drop in the quality of the final model. In [4], an extension of this method is presented that allows groups of weights to be removed from the model in a Bayesian way. However, these methods have not yet been studied in application to RNNs.

SUMMARY OF THE INVENTION

According to the present invention, a new Bayesian sparsification method for gated recurrent architectures is proposed that takes into account their recurrent specifics and gate mechanism. In this invention, neurons are removed from the associated model and gates are made constant, which provides not only compression of the network but also a significant speed-up of the forward pass. On discriminative tasks, the invention provides extreme compression of long short-term memory (LSTM) RNN models, so that only a small number of input and hidden neurons are retained, with an insignificant decrease in quality. Such a small model is easy to interpret. The invention was also tested on language modeling tasks, where it provided a twofold compression of the corresponding model with only a small decrease in quality.

In the proposed solution, SparseVD is first applied to recurrent neural networks. To account for the specifics of RNNs, the inventors build on some analytical conclusions of [6], which proposes a method of applying binary dropout in RNNs that is justified from the Bayesian point of view. However, applying SparseVD does not necessarily lead to group sparsity, in which entire groups of weights, rather than individual weights, are removed from the model. A number of recent works [14], [4] proposed group sparsification methods for fully connected and convolutional networks. Building on them, the authors of the present invention propose a new approach to group sparsification of gated recurrent architectures. The main idea of this approach is that three levels of noise are introduced into the model, and the lower levels help the upper ones remove groups of weights from the model.

According to one aspect of the invention, a computer-implemented method for compressing a recurrent neural network (RNN) is provided. The method comprises performing sparsification with respect to the weights of the RNN, the sparsification comprising the steps of: i) performing optimization to obtain a posterior distribution of the weights that is approximated by a second distribution, wherein the optimization uses a prior distribution of the weights that is a first distribution, and wherein the optimization uses sampling of weights from the approximated posterior distribution, and ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights and/or the identified groups of weights from the RNN. The first distribution is preferably a fully factorized log-uniform distribution, and the second distribution is a fully factorized normal distribution.

The method further includes the following steps: introducing first multiplicative variables for elements of the set of possible elements (also referred to herein as the vocabulary) of the input sequences of the RNN; and performing sparsification with respect to the first multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and in step ii) first multiplicative variables are identified, each of which has an associated value less than the predetermined threshold, and the elements of said set that are associated with the identified first multiplicative variables are removed from the RNN. The method further includes the following operations: introducing second multiplicative variables for the input and hidden neurons of the RNN; and performing sparsification with respect to the second multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and in step ii) second multiplicative variables are identified, each of which has an associated value less than the predetermined threshold, and the input and hidden neurons associated with the identified second multiplicative variables are removed from the RNN.
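The group-removal step described above can be illustrated with a minimal sketch. It assumes that the associated value of a multiplicative variable is the ratio of its squared posterior mean to its posterior variance (the choice the description names as preferred), that removing a vocabulary element amounts to dropping its embedding row, and that removing a hidden neuron amounts to dropping its row and column of the recurrent matrix. The helper names and the toy numbers are hypothetical and are not taken from the patent.

```python
def snr(mean, variance):
    """Squared mean divided by variance of one multiplicative variable."""
    return mean * mean / variance


def surviving(z_means, z_variances, threshold=0.05):
    """Indices of group variables whose associated value reaches the threshold."""
    return [i for i, (m, v) in enumerate(zip(z_means, z_variances))
            if snr(m, v) >= threshold]


def keep_rows(matrix, keep):
    """Keep only the listed rows of a list-of-lists matrix."""
    return [matrix[i] for i in keep]


def keep_columns(matrix, keep):
    """Keep only the listed columns of a list-of-lists matrix."""
    return [[row[j] for j in keep] for row in matrix]


# Hypothetical toy model: 4 vocabulary entries, 3 hidden neurons.
vocabulary = ["good", "the", "bad", "of"]
embeddings = [[0.9, -0.2], [0.0, 0.1], [-0.8, 0.3], [0.1, 0.0]]    # one row per word
recurrent = [[0.5, 0.1, -0.3], [0.0, 0.2, 0.1], [0.4, -0.1, 0.6]]  # hidden x hidden

# Assumed posterior means and variances of the first (word) and
# second (neuron) multiplicative variables after optimization.
word_keep = surviving([1.1, 0.01, 0.9, 0.02], [0.2, 0.5, 0.1, 0.4])
neuron_keep = surviving([0.7, 0.03, 1.2], [0.1, 0.3, 0.2])

small_vocabulary = [vocabulary[i] for i in word_keep]
small_embeddings = keep_rows(embeddings, word_keep)
small_recurrent = keep_columns(keep_rows(recurrent, neuron_keep), neuron_keep)
```

In this toy setting, two of the four words and one of the three hidden neurons fall below the threshold, so the embedding and recurrent matrices shrink accordingly.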

According to a preferred embodiment of the invention, the RNN has a gated architecture, and the method further includes the following operations: introducing third multiplicative variables for the preactivations of the gates of the RNN; and performing sparsification with respect to the third multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and in step ii) a gate is made constant if the third multiplicative variable associated with that gate has an associated value less than the predetermined threshold.

According to a preferred embodiment of the invention, the gated architecture is implemented as an LSTM layer of the RNN. According to a preferred embodiment, when the third multiplicative variables are introduced, a third multiplicative variable is additionally introduced for the preactivation of the information flow in the LSTM layer, and in step ii) the information flow is additionally made constant if the third multiplicative variable associated with that information flow has an associated value less than the predetermined threshold.
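A minimal sketch may help show what "making a gate constant" can mean in practice. It assumes, as an illustration only, that the third multiplicative variable z scales the input-dependent part of the gate preactivation and that the bias is added afterwards, so that a zeroed z leaves the gate equal to the sigmoid of its bias for every input. The function names and numbers are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate(weights, inputs, bias, z):
    """One gate unit: z scales the input-dependent preactivation, bias is added after."""
    preactivation = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(z * preactivation + bias)


def sparsified_z(z_mean, z_variance, threshold=0.05):
    """Zero the multiplicative variable when its squared-mean-to-variance ratio is low."""
    return z_mean if z_mean * z_mean / z_variance >= threshold else 0.0


weights, bias = [0.4, -1.3], 0.7
z = sparsified_z(z_mean=0.01, z_variance=0.5)  # ratio 0.0002 < 0.05, so z becomes 0.0

# With z = 0 the gate no longer depends on its inputs.
g1 = gate(weights, [1.0, 2.0], bias, z)
g2 = gate(weights, [-3.0, 0.5], bias, z)
```

Under this assumption, a constant gate needs no matrix-vector product at inference time, which is consistent with the forward-pass speed-up the description attributes to constant gates.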

The associated value is preferably the ratio of the squared mean to the variance. The predetermined threshold is preferably 0.05. The elements of said set may be words.
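As a small worked example of this criterion, the sketch below computes the ratio of the squared mean to the variance for each weight and removes (zeroes) the weights whose ratio is below 0.05. The function names and the numeric values are hypothetical; only the ratio and the threshold come from the description.

```python
def signal_to_noise(mean, variance):
    """Associated value: squared posterior mean divided by posterior variance."""
    return mean * mean / variance


def prune_by_snr(means, variances, threshold=0.05):
    """Return the weights with low-ratio entries zeroed, plus the keep mask."""
    keep = [signal_to_noise(m, v) >= threshold for m, v in zip(means, variances)]
    pruned = [m if k else 0.0 for m, k in zip(means, keep)]
    return pruned, keep


# Assumed posterior means and variances of four weights.
means = [0.8, 0.01, -0.5, 0.002]
variances = [0.1, 0.04, 0.2, 0.001]

# Ratios are 6.4, 0.0025, 1.25 and 0.004, so the second and fourth weights are removed.
pruned, keep = prune_by_snr(means, variances)
```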

This method is applicable to text classification or language modeling.

According to another aspect of the invention, a device for compressing an RNN with a gated architecture is provided. The device comprises: one or more processors, and one or more computer-readable storage media storing computer-executable instructions.

The computer-executable instructions, when executed by the one or more processors, cause the one or more processors to: perform sparsification with respect to the weights of the RNN, the sparsification comprising: i) performing optimization to obtain a posterior distribution of the weights that is approximated by a fully factorized normal distribution, wherein the optimization uses a prior distribution of the weights that is a factorized log-uniform distribution, and wherein weights are sampled from the approximated posterior distribution during the optimization, and ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights and/or the identified groups of weights from the RNN; introduce first multiplicative variables for elements of the input set of possible elements of the input sequences of the RNN; perform sparsification with respect to the first multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and operation ii) comprises identifying first multiplicative variables, each of which has an associated value less than the predetermined threshold, and removing from the RNN the elements of said set that are associated with the identified first multiplicative variables; introduce second multiplicative variables for the input and hidden neurons of the RNN; perform sparsification with respect to the second multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and operation ii) comprises identifying second multiplicative variables, each of which has an associated value less than the predetermined threshold, and removing from the RNN the input and hidden neurons associated with the identified second multiplicative variables; introduce third multiplicative variables for the preactivations of the gates of the RNN; and perform sparsification with respect to the third multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and in operation ii) a gate is made constant if the third multiplicative variable associated with that gate has an associated value less than the predetermined threshold.

According to yet another aspect of the invention, one or more computer-readable storage media storing computer-executable instructions are provided. The computer-executable instructions, when executed by one or more processors of a computing device, cause the one or more processors to perform operations for compressing an RNN with a gated architecture that is implemented as an LSTM layer of the RNN.

Said operations include the following: performing sparsification with respect to the weights of the RNN, comprising the steps of: i) performing optimization to obtain a posterior distribution of the weights that is approximated by a fully factorized normal distribution, wherein the optimization uses a prior distribution of the weights that is a fully factorized log-uniform distribution, and wherein weights are sampled from the approximated posterior distribution during the optimization, and ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights and/or the identified groups of weights from the RNN; introducing first multiplicative variables for elements of the input set of possible elements of the input sequences of the RNN; performing sparsification with respect to the first multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and in step ii) first multiplicative variables are identified, each of which has an associated value less than the predetermined threshold, and the elements of said set that are associated with the identified first multiplicative variables are removed from the RNN; introducing second multiplicative variables for the input and hidden neurons of the RNN; performing sparsification with respect to the second multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and in step ii) second multiplicative variables are identified, each of which has an associated value less than the predetermined threshold, and the input and hidden neurons associated with the identified second multiplicative variables are removed from the RNN; introducing third multiplicative variables for the preactivations of the gates and of the information flow in the LSTM layer of the RNN; and performing sparsification with respect to the third multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and in step ii) a gate is made constant if the third multiplicative variable associated with that gate has an associated value less than the predetermined threshold, and the information flow is made constant if the third multiplicative variable associated with that information flow has an associated value less than the predetermined threshold.

The inventive contribution is as follows: (i) SparseVD and its group modifications are adapted to RNNs, and the specifics of the resulting model are explained, and (ii) the model is generalized by introducing multiplicative weights for words in order to purposefully sparsify the vocabulary, and by introducing multiplicative weights for the preactivations of the gates and of the information flow in order to purposefully make gates and components of the information flow constant. The results show that Sparse Variational Dropout provides a very high level of sparsity in recurrent models without a significant drop in quality. Models with additional vocabulary sparsification increase the compression rate on text classification tasks but do not improve the results as much on language modeling tasks. In classification tasks, the vocabulary is compressed tens of times, and the choice of words is interpretable.

Brief Description of the Drawings

Fig. 1 is a flowchart of a method for compressing a recurrent neural network according to an embodiment of the present invention;

Fig. 2 is a high-level block diagram of a computing device in which aspects of the present invention may be implemented.

Detailed Description

Next, the aspects underlying the proposed approach are described, including Bayesian sparsification methods for feed-forward networks, and then the approach itself is disclosed.

2. Preliminaries

2.1. Байесовские нейронные сети2.1. Bayesian Neural Networks

Consider a neural network with weights $\omega$ that models the dependence of the target variables $y = \{y^1, \dots, y^\ell\}$ on the corresponding input objects $X = \{x^1, \dots, x^\ell\}$. In a Bayesian neural network the weights $\omega$ are treated as random variables. Given a prior distribution $p(\omega)$, one searches for the posterior distribution $p(\omega \mid X, y)$, which makes it possible to find the expected target value during inference. For neural networks the true posterior is usually intractable, but it can be approximated by some parametric distribution $q_\lambda(\omega)$. The quality of this approximation is measured by the KL divergence $KL(q_\lambda(\omega) \,\|\, p(\omega \mid X, y))$. The optimal parameter $\lambda$ can be found by maximizing the variational lower bound with respect to $\lambda$:

$$\mathcal{L} = \sum_{i=1}^{\ell} \mathbb{E}_{q_\lambda(\omega)} \log p(y^i \mid x^i, \omega) - KL(q_\lambda(\omega) \,\|\, p(\omega)) \rightarrow \max_{\lambda} \qquad (1)$$

The expected log-likelihood term in (1) is usually approximated by Monte Carlo (MC) sampling. To obtain an unbiased MC estimate, the weights are parameterized by a deterministic function $\omega = g(\lambda, \xi)$, where $\xi$ is sampled from some non-parametric (parameter-free) distribution (the reparameterization trick [12]). The KL-divergence term in (1) acts as a regularizer and is usually computed or approximated analytically.

It should be emphasized that the main advantage of Bayesian sparsification methods is that they have a small number of hyperparameters compared with pruning-based methods. In addition, they provide a higher level of sparsity ([18], [14], [4]).

2.2. Sparse Variational Dropout

Dropout ([24]) is a standard technique for regularizing neural networks. It multiplies the inputs of each layer by a randomly sampled noise vector. The elements of this vector are usually sampled from a Bernoulli or a normal distribution whose parameters are tuned by cross-validation. Kingma et al. ([13]) interpreted Gaussian dropout from a Bayesian point of view, which allows the dropout parameters to be tuned automatically during model training. This model was later extended to sparsify fully connected and convolutional neural networks and was named sparse variational dropout (SparseVD) ([18]).

Consider one fully connected layer of a feed-forward neural network with an input of size $n$, an output of size $m$ and a weight matrix $W$. Following Kingma et al. ([13]), in SparseVD the prior on the weights is a fully factorized log-uniform distribution

$$p(|w_{ij}|) \propto \frac{1}{|w_{ij}|}$$

and the posterior is searched for in the form of a fully factorized normal distribution:

$$q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \alpha_{ij}\theta_{ij}^2) \qquad (2)$$

Using this form of the posterior is equivalent to applying multiplicative ([13]) or additive ([18]) normal noise to the weights as follows:

$$w_{ij} = \theta_{ij}\xi_{ij}, \quad \xi_{ij} \sim \mathcal{N}(1, \alpha_{ij}) \qquad (3)$$

$$w_{ij} = \theta_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2), \quad \alpha_{ij} = \frac{\sigma_{ij}^2}{\theta_{ij}^2} \qquad (4)$$

Representation (4) is called additive reparameterization ([18]). It reduces the variance of the gradients of $\mathcal{L}$ with respect to $\theta_{ij}$. In addition, since a sum of normal distributions is a normal distribution with computable parameters, the noise can be applied to the preactivation (the input vector multiplied by the weight matrix $W$) instead of to $W$. This technique is called the local reparameterization trick ([26], [13]); it reduces the variance of the gradients even further and makes training more efficient.
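The relationship between the multiplicative and the additive form of the noise can be checked numerically. The sketch below assumes the parameterizations $w = \theta(1 + \sqrt{\alpha}\,\epsilon)$ and $w = \theta + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ and $\alpha = \sigma^2/\theta^2$; the function names and the numeric values are illustrative and do not come from the source.

```python
import math
import random

def sample_multiplicative(theta, alpha, eps):
    """w = theta * xi with xi ~ N(1, alpha), written as xi = 1 + sqrt(alpha) * eps."""
    return theta * (1.0 + math.sqrt(alpha) * eps)

def sample_additive(theta, sigma, eps):
    """w = theta + sigma * eps with eps ~ N(0, 1)."""
    return theta + sigma * eps

theta, sigma = 0.5, 0.2           # illustrative values (assumption)
alpha = sigma ** 2 / theta ** 2   # alpha = sigma^2 / theta^2

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)         # one shared standard normal draw
w_mult = sample_multiplicative(theta, alpha, eps)
w_add = sample_additive(theta, sigma, eps)
```

For a positive mean the two samples coincide for the same $\epsilon$; the forms differ only in which parameters the optimizer sees, which is where the lower gradient variance of the additive form comes from.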

In SparseVD the variational lower bound (1) is optimized with respect to $\{\Theta, \log\sigma\}$. The KL divergence factorizes over the individual weights, and its terms depend only on $\alpha_{ij}$ because of the specific choice of the prior ([13]):

$$KL(q(W \mid \Theta, \alpha) \,\|\, p(W)) = \sum_{ij} KL(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \qquad (5)$$

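Each term can be approximated as follows ([18]):

$$-KL(q(w_{ij} \mid \theta_{ij}, \alpha_{ij}) \,\|\, p(w_{ij})) \approx k(\alpha_{ij}), \quad k(\alpha) = k_1\,\mathrm{sigm}(k_2 + k_3\log\alpha) - 0.5\log\left(1 + \frac{1}{\alpha}\right) + C \qquad (6)$$

where $\mathrm{sigm}(\cdot)$ is the logistic sigmoid, $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$ and $C = -k_1$.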
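The approximation of the KL term can be evaluated directly. The sketch below uses the constants published by Molchanov et al. ([18]) and is written as a function of $\log\alpha$; the function name `neg_kl_approx` is hypothetical.

```python
import math

# Constants of the approximation reported by Molchanov et al. ([18]).
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neg_kl_approx(log_alpha):
    """Approximate -KL(q(w_ij) || p(w_ij)) as a function of log(alpha_ij)."""
    return (K1 * sigmoid(K2 + K3 * log_alpha)
            - 0.5 * math.log1p(math.exp(-log_alpha))
            - K1)

values = [neg_kl_approx(a) for a in (-5.0, 0.0, 5.0, 20.0)]
```

The value increases toward zero as $\alpha$ grows, so maximizing the lower bound rewards large $\alpha_{ij}$, which is the mechanism behind the sparsity.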

The KL-divergence term encourages large values of $\alpha_{ij}$. If $\alpha_{ij} \to \infty$ for a weight $w_{ij}$, the posterior of this weight is a normal distribution with a large variance, and it is beneficial for the model to set $\theta_{ij} = 0$ and $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2 = 0$ in order to avoid prediction errors. As a result, the posterior of $w_{ij}$ approaches a delta function centered at zero, and the weight does not affect the network output and can be ignored.

2.3. Sparse Variational Dropout for Group Sparsity

In [4] the SparseVD model was extended to achieve group sparsity. Group sparsity means that the weights are divided into groups and that whole groups are removed instead of individual weights. As an example, consider the groups of weights that correspond to one input neuron in a fully connected layer, and number these groups $1, \dots, n$.

To achieve group sparsity, the inventors propose introducing additional multiplicative weights $z_i$ for each group and training the weights in the following form:

$$w_{ij} = \hat{w}_{ij} z_i$$

In a fully connected layer this is equivalent to putting multiplicative variables on the input of the layer. Since the main goal is to obtain $z_i = 0$ and remove the neuron from the model, the same prior-posterior pair is used for $z_i$ as in SparseVD:

$$p(|z_i|) \propto \frac{1}{|z_i|}, \qquad q(z_i \mid \theta^z_i, \sigma^z_i) = \mathcal{N}(z_i \mid \theta^z_i, (\sigma^z_i)^2)$$

For the individual weights $\hat{w}_{ij}$, the inventors use a standard normal prior and a normal approximate posterior with trainable mean and variance:

$$p(\hat{w}_{ij}) = \mathcal{N}(\hat{w}_{ij} \mid 0, 1), \qquad q(\hat{w}_{ij} \mid \theta_{ij}, \sigma_{ij}) = \mathcal{N}(\hat{w}_{ij} \mid \theta_{ij}, \sigma_{ij}^2)$$

In this model the prior on the individual weights encourages $\theta_{ij} \to 0$, which helps push the group means $\theta^z_i$ toward zero.
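The equivalence between scaling a group of weights and scaling the corresponding input can be illustrated with a small example. The sketch assumes the weight matrix is stored with one row per output neuron, so that the group of input neuron $i$ is column $i$; all names and values are illustrative.

```python
def matvec(W, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_hat = [[0.5, -1.0, 2.0],
         [1.5, 0.3, -0.7]]   # 2 output neurons, 3 input neurons (assumed layout)
z = [1.0, 0.0, 0.5]          # one multiplicative variable per input neuron
x = [2.0, 3.0, -1.0]

# Variant 1: scale every weight of group i by z_i, then apply the layer.
W = [[w * z[i] for i, w in enumerate(row)] for row in W_hat]
out_scaled_weights = matvec(W, x)

# Variant 2: scale the input by z, then apply the unscaled weights.
out_scaled_input = matvec(W_hat, [xi * zi for xi, zi in zip(x, z)])

# With z[1] == 0 the second input neuron has no effect on the output.
out_other_input = matvec(W, [2.0, 100.0, -1.0])
```

Because `z[1]` is zero, the second input neuron and its whole column of weights can be dropped from the layer without changing its output.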

3. Proposed Method

This section first describes the basic approach to Bayesian sparsification of recurrent neural networks and then introduces a method for group Bayesian sparsification of long short-term memory (LSTM) networks. LSTM is considered here because it is currently one of the most popular recurrent architectures.

3.1. Bayesian Sparsification of Recurrent Neural Networks

A recurrent neural network takes a sequence $x = [x_0, \dots, x_T]$, $x_t \in \mathbb{R}^n$, as input and maps it to a sequence of hidden states:

$$h_t = f_h(x_t, h_{t-1}) = g_h(W^x x_t + W^h h_{t-1} + b_1), \quad h_t \in \mathbb{R}^m \qquad (7)$$

Throughout the description it is assumed that the output of the RNN depends only on the last hidden state:

$$y = f_y(h_T) = g_y(W^y h_T + b_2) \qquad (8)$$

Here $g_h$ and $g_y$ are some nonlinear functions. However, all the methods considered below can be applied to more complex cases, for example to a language model with several outputs for one input sequence (one output per time step).

To sparsify the weights, SparseVD is applied to the RNN. However, recurrent neural networks have certain specifics that must be taken into account when constructing the probabilistic model.

Following Molchanov et al. ([18]), a fully factorized log-uniform prior is used, and the posterior is approximated by a fully factorized normal distribution over the weights $\omega = \{W^x, W^h\}$:

$$q(w^x_{ij} \mid \theta^x_{ij}, \sigma^x_{ij}) = \mathcal{N}(w^x_{ij} \mid \theta^x_{ij}, (\sigma^x_{ij})^2), \qquad q(w^h_{ij} \mid \theta^h_{ij}, \sigma^h_{ij}) = \mathcal{N}(w^h_{ij} \mid \theta^h_{ij}, (\sigma^h_{ij})^2) \qquad (9)$$

where $\sigma^x_{ij}$ and $\sigma^h_{ij}$ have the same meaning as in the additive reparameterization (4).

To train this model, an approximation of the variational lower bound

$$\sum_{i=1}^{\ell} \int q(\omega \mid \Theta, \sigma) \log p\big(y^i \mid f_y(f_h(x^i_T, f_h(x^i_{T-1}, \dots)))\big)\, d\omega - \sum_{i,j} KL(q(w^x_{ij}) \,\|\, p(w^x_{ij})) - \sum_{i,j} KL(q(w^h_{ij}) \,\|\, p(w^h_{ij})) \qquad (10)$$

is maximized with respect to the parameters $\{\Theta, \log\sigma\}$ using stochastic minibatch optimization. The recurrence in the expected log-likelihood term is unfolded as in (7), and the KL term is approximated using (6). The integral in (10) is estimated with a single sample $\hat{\omega} \sim q(\omega \mid \Theta, \sigma)$ per minibatch. The reparameterization trick (for an unbiased estimate of the integral) and additive reparameterization (for lower gradient variance) are used to sample the input-to-hidden weight matrix $W^x$ and the hidden-to-hidden weight matrix $W^h$.

The local reparameterization trick cannot be applied either to the hidden-to-hidden matrix $W^h$ or to the input-to-hidden matrix $W^x$. Since using three-dimensional noise (two dimensions for $W^h$ plus the minibatch size) is too resource-intensive, one noise matrix is sampled for all objects in a minibatch for efficiency:

$$w^x_{ij} = \theta^x_{ij} + \epsilon^x_{ij}\sigma^x_{ij}, \quad \epsilon^x_{ij} \sim \mathcal{N}(0, 1) \qquad (11)$$

$$w^h_{ij} = \theta^h_{ij} + \epsilon^h_{ij}\sigma^h_{ij}, \quad \epsilon^h_{ij} \sim \mathcal{N}(0, 1) \qquad (12)$$

The resulting method works as follows: the input-to-hidden and hidden-to-hidden weight matrices are sampled (one sample per minibatch), the variational lower bound (10) is optimized with respect to $\{\Theta, \log\sigma\}$, and for many weights the posterior takes the form of a delta function at zero, because the KL divergence encourages sparsity. These weights can then be safely removed from the model.
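The sampling scheme can be sketched as a forward pass in which each weight matrix is drawn once per minibatch and reused for every sequence and time step. The sketch assumes a tanh nonlinearity, a zero initial hidden state and the parameterization $w = \theta + \sigma\epsilon$; the dictionary keys, shapes and values are illustrative and not taken from the source.

```python
import math
import random

def sample_weights(theta, sigma, rng):
    """Draw w_ij = theta_ij + sigma_ij * eps_ij with eps_ij ~ N(0, 1)."""
    return [[t + s * rng.gauss(0.0, 1.0) for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(theta, sigma)]

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def rnn_last_state(Wx, Wh, b, xs):
    """Unroll h_t = tanh(Wx x_t + Wh h_{t-1} + b); return the last hidden state."""
    h = [0.0] * len(b)  # assumed zero initial state
    for x in xs:
        pre = [a + c + bi for a, c, bi in zip(matvec(Wx, x), matvec(Wh, h), b)]
        h = [math.tanh(p) for p in pre]
    return h

def forward_minibatch(params, batch, rng):
    # One sample of each weight matrix per minibatch: the same noisy weights
    # are shared by all sequences in the batch and by all time steps.
    Wx = sample_weights(params["theta_x"], params["sigma_x"], rng)
    Wh = sample_weights(params["theta_h"], params["sigma_h"], rng)
    return [rnn_last_state(Wx, Wh, params["b"], xs) for xs in batch]

params = {
    "theta_x": [[0.5, -0.3], [0.1, 0.8]],
    "sigma_x": [[0.1, 0.1], [0.1, 0.1]],
    "theta_h": [[0.2, 0.0], [-0.4, 0.6]],
    "sigma_h": [[0.05, 0.05], [0.05, 0.05]],
    "b": [0.0, 0.1],
}
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
batch = [seq, seq]  # two identical sequences
states = forward_minibatch(params, batch, random.Random(0))
```

Because the noise is shared across the minibatch, two identical sequences produce identical hidden states; with per-object noise they would differ.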

For LSTM the same prior-posterior pair is used for all input-to-hidden and hidden-to-hidden matrices, and all computations remain the same. The noise matrices for the input-to-hidden and hidden-to-hidden connections are sampled separately for each of the gates $i$, $o$, $f$ and for the information flow $g$.

3.2. Group Bayesian Sparsification of LSTM

In [4] there are two levels of noise: noise on groups of weights and noise on individual weights. However, popular recurrent neural networks usually have a more complex gated structure, which can be exploited to achieve better compression and acceleration. An LSTM has an internal memory $c_t$, and three gates control updating, erasing and outputting information from this memory:

$$i = \mathrm{sigm}(W^h_i h_{t-1} + W^x_i x_t + b_i), \quad f = \mathrm{sigm}(W^h_f h_{t-1} + W^x_f x_t + b_f) \qquad (13)$$

$$g = \tanh(W^h_g h_{t-1} + W^x_g x_t + b_g), \quad o = \mathrm{sigm}(W^h_o h_{t-1} + W^x_o x_t + b_o) \qquad (14)$$

$$c_t = f \odot c_{t-1} + i \odot g, \quad h_t = o \odot \tanh(c_t) \qquad (15)$$

To account for this gated structure, it is proposed to introduce an intermediate level of noise into the LSTM layer, in addition to the noise on the weights and on the input ($z^x$) and hidden ($z^h$) neurons. Specifically, multiplicative noise $z^i$, $z^f$, $z^o$, $z^g$ is applied to the preactivations of each gate and of the information flow $g$. The resulting LSTM layer is as follows:

$$i = \mathrm{sigm}\big((W^h_i (h_{t-1} \odot z^h) + W^x_i (x_t \odot z^x)) \odot z^i + b_i\big) \qquad (16)$$

$$f = \mathrm{sigm}\big((W^h_f (h_{t-1} \odot z^h) + W^x_f (x_t \odot z^x)) \odot z^f + b_f\big) \qquad (17)$$

$$g = \tanh\big((W^h_g (h_{t-1} \odot z^h) + W^x_g (x_t \odot z^x)) \odot z^g + b_g\big) \qquad (18)$$

$$o = \mathrm{sigm}\big((W^h_o (h_{t-1} \odot z^h) + W^x_o (x_t \odot z^x)) \odot z^o + b_o\big) \qquad (19)$$

$$c_t = f \odot c_{t-1} + i \odot g, \quad h_t = o \odot \tanh(c_t) \qquad (20)$$

This model is equivalent to putting group multiplicative variables not only on the columns of the weight matrices (as in [4]) but also on their rows. For example, for the matrix $W^h_f$ this parameterization is:

$$w^h_{f,ij} = \hat{w}^h_{f,ij} \cdot z^h_i \cdot z^f_j$$

The formulas for the other seven LSTM weight matrices are obtained in the same way.

As in [4], when some component of $z^x$ or $z^h$ approaches zero, the corresponding neuron can be removed from the model. A similar property also holds for the gates: when some component of $z^i$, $z^f$, $z^o$ or $z^g$ approaches zero, the corresponding gate or information flow component becomes constant. This means that the gate does not need to be computed, and the forward pass through the LSTM is accelerated.
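The constant-gate property can be seen in a minimal LSTM step with group variables. The sketch assumes the layer form in which each gate preactivation is computed as $(W^h(h_{t-1} \odot z^h) + W^x(x_t \odot z^x)) \odot z^{gate} + b$; the parameter layout (dictionaries keyed by `"i"`, `"f"`, `"g"`, `"o"`) and all numeric values are illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def preactivation(Wh, Wx, b, z_gate, h_in, x_in):
    """(Wh h + Wx x) * z_gate + b, elementwise over the hidden units."""
    s = [a + c for a, c in zip(matvec(Wh, h_in), matvec(Wx, x_in))]
    return [si * zi + bi for si, zi, bi in zip(s, z_gate, b)]

def lstm_step(p, z, x, h_prev, c_prev):
    x_in = [xi * zi for xi, zi in zip(x, z["x"])]       # noise on input neurons
    h_in = [hi * zi for hi, zi in zip(h_prev, z["h"])]  # noise on hidden neurons
    pre = {k: preactivation(p["Wh"][k], p["Wx"][k], p["b"][k], z[k], h_in, x_in)
           for k in "ifgo"}
    gates = {
        "i": [sigmoid(a) for a in pre["i"]],
        "f": [sigmoid(a) for a in pre["f"]],
        "g": [math.tanh(a) for a in pre["g"]],
        "o": [sigmoid(a) for a in pre["o"]],
    }
    c = [f * cp + i * g
         for f, cp, i, g in zip(gates["f"], c_prev, gates["i"], gates["g"])]
    h = [o * math.tanh(ct) for o, ct in zip(gates["o"], c)]
    return h, c, gates

Wh = [[0.2, -0.1], [0.4, 0.3]]
Wx = [[0.3, -0.2], [0.1, 0.4]]
p = {
    "Wh": {k: Wh for k in "ifgo"},   # same matrices for all gates, for brevity
    "Wx": {k: Wx for k in "ifgo"},
    "b": {k: [0.5, -0.5] for k in "ifgo"},
}
# The first component of the forget-gate variable has been driven to zero.
z = {"x": [1.0, 1.0], "h": [1.0, 1.0],
     "i": [1.0, 1.0], "f": [0.0, 1.0], "g": [1.0, 1.0], "o": [1.0, 1.0]}

h_prev, c_prev = [0.1, -0.1], [0.0, 0.0]
_, _, gates_a = lstm_step(p, z, [1.0, 0.0], h_prev, c_prev)
_, _, gates_b = lstm_step(p, z, [0.0, 1.0], h_prev, c_prev)
```

For both inputs the first forget-gate component equals the sigmoid of its bias, so at inference it could be precomputed once and the corresponding rows of the gate's weight matrices skipped.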

In addition, the new intermediate level of noise helps to sparsify input and hidden neurons. The resulting three-level hierarchy works as follows: the noise on individual weights makes it possible to zero out individual weights, the intermediate noise on the gates and the information flow improves the sparsification of the intermediate variables (the gates and the information flow), and the last level of noise on the neurons helps to sparsify whole neurons.

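In [4] a standard normal prior is placed on the individual weights. For example, the model for the components of $W^h_f$ is as follows:

$$p(\hat{w}^h_{f,ij}) = \mathcal{N}(\hat{w}^h_{f,ij} \mid 0, 1), \quad q(\hat{w}^h_{f,ij}) = \mathcal{N}(\hat{w}^h_{f,ij} \mid \theta^h_{f,ij}, (\sigma^h_{f,ij})^2) \qquad (21)$$

$$p(|z^h_i|) \propto \frac{1}{|z^h_i|}, \quad q(z^h_i) = \mathcal{N}(z^h_i \mid \theta^h_i, (\sigma^h_i)^2) \qquad (22)$$

$$p(|z^f_j|) \propto \frac{1}{|z^f_j|}, \quad q(z^f_j) = \mathcal{N}(z^f_j \mid \theta^f_j, (\sigma^f_j)^2) \qquad (23)$$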

The inventors have shown experimentally that using a log-uniform prior instead of a standard normal prior for the individual weights improves the sparsification of the group variables. Therefore the same prior-posterior pair as in SparseVD is used for all variables.

To train the model, the same procedure is used as in SparseVD for RNNs, but in addition to sampling the weights $\widehat{W}$, the multiplicative group variables are also sampled.

4. Bayesian Compression for Natural Language Processing

In natural language processing tasks, most of the weights of an RNN are often concentrated in the first layer, which is tied to the vocabulary, for example in the embedding layer. For some tasks, however, most of the words are not needed for accurate predictions. In the proposed model the inventors introduce multiplicative weights for words in order to sparsify the vocabulary (see subsection 4.3). These multiplicative weights are driven to zero during training, which filters the corresponding unnecessary words out of the model. This makes it possible to increase the sparsity level of the RNN even further.

4.1. Notation

In the rest of the description, $x = [x_0, \dots, x_T]$ is an input sequence, $y$ is the true output and $\hat{y}$ is the output predicted by the RNN ($y$ and $\hat{y}$ may be vectors, sequences of vectors, and so on). $X, Y$ denotes the training set $\{(x^1, y^1), \dots, (x^N, y^N)\}$. All weights of the RNN except the biases are denoted by $\omega$, and a single weight (an element of any weight matrix) is denoted by $w_{ij}$. Note that the biases are treated separately and denoted by $B$, because they are not sparsified.

For definiteness, the inventors illustrate the model on an example architecture for the language modeling task, where $y = [x_1, \dots, x_T]$:

● embedding layer: $\tilde{x}_t = w^e_{x_t}$;

● recurrent layer: $h_{t+1} = \mathrm{sigm}(W^h h_t + W^x \tilde{x}_{t+1} + b^r)$;

● fully connected layer: $\hat{x}_{t+1} = \mathrm{softmax}(W^d h_t + b^d)$.

In this example $\omega = \{W^e, W^x, W^h, W^d\}$ and $B = \{b^r, b^d\}$. However, the model can be applied directly to any recurrent architecture.

4.2. Sparse Variational Dropout for RNNs

As noted above, following [4], [18], the inventors place a fully factorized log-uniform prior on the weights:

$$p(|w_{ij}|) \propto \frac{1}{|w_{ij}|}, \quad w_{ij} \in \omega,$$

and approximate the posterior with a fully factorized normal distribution:

$$q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \sigma_{ij}^2).$$

The posterior approximation problem $\min_{\theta, \sigma, B} KL(q(\omega \mid \theta, \sigma) \,\|\, p(\omega \mid X, Y, B))$ is equivalent to optimizing the variational lower bound ([18]):

$$-\sum_{i=1}^{N} \int q(\omega \mid \theta, \sigma) \log p(y^i \mid x^i_0, \dots, x^i_T, \omega, B)\, d\omega + \sum_{w_{ij} \in \omega} KL(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})) \rightarrow \min_{\theta, \sigma, B} \qquad (24)$$

Here the first term, a task-specific loss function, is approximated using a single sample from $q(\omega \mid \theta, \sigma)$. The second term is a regularizer that makes the posterior more similar to the prior and encourages sparsity. This regularizer can be approximated analytically with high accuracy:

$$-KL(q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) \,\|\, p(w_{ij})) \approx k\left(\frac{\sigma_{ij}^2}{\theta_{ij}^2}\right), \qquad (25)$$

$$k(\alpha) = 0.63576\,\mathrm{sigm}(1.87320 + 1.48695\log\alpha) - 0.5\log\left(1 + \frac{1}{\alpha}\right) - 0.63576.$$

To obtain an unbiased estimate of the integral, sampling from the posterior is performed using the reparameterization trick [12]:

$$w_{ij} = \theta_{ij} + \sigma_{ij}\epsilon_{ij}, \quad \epsilon_{ij} \sim \mathcal{N}(0, 1). \qquad (26)$$

An important difference between RNNs and feed-forward networks is that the same weights are used at different time steps. Therefore the same sample of the weights should be used for every time step $t$ when computing the likelihood $p(y^i \mid x^i_0, \dots, x^i_T, \omega, B)$ ([6], [7], [5]).

Kingma et al. [13] and Molchanov et al. [18] also use the local reparameterization trick (LRT), in which preactivations are sampled instead of individual weights. For example,

$$(W^x x_t)_i = \sum_j \theta^x_{ij} x_{tj} + \epsilon_i \sqrt{\sum_j (\sigma^x_{ij})^2 x_{tj}^2}.$$
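For a single application of a layer, the local reparameterization trick can be sketched as follows. The sketch assumes independent normal weights $w_j \sim \mathcal{N}(\theta_j, \sigma_j^2)$ for one output unit and compares the analytic moments of the preactivation with a reference sampler that draws every weight; all names and values are illustrative.

```python
import math
import random

def lrt_moments(theta_row, sigma_row, x):
    """Mean and variance of sum_j w_j x_j for independent w_j ~ N(theta_j, sigma_j^2)."""
    mean = sum(t * xj for t, xj in zip(theta_row, x))
    var = sum((s * xj) ** 2 for s, xj in zip(sigma_row, x))
    return mean, var

def sample_preactivation(theta_row, sigma_row, x, rng):
    """Local reparameterization: sample the preactivation directly."""
    mean, var = lrt_moments(theta_row, sigma_row, x)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

def sample_via_weights(theta_row, sigma_row, x, rng):
    """Reference: sample every weight, then compute the preactivation."""
    w = [t + s * rng.gauss(0.0, 1.0) for t, s in zip(theta_row, sigma_row)]
    return sum(wj * xj for wj, xj in zip(w, x))

theta_row, sigma_row, x = [0.5, -1.0], [0.1, 0.2], [2.0, 1.0]
mean, var = lrt_moments(theta_row, sigma_row, x)
one_draw = sample_preactivation(theta_row, sigma_row, x, random.Random(1))
```

The direct sampler needs one noise value per output unit instead of one per weight, which is why the trick is attractive wherever a weight matrix is used only once per forward pass.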

Tied weight sampling makes LRT inapplicable to weight matrices that are used at more than one time step in an RNN.

For the hidden-to-hidden matrix $W^h$, the linear combination $(W^h h_t)$ is not normally distributed, because $h_t$ depends on $W^h$ from the previous time step. As a result, the rule about the sum of independent normal distributions with constant coefficients does not apply. In practice, a network with LRT on the hidden-to-hidden weights cannot be trained properly.

For the input-to-hidden matrix $W^x$, the linear combination $(W^x x_t)$ is normally distributed. However, sampling the same $W^x$ for all time steps is not equivalent to sampling the same noise $\epsilon_i$ on the preactivations for all time steps. The same sample of $W^x$ corresponds to different noise samples $\epsilon_i$ at different time steps, because the inputs $x_t$ differ. Hence, in theory, LRT is not applicable here. In practice, networks with LRT on the input-to-hidden weights may give similar results, and in some experiments they even converge slightly faster.

Since the training procedure is efficient only with a two-dimensional noise tensor, the inventors propose sampling the noise on the weights per minibatch rather than per individual object.

To summarize, the training procedure is as follows. To perform the forward pass for a minibatch, the inventors first sample all weights $\omega$ according to (26) and then apply the RNN as usual. Next, the gradients of (24) are computed with respect to $\theta$, $\log\sigma$ and $B$. At the testing stage the mean weights $\theta$ are used [18]. The regularizer (25) pushes most components of $\theta$ toward zero, and the weights are sparsified. More precisely, weights with a low signal-to-noise ratio $\theta_{ij}^2 / \sigma_{ij}^2 < \tau$ are removed [18].
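The final pruning step can be sketched as a threshold on the signal-to-noise ratio. The sketch assumes the ratio is defined as $\theta_{ij}^2/\sigma_{ij}^2$ and uses a hypothetical threshold `tau`; neither the function name nor the threshold value comes from the source.

```python
def prune_by_snr(theta, sigma, tau):
    """Keep weight (i, j) only if theta_ij^2 / sigma_ij^2 >= tau (assumed criterion)."""
    # Written without division so that sigma_ij == 0 is handled safely.
    mask = [[t * t >= tau * s * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(theta, sigma)]
    pruned = [[t if keep else 0.0 for t, keep in zip(t_row, m_row)]
              for t_row, m_row in zip(theta, mask)]
    return pruned, mask

theta = [[0.8, 0.01], [-0.5, 0.002]]   # posterior means (illustrative)
sigma = [[0.1, 0.2], [0.1, 0.3]]       # posterior standard deviations (illustrative)
pruned, mask = prune_by_snr(theta, sigma, tau=1.0)
sparsity = sum(not keep for row in mask for keep in row) / 4
```

At test time only the retained means are used, so the zeroed entries can be stored in a sparse format or, when a whole row or column is zero, removed together with the corresponding neuron.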

4.3. Multiplicative Weights for Vocabulary Sparsification

One advantage of Bayesian sparsification is that it generalizes easily to the sparsification of any group of weights without complicating the training procedure ([4]). To do so, a shared multiplicative weight is introduced for each group, and removing this multiplicative weight means removing the corresponding group. The inventors use this approach to sparsify the vocabulary.

Specifically, the inventors introduce multiplicative probabilistic weights $z \in \mathbb{R}^V$ for the words in the vocabulary (here $V$ is the vocabulary size). The forward pass with $z$ is as follows:

1. sample a vector $z^i$ from the current approximation of the posterior for each input sequence $x^i$ in the minibatch;

2. multiply each element $x^i_t$ (a one-hot vector of zeros with a single one) of the sequence $x^i$ by $z^i$ (here both $x^i_t$ and $z^i$ are $V$-dimensional);

3. continue the forward pass as usual.

The inventors treat z in the same way as the other weights W: a log-uniform prior distribution is used, and the posterior distribution is approximated by a fully factorized normal distribution with a trainable mean and variance. However, since z is a one-dimensional vector, it can be sampled separately for each object in the minibatch to reduce the variance of the gradients. After training, the elements of z with a low signal-to-noise ratio are pruned; the corresponding words of the vocabulary are then no longer used, and the corresponding columns of weights are removed from the embedding or input→hidden weight matrices.
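The three steps above and the pruning that follows training can be sketched as below. This is an illustration under stated assumptions, not the inventors' implementation: the sizes, the values of `z_mean` and `z_log_sigma`, and the helper names are hypothetical, and the signal-to-noise ratio is taken as the squared mean divided by the variance with the threshold 0.05 mentioned elsewhere in this description.

```python
import numpy as np

rng = np.random.default_rng(0)
V, emb_dim = 6, 3  # hypothetical vocabulary and embedding sizes

# Variational parameters of the per-word multiplicative weights z.
z_mean = np.array([1.0, 0.9, 0.001, 1.1, 0.002, 0.8])
z_log_sigma = np.full(V, -3.0)
embedding = rng.standard_normal((emb_dim, V))  # one column per word

def forward_inputs(token_ids, rng):
    """Steps 1-2: sample z once per sequence and scale each one-hot input by it."""
    z = z_mean + np.exp(z_log_sigma) * rng.standard_normal(V)
    one_hot = np.eye(V)[token_ids]   # shape (T, V)
    scaled = one_hot * z             # element-wise product with z
    return scaled @ embedding.T      # shape (T, emb_dim), passed on as usual

def keep_mask(threshold=0.05):
    """After training: keep words whose signal-to-noise ratio reaches the threshold."""
    snr = z_mean ** 2 / np.exp(2.0 * z_log_sigma)
    return snr >= threshold

inputs = forward_inputs(np.array([0, 3, 2, 5]), rng)
mask = keep_mask()
pruned_embedding = embedding[:, mask]  # columns of dropped words are removed
```

With these illustrative values, words 2 and 4 fall below the threshold, so four of the six embedding columns remain.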

4.4. Experiments

The inventors conducted experiments with the LSTM architecture on two types of tasks: text classification and language modeling. Three models are compared: a baseline model without any regularization, the SparseVD model, and the SparseVD model with multiplicative weights for vocabulary sparsification (SparseVD-Voc) according to the present invention.

To measure the sparsity level of these models, the compression ratio of individual weights is computed as |w|/|w≠0|. Sparsifying the weights can lead not only to compression but also to acceleration of the RNN through group sparsity. The inventors therefore report the number of neurons remaining in all layers: the input layer (vocabulary), the embedding layer, and the recurrent layer. To compute this number for the vocabulary layer in SparseVD-Voc, the introduced variables z_v are used. For all other layers in SparseVD and SparseVD-Voc, a neuron is dropped if all weights connected to it are removed.
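The two quantities described above can be computed as in the following sketch. The matrix values and function names are hypothetical. As a simplification, only the columns of a single input→hidden matrix are inspected when counting remaining input neurons, whereas the description requires all weights connected to a neuron to be removed.

```python
import numpy as np

def compression_ratio(weights):
    """|w| / |w != 0|: total number of weights over the number of nonzero weights."""
    total = sum(w.size for w in weights)
    nonzero = sum(int(np.count_nonzero(w)) for w in weights)
    return total / nonzero

def remaining_inputs(w):
    """Count input neurons (columns) that still have at least one nonzero weight."""
    return int(np.any(w != 0, axis=0).sum())

# Hypothetical sparsified input->hidden matrix: 3 hidden units x 4 inputs.
w_xh = np.array([[0.5, 0.0,  0.0, 0.0],
                 [0.0, 0.0,  0.0, 0.0],
                 [0.2, 0.0, -0.1, 0.0]])

ratio = compression_ratio([w_xh])     # 12 weights, 3 of them nonzero
inputs_left = remaining_inputs(w_xh)  # columns 0 and 2 survive
```

Here the ratio is 4.0 and two input neurons remain; the function assumes at least one nonzero weight.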

The networks are optimized using [11]. The baseline networks overfit on all the analyzed tasks, so the inventors report results for them with early stopping. For all sparsified weights, log σ was initialized to -3. Weights with a signal-to-noise ratio below τ = 0.05 are removed. More details on the experimental setup are given in Appendix A.

4.4.1. Text Classification

The proposed approach according to the present invention was evaluated on two standard datasets for text classification: the IMDb dataset ([9]) for two-class classification and the AGNews dataset ([10]) for four-class classification. The inventors set aside 15% and 5% of the training data, respectively, for validation. For both datasets, a vocabulary of the 20,000 most frequent words was used.

The inventors used networks with one embedding layer of 300 units, one LSTM layer of 128/512 hidden units for IMDb/AGNews, and finally a fully connected layer applied to the last output of the LSTM. The embedding layer was initialized with word2vec ([15])/GloVe ([17]), and the SparseVD and SparseVD-Voc models were trained for 800/150 epochs on IMDb/AGNews.

The results are presented in Table 1. SparseVD achieves a very high compression ratio without a significant drop in quality. SparseVD-Voc increases the compression ratio further, still without a significant drop in accuracy. Such high compression ratios are achieved mainly through vocabulary sparsification: to classify a text, only a few important words need to be read from it. The words that remain after sparsification in the proposed models are mostly interpretable for the task (see Appendix B for the list of remaining words for IMDb).

| Task | Method | Accuracy, % | Compression | Vocabulary | Neurons (embedding-recurrent) |
|---|---|---|---|---|---|
| IMDb | Original | 84.1 | 1x | 20000 | 300-128 |
| IMDb | SparseVD | 85.1 | 1135x | 4611 | 16-17 |
| IMDb | SparseVD-Voc | 83.6 | 12985x | 292 | 1-8 |
| AGNews | Original | 90.6 | 1x | 20000 | 300-512 |
| AGNews | SparseVD | 88.8 | 322x | 5727 | 179-56 |
| AGNews | SparseVD-Voc | 89.2 | 469x | 2444 | 127-32 |

Table 1. Results for text classification tasks. Compression is |w|/|w≠0|. The last two columns show the number of neurons remaining in the input layer, embedding layer, and recurrent layer.

4.4.2. Language Modeling

The inventors evaluated the proposed models on character-level and word-level language modeling on the Penn Treebank corpus ([19]), using the training/validation/test split of [21]. This dataset has a vocabulary of 50 characters or 10,000 words.

For the character-level/word-level tasks, the inventors used networks with one LSTM layer of 1000/256 hidden units and a fully connected layer with softmax activation to predict the next character or word. The SparseVD and SparseVD-Voc models were trained for 250/150 epochs on the character-level/word-level tasks.

The results are presented in Table 2. To obtain these results, the local reparameterization trick (LRT) was applied to the last fully connected layer. In the language modeling experiments, LRT on the last layer accelerated training without harming the final result. The extreme compression ratios of the previous experiment were not reached here, but the models can still be compressed several times while achieving better quality than the baseline, owing to the regularization effect of SparseVD. The input vocabulary was not sparsified in the character-level task, because there are only 50 characters and all of them matter. In the word-level task, more than half of the words were dropped. However, since almost all words are important in language modeling, vocabulary sparsification makes the task harder for the network and leads to a drop in quality and in overall compression (the network needs more complex dynamics in the recurrent layer).

| Task | Method | Validation | Test | Compression | Vocabulary | Neurons h |
|---|---|---|---|---|---|---|
| Char PTB (bits-per-char) | Original | 1.498 | 1.454 | 1x | 50 | 1000 |
| Char PTB (bits-per-char) | SparseVD | 1.472 | 1.429 | 4.2x | 50 | 431 |
| Char PTB (bits-per-char) | SparseVD-Voc | 1.4584 | 1.4165 | 3.53x | 48 | 510 |
| Word PTB (perplexity) | Original | 135.6 | 129.5 | 1x | 10000 | 256 |
| Word PTB (perplexity) | SparseVD | 115.0 | 109.0 | 14.0x | 9985 | 153 |
| Word PTB (perplexity) | SparseVD-Voc | 126.3 | 120.6 | 11.1x | 4353 | 207 |

Table 2. Results for language modeling tasks. Compression is |w|/|w≠0|. The last two columns show the number of neurons remaining in the input and recurrent layers.

A. Experimental Setup

Initialization for text classification. The hidden→hidden weight matrices W^h are initialized orthogonally, and all other matrices are initialized uniformly using the method of [22].

The networks were trained using minibatches of size 128 and a learning rate of 0.0005.

Initialization for language modeling. All weight matrices of the networks were initialized orthogonally, and all biases were initialized to zero. The initial values of the hidden state and memory cells of the LSTM are not trainable and are equal to zero.
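The description does not say how the orthogonal initialization is computed. One common way to obtain an orthogonal matrix, shown here purely as an assumption, is the QR decomposition of a Gaussian random matrix. The function name `orthogonal_init` and the single 256 x 256 matrix are illustrative; an actual LSTM holds separate blocks for each gate.

```python
import numpy as np

def orthogonal_init(rows, cols, rng):
    """Return a (rows, cols) matrix with orthonormal columns (or rows) via QR."""
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)        # q has orthonormal columns
    q = q * np.sign(np.diag(r))   # sign fix makes the factorization unique
    return q if rows >= cols else q.T

rng = np.random.default_rng(0)
n_hid = 256                                  # word-level LSTM size from Section 4.4.2
w_hh = orthogonal_init(n_hid, n_hid, rng)    # one hidden->hidden block
bias = np.zeros(n_hid)                       # biases start at zero
h0 = np.zeros(n_hid)                         # non-trainable initial hidden state
c0 = np.zeros(n_hid)                         # non-trainable initial memory cell
```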

For the character-level task, the networks were trained on non-overlapping sequences of 100 characters in minibatches of size 64, using a learning rate of 0.002 and gradient clipping with a threshold of 1.

For the word-level task, the networks were unrolled for 35 steps. The final hidden states of the current minibatch were used as the initial hidden state of the subsequent minibatch (successive minibatches sequentially traverse the training set). The size of each minibatch is 32. The models were trained using a learning rate of 0.002 and gradient clipping with a threshold of 10.

B. List of Remaining Words in IMDb

SparseVD with multiplicative weights retained the following words in the IMDb task (sorted in descending order of frequency in the whole corpus):

start, oov, and, to, is, br, in, it, this, was, film, t, you, not, have, It, just, good, very, would, story, if, only, see, even, no, were, my, much, well, bad, will, great, first, most, make, also, could, too, any, then, seen, plot, acting, life, over, off, did, love, best, better, i, If, still, man, something, m, re, thing, years, old, makes, director, nothing, seems, pretty, enough, own, original, world, series, young, us, right, always, isn, least, interesting, bit, both, script, minutes, making, 2, performance, might, far, anything, guy, She, am, away, woman, fun, played, worst, trying, looks, especially, book, DVD, reason, money, actor, shows, job, 1, someone, true, wife, beautiful, left, idea, half, excellent, 3, nice, fan, let, rest, poor, low, try, classic, production, boring, wrong, enjoy, mean, No, instead, awful, stupid, remember, wonderful, often, become, terrible, others, dialogue, perfect, liked, supposed, entertaining, waste, His, problem, Then, worse, definitely, 4, seemed, lives, example, care, loved, Why, tries, guess, genre, history, enjoyed, heart, amazing, starts, town, favorite, car, today, decent, brilliant, horrible, slow, kill, attempt, lack, interest, strong, chance, wouldn, sometimes, except, looked, crap, highly, wonder, annoying, Oh, simple, reality, gore, ridiculous, hilarious, talking, female, episodes, body, saying, running, save, disappointed, 7, 8, OK, word, thriller, Jack, silly, cheap, Oscar, predictable, enjoyable, moving, Unfortunately, surprised, release, effort, 9, none, dull, bunch, comments, realistic, fantastic, weak, atmosphere, apparently, premise, greatest, believable, lame, poorly, NOT, superb, badly, mess, perfectly, unique, joke, fails, masterpiece, sorry, nudity, flat, Good, dumb, Great, D, wasted, unless, bored, Tony, language, incredible, pointless, avoid, trash, failed, fake, Very, Stewart, awesome, garbage, pathetic, genius, glad, neither, laughable, beautifully, excuse, disappointing, disappointment, outstanding, stunning, noir, lacks, gem, F, redeeming, thin, absurd, Jesus, blame, rubbish, unfunny, Avoid, irritating, dreadful, skip, racist, Highly, MST3K.

In conclusion, the RNN compression method (100) according to the present invention is described with reference to Fig. 1. It is assumed that a set of possible elements of input sequences is provided for the RNN input. The elements of this set may be words.

At step 110, the weights of the RNN are sparsified.

This sparsification includes performing an optimization to obtain a posterior distribution of the weights, which is approximated by a second distribution. The optimization uses a prior distribution of the weights, which is a first distribution. As stated above, a fully factorized log-uniform distribution is preferably used as the first distribution, and a fully factorized normal distribution is preferably used as the second distribution. Nevertheless, a person skilled in the art will appreciate that other suitable distributions may be used in this context. The optimization includes sampling the weights from the approximate posterior distribution. Note that each weight sample is the same for every time step within one input sequence.

The sparsification at step 110 further includes identifying weights and/or one or more groups of weights, each of which has an associated value below a predetermined threshold. The identified weights and/or the identified groups of weights are then removed from the RNN.

As noted above, the ratio of the squared mean to the variance is preferably used as said associated value, and the predetermined threshold is preferably set to 0.05. Nevertheless, a person skilled in the art will appreciate that other thresholds and values may be used in this context.
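The pruning criterion just described can be illustrated with a few numbers. The sketch below assumes each weight's posterior is stored as a mean and a log standard deviation; the sample values and the function names `snr` and `prune` are hypothetical.

```python
import math

def snr(mean, log_sigma):
    """Squared mean divided by variance of a factorized normal posterior."""
    return mean ** 2 / math.exp(2.0 * log_sigma)

def prune(means, log_sigmas, threshold=0.05):
    """Zero out every weight whose associated value is below the threshold."""
    return [m if snr(m, s) >= threshold else 0.0
            for m, s in zip(means, log_sigmas)]

means = [0.8, 0.01, -0.3, 0.002]
log_sigmas = [-3.0, -1.0, -2.0, -3.0]
pruned = prune(means, log_sigmas)  # the second and fourth weights are removed
```

A weight with a small mean survives only if its variance is smaller still, which is why the criterion uses the ratio rather than the magnitude of the mean alone.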

At step 120, first multiplicative variables are introduced for the elements of the RNN input set, as described above. Sparsification similar to that of step 110 is then applied to the first multiplicative variables. In particular, at step 120 the above-mentioned optimization is performed using said pair of prior and approximate posterior distributions for the first multiplicative variables; the first multiplicative variables whose associated value is below the predetermined threshold are then identified, and the elements of the set associated with the identified first multiplicative variables are removed from the RNN.

At step 130, second multiplicative variables are introduced for the input and hidden neurons of the RNN, as discussed above, and sparsification similar to that of step 110 is applied to these multiplicative variables. In particular, at step 130 the above-mentioned optimization is performed using said pair of prior and approximate posterior distributions for the second multiplicative variables. The second multiplicative variables whose associated value is below the predetermined threshold are then identified, and the input and hidden neurons of the RNN associated with the identified variables are removed from the RNN.

In accordance with the discussion above, the RNN may have a gated architecture, and such a gated architecture may be implemented as an LSTM layer of the RNN. It should nevertheless be emphasized that the proposed approach is, in principle, also applicable to RNNs without gates.

If the RNN has a gated architecture (for example, one based on LSTM), the proposed method 100 preferably further comprises step 140 (indicated by a dashed line in Fig. 1).

At step 140, third multiplicative variables are introduced for the preactivations of the gates and of the information flow in the LSTM layer of the RNN, as discussed above, and sparsification similar to that of step 110 is performed for these third multiplicative variables. In particular, at step 140 the above-mentioned optimization is performed using said pair of prior and approximate posterior distributions for the third multiplicative variables. Any gate is then made constant if the third multiplicative variable associated with it has an associated value below the predetermined threshold. Similarly, the information flow is made constant at step 140 if the third multiplicative variable associated with it has an associated value below the predetermined threshold.
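The following sketch shows why pruning such a variable makes a gate constant. It assumes the gate is computed as sigmoid(z · a + b), where a is the input-dependent preactivation, b is the bias, and z is the multiplicative variable. This form is an assumption for illustration; the names `gate`, `z_gate`, and `pruned` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gate(preactivation, bias, z_gate, pruned):
    """Gate value with a multiplicative variable on its preactivation.

    If the variable is pruned, the input-dependent part vanishes and the
    gate reduces to the constant sigmoid(bias).
    """
    scale = 0.0 if pruned else z_gate
    return sigmoid(scale * preactivation + bias)

bias = 2.0
preactivations = (-1.0, 0.5, 3.0)
active = [gate(a, bias, z_gate=0.9, pruned=False) for a in preactivations]
constant = [gate(a, bias, z_gate=0.9, pruned=True) for a in preactivations]
```

Once a gate is constant, its value no longer depends on the input or the hidden state, so the corresponding matrix products can be skipped in the forward pass.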

As noted above, the proposed method can be applied effectively to text classification and language modeling tasks. It should nevertheless be understood that it is also applicable to various other tasks for which recurrent neural networks are used, in particular machine translation and speech recognition. As stated above, the invention was tested on language modeling tasks, where the proposed methods provided a twofold compression of the corresponding model with a slight decrease in quality. Significant compression and acceleration of models can therefore be expected for other similar tasks.

In addition, a person skilled in the art will appreciate that the methods according to the invention described above can be implemented in practice using conventional software and computer technology. In particular, Fig. 2 shows a high-level block diagram of a conventional computing device (200) in which aspects of the present invention may be implemented. The computing device 200 comprises a data processing unit 210 and a data storage unit 220 connected to the data processing unit 210. The data processing unit 210 typically comprises one or more processors (CPUs), which may be general-purpose or special-purpose, off-the-shelf or custom-made. The data storage unit 220 typically comprises various computer-readable media and/or memory devices, both non-volatile (e.g., ROM, hard disks, flash drives, etc.) and volatile (e.g., various types of RAM, etc.).

In the context of the present invention, the data storage unit 220 comprises computer-readable storage media with the corresponding software recorded thereon. The software contains computer-executable instructions which, when executed by the data processing unit 210, cause it to perform the method according to the invention. Such software can be developed and deployed using known programming technologies and environments.

A person skilled in the art will appreciate that the computing device 200 involved in implementing the invention also includes other known hardware, such as input/output devices and communication interfaces, as well as basic software, such as an operating system, protocol stacks, and drivers, which may be either off-the-shelf or custom. The computing device 200 may be configured to communicate with other computing devices, infrastructures, and networks using known wired and/or wireless communication technologies.

It should be understood that the illustrated embodiments are merely preferred, and not the only possible, examples of the invention. Rather, the scope of the invention is defined by the following claims and their equivalents. The disclosure, together with the documents listed below, is incorporated herein by reference.

5. Cited Publications

[1] Amodei, Dario, Ananthanarayanan, Sundaram, Anubhai, Rishita, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[2] Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li. 2018. Learning intrinsic sparse structures within long short-term memory. In International Conference on Learning Representations.

[3] Chan, William, Jaitly, Navdeep, Le, Quoc V., and Vinyals, Oriol. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, 2016.

[4] Christos Louizos, Karen Ullrich, Max Welling. Bayesian compression for deep learning. In arXiv preprint arXiv:1705.08665, 2017.

[5] Meire Fortunato, Charles Blundell, and Oriol Vinyals. 2017. Bayesian recurrent neural networks. Computing Research Repository, arXiv:1704.02798.

[6] Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[7] Gal, Yarin and Ghahramani, Zoubin. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), 2016.

[8] Ha, David, Dai, Andrew, and Le, Quoc V. Hypernetworks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[9] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pp. 142-150, Stroudsburg, PA, USA. Association for Computational Linguistics.

[10] X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS).

[11] Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, 2015.

[12] Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

[13] Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational dropout and the local reparameterization trick. CoRR, abs/1506.02557, 2015.

[14] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, Dmitry Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In arXiv preprint arXiv:1705.07283, 2017.

[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111-3119.

[16] Le, Quoc V., Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941, 2015.

[17] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 14, pp. 1532-1543.

[18] Molchanov, Dmitry, Ashukha, Arsenii, and Vetrov, Dmitry. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 2017.

[19] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313-330.

[20] Narang, Sharan, Diamos, Gregory F., Sengupta, Shubho, and Elsen, Erich. Exploring sparsity in recurrent neural networks. CoRR, abs/1704.05119, 2017.

[21] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. 2011. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528-5531.

[22] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249-256, Chia Laguna Resort, Sardinia, Italy.

[23] Ren, Mengye, Kiros, Ryan, and Zemel, Richard S. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, 2015.

[24] Srivastava, Nitish. Improving neural networks with dropout. PhD thesis, University of Toronto, 2013.

[25] Tjandra, Andros, Sakti, Sakriani, and Nakamura, Satoshi. Compressing recurrent neural network with tensor train. CoRR, abs/1705.08052, 2017.

[26] Wang, Sida and Manning, Christopher. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[27] Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.

Claims

1. A computer-implemented method for compressing a recurrent neural network (RNN), comprising the steps of:

performing sparsification with respect to weights of the RNN, the sparsification comprising the steps of:

i) performing optimization to obtain a posterior distribution of the weights, wherein the posterior distribution of the weights is approximated by a second distribution, a prior distribution of the weights, which is a first distribution, is used in the optimization, and weights are generated from the approximated posterior distribution during the optimization, and

ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights from the RNN and/or removing the identified groups of weights from the RNN;

introducing first multiplicative variables for elements of an input set of possible elements of input sequences of the RNN;

performing sparsification with respect to the first multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and in step ii) first multiplicative variables are identified, each of which has an associated value less than said predetermined threshold, and the elements of said set that are associated with the identified first multiplicative variables are removed from the RNN;

introducing second multiplicative variables for input and hidden neurons of the RNN; and

performing sparsification with respect to the second multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and in step ii) second multiplicative variables are identified, each of which has an associated value less than said predetermined threshold, and the input and hidden neurons associated with the identified second multiplicative variables are removed from the RNN.

2. The method according to claim 1, wherein the first distribution is a fully factorized log-uniform distribution, and the second distribution is a fully factorized normal distribution.

3. The method according to claim 1, wherein each weight sample is identical for every time step within one input sequence.

4. The method according to claim 1, wherein the RNN has a gated architecture, the method further comprising the steps of:

introducing third multiplicative variables for preactivations of gates of the RNN; and

performing sparsification with respect to the third multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and in step ii) a gate is made constant if the third multiplicative variable associated with this gate has an associated value less than said predetermined threshold.

5. The method according to claim 4, wherein the gated architecture is implemented as an LSTM layer of the RNN.

6. The method according to claim 5, wherein

when introducing the third multiplicative variables, a third multiplicative variable is additionally introduced for the preactivation of the information flow in the LSTM layer, and

in step ii), the information flow is additionally made constant if the third multiplicative variable associated with this information flow has an associated value less than said predetermined threshold.

7. The method according to claim 1, wherein said associated value is the ratio of the square of the mean to the variance.

8. The method according to claim 1, wherein said predetermined threshold is 0.05.

9. The method according to claim 1, wherein the elements of said set are words, and the method is applicable to text classification or to language modeling in which said words are used.

10. A device for compressing a recurrent neural network (RNN) with a gated architecture, comprising:

one or more processors, and

one or more computer-readable storage media storing computer-executable instructions which, when executed by the one or more processors, cause the one or more processors to:

perform sparsification with respect to weights of the RNN, the sparsification comprising:

i) performing optimization to obtain a posterior distribution of the weights, wherein the posterior distribution of the weights is approximated by a fully factorized normal distribution, the optimization uses a prior distribution of the weights that is a fully factorized log-uniform distribution, and the optimization includes generating weights from the approximated posterior distribution, and

ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights from the RNN and/or removing the identified groups of weights from the RNN;

introduce first multiplicative variables for elements of an input set of possible elements of input sequences of the RNN;

perform sparsification with respect to the first multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and operation ii) comprises identifying first multiplicative variables, each of which has an associated value less than said predetermined threshold, and removing from the RNN the elements of said set that are associated with the identified first multiplicative variables;

introduce second multiplicative variables for input and hidden neurons of the RNN;

perform sparsification with respect to the second multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and operation ii) comprises identifying second multiplicative variables, each of which has an associated value less than said predetermined threshold, and removing from the RNN the input and hidden neurons associated with the identified second multiplicative variables;

introduce third multiplicative variables for preactivations of gates of the RNN; and

perform sparsification with respect to the third multiplicative variables, wherein operation i) comprises performing the optimization using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and operation ii) comprises making a gate constant if the third multiplicative variable associated with this gate has an associated value less than said predetermined threshold.

11. The device according to claim 10, wherein each weight sample is identical for every time step within one input sequence.

12. The device of claim 10, wherein said associated value is the ratio of the square of the mean to the variance.

13. The device according to claim 10, wherein said predetermined threshold is 0.05.

14. The device according to claim 10, wherein the gated architecture is implemented as an LSTM layer of the RNN.

15. The device according to claim 14, wherein

when introducing the third multiplicative variables, a third multiplicative variable is additionally introduced for the preactivation of the information flow in the LSTM layer, and

in operation ii), the information flow is additionally made constant if the third multiplicative variable associated with this information flow has an associated value less than said predetermined threshold.

16. The device according to claim 10, wherein the elements of said set are words, and the RNN is used for a task of text classification or of language modeling in which said words are used.

17. One or more computer-readable storage media storing computer-executable instructions which, when executed by one or more processors of a computing device, cause the one or more processors to perform operations for compressing a recurrent neural network (RNN) with a gated architecture that is implemented as an LSTM layer of the RNN, said operations comprising the steps of:

performing sparsification with respect to weights of the RNN, comprising the steps of:

i) performing optimization to obtain a posterior distribution of the weights, wherein the posterior distribution of the weights is approximated by a fully factorized normal distribution, the optimization uses a prior distribution of the weights that is a fully factorized log-uniform distribution, and the optimization includes generating weights from the approximated posterior distribution, and

ii) identifying weights and/or one or more groups of weights, each of which has an associated value less than a predetermined threshold, and removing the identified weights from the RNN and/or removing the identified groups of weights from the RNN;

performing sparsification with respect to the first multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the first multiplicative variables, and in step ii) first multiplicative variables are identified, each of which has an associated value less than said predetermined threshold, and the elements of said set that are associated with the identified first multiplicative variables are removed from the RNN;

introducing second multiplicative variables for input and hidden neurons of the RNN;

performing sparsification with respect to the second multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the second multiplicative variables, and in step ii) second multiplicative variables are identified, each of which has an associated value less than said predetermined threshold, and the input and hidden neurons associated with the identified second multiplicative variables are removed from the RNN;

introducing third multiplicative variables for preactivations of the gates and of the information flow in the LSTM layer of the RNN; and

performing sparsification with respect to the third multiplicative variables, wherein in step i) the optimization is performed using said prior distribution and said approximated posterior distribution for the third multiplicative variables, and in step ii):

a gate is made constant if the third multiplicative variable associated with this gate has an associated value less than said predetermined threshold, and

the information flow is made constant if the third multiplicative variable associated with this information flow has an associated value less than said predetermined threshold.

18. The computer-readable storage medium according to claim 17, wherein

said associated value is the ratio of the square of the mean to the variance,

said predetermined threshold is 0.05, and

each weight sample is identical for every time step within one input sequence.

19. The computer-readable storage medium according to claim 17, wherein the elements of said set are words, and the RNN is used for a task of text classification or of language modeling in which said words are used.
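
As a non-limiting illustration that forms no part of the claims, the thresholding recited in step ii) can be sketched as follows. The sketch assumes that each weight or multiplicative variable has a posterior mean and variance already obtained in step i), takes the associated value as the ratio of the squared mean to the variance, and uses the threshold 0.05. The function names and the numeric values are hypothetical and chosen only for this example.

```python
def associated_value(mean, variance):
    """Ratio of the squared posterior mean to the posterior variance."""
    return mean * mean / variance


def keep_mask(means, variances, threshold=0.05):
    """Return True for elements to keep and False for elements to remove.

    An element is removed when its associated value is less than the threshold.
    """
    return [associated_value(m, v) >= threshold
            for m, v in zip(means, variances)]


def apply_mask(means, mask):
    """Zero out removed elements; kept elements use their posterior mean."""
    return [m if keep else 0.0 for m, keep in zip(means, mask)]


# Hypothetical posterior parameters for four weights (illustrative values only).
means = [0.5, 0.01, -0.3, 0.001]
variances = [0.04, 0.01, 0.01, 0.0004]

mask = keep_mask(means, variances)
pruned = apply_mask(means, mask)
```

In this example the second and fourth weights have ratios well below 0.05 and are zeroed, while the first and third are kept. The same test could be applied to a multiplicative variable attached to a word, a neuron, or a gate preactivation, with removal of the corresponding element (or fixing of the gate) in place of zeroing a single weight.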