WO2023085852A1

WO2023085852A1 - Deep neural network training device and method for executing statistical regularization

Info

Publication number: WO2023085852A1
Application number: PCT/KR2022/017760
Authority: WO
Inventors: 이재진; 김예하; 정우근
Original assignee: 서울대학교산학협력단
Priority date: 2021-11-11
Filing date: 2022-11-11
Publication date: 2023-05-19

Abstract

The present invention relates to a deep neural network training device and method for executing regularization and, more specifically, to a deep neural network training device and method for executing statistical regularization, wherein, rather than a random neuron being removed, a neuron corresponding to an anomaly value is selected from among output values of an activation function to increase the efficiency of regularization for the number of removed neurons and preserve meaningful information better in a training phase to increase training performance.

Description

Deep neural network training apparatus and method for performing statistics-based regularization

The present invention relates to a deep neural network training apparatus and method that perform regularization, and more particularly to a deep neural network training apparatus and method that perform statistics-based regularization, in which, instead of removing neurons at random, neurons corresponding to outliers among the output values of an activation function are selected for removal, thereby increasing the efficiency of regularization relative to the number of removed neurons and better preserving meaningful information in the training phase to improve training performance.

Deep learning is a technology that solves various problems using a deep neural network (DNN) that mimics the behavior of the human brain. A deep neural network identifies characteristics of training data in a training phase and executes an operation based on a learned pattern in an inference phase.

That is, when developing or applying a deep learning application, the model is trained using training data, the generalization performance of the network is monitored with validation data during training, and after training is completed, the trained model is used to execute inference in actual use.

In the process of training a deep neural network (DNN), if the training data is insufficient relative to the number of parameters of the network, overfitting may occur, in which the network shows high accuracy on the training data but not on new data or validation data.

The surest way to solve this phenomenon is to secure more training data, but in a situation where the number of DNN parameters is increasing exponentially, collecting a sufficient amount of training data is difficult because of the time and financial cost.

The overfitting problem is a representative factor that reduces the accuracy of DNN, and it greatly hinders learning depending on the type of DNN and learning/inference data. Several regularization techniques have been developed to solve this problem, but they do not completely solve the problem.

Meanwhile, the activation function most widely used in DNNs is ReLU. After the ReLU function was proposed, activation functions such as GELU, SiLU, and Mish, which have a regularization effect that prevents overfitting in a manner similar to Dropout, have been proposed. Among these, the GELU function is applied to BERT and GPT-3, which are widely used in the field of natural language processing. However, although these functions have a regularization effect, the effect is insufficient, so they are still used together with Dropout.
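For reference, the activation functions named above can be sketched as follows. This is a minimal illustration using the commonly published definitions (GELU in its exact error-function form), not code taken from the specification; the function names are chosen here for illustration, and the naive formulas are intended for moderate input magnitudes only.

```python
import math


def relu(x):
    # ReLU: passes positive inputs through, outputs 0 otherwise.
    return max(0.0, x)


def gelu(x):
    # GELU (exact form): x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    # SiLU: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))


def mish(x):
    # Mish: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
    return x * math.tanh(math.log1p(math.exp(x)))
```

Unlike ReLU, the other three functions are smooth and produce small negative outputs for some negative inputs.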

The dropout technique removes specific neurons with a certain probability during training, thereby reducing the number of parameters whose values are updated in the backward computation while training proceeds.

In general, the dropout technique is implemented in the form of a dropout layer. The dropout layer is a layer that receives N neurons as input and outputs N neurons as output. At every training iteration, some of the input/output neuron pairs are selected with a certain probability and excluded from the learning process. The output neuron of the excluded neuron pair outputs 0 regardless of the input value (Forward computation), and during the gradient propagation process, the gradient value of the input neuron is delivered as 0 regardless of the gradient value of the output neuron (Backward computation). Dropout technology is currently widely used in the field of deep learning, and many popular DNN models include dropout layers.
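The forward and backward behavior of a dropout layer described above can be sketched as follows. This is a minimal illustration only; the class name `DropoutLayer` and its parameters are hypothetical. Common implementations additionally rescale the surviving outputs by 1/(1 - p), which the description above does not mention and which is therefore omitted here.

```python
import random


class DropoutLayer:
    """Receives N values and outputs N values, zeroing a random subset."""

    def __init__(self, drop_prob, seed=None):
        if not 0.0 <= drop_prob < 1.0:
            raise ValueError("drop_prob must be in [0, 1)")
        self.drop_prob = drop_prob
        self.rng = random.Random(seed)
        self.mask = None

    def forward(self, inputs):
        # A new mask is drawn at every training iteration:
        # 0.0 marks an excluded neuron pair, 1.0 a kept one.
        self.mask = [
            0.0 if self.rng.random() < self.drop_prob else 1.0
            for _ in inputs
        ]
        # An excluded output neuron outputs 0 regardless of its input.
        return [x * m for x, m in zip(inputs, self.mask)]

    def backward(self, grad_outputs):
        # An excluded input neuron receives gradient 0 regardless of
        # the gradient arriving at the corresponding output neuron.
        return [g * m for g, m in zip(grad_outputs, self.mask)]
```

The same mask is reused in the backward pass so that a neuron dropped in the forward computation also contributes no gradient.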

Accordingly, there is a need for a deep neural network system capable of achieving higher accuracy in a faster time.

Therefore, the present invention has been proposed to solve the above-mentioned problems, and an object of the present invention is to provide a deep neural network training apparatus and method that perform statistics-based regularization, in which, instead of removing neurons at random, the activation function is modified so that the neurons to be removed are selected more efficiently.

Objects of the present invention are not limited to those mentioned above, and other objects not mentioned above will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a deep neural network training apparatus that performs statistics-based regularization according to an embodiment of the present invention includes an input unit for inputting training data for training a deep neural network (DNN) model, and a learning processor for training the DNN model using the training data, wherein the DNN model includes a statistics-based regularization layer between a first neural network layer and a second neural network layer, and the statistics-based regularization layer determines neurons to be excluded from the first neural network layer based on statistical information on the output values of the activation function for the first neural network layer, and outputs the output values of the activation function corresponding to the remaining neurons to the second neural network layer.

A method for training a statistics-based regularization layer according to an aspect of the present invention may include: receiving training data; obtaining output values of an activation function of a first neural network layer; determining outliers by performing statistical normalization based on the output values; removing the output values determined to be outliers; and outputting the remaining output values, other than the outliers, to a second neural network layer.
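The steps of this method can be sketched as follows. The passage above does not state which statistic or cutoff defines an outlier, so this sketch assumes a z-score criterion with a hypothetical `z_threshold` parameter, and it assumes that "removing" an output value means replacing it with 0, as a dropout layer does. Both choices, and the function name, are illustrative assumptions rather than the claimed implementation.

```python
import statistics


def statistical_regularization(activations, z_threshold=2.0):
    """Zero out activation outputs whose z-score magnitude exceeds a cutoff.

    Returns the filtered outputs and a keep-mask (True = kept).
    The z-score criterion and the threshold are assumptions.
    """
    mean = statistics.fmean(activations)
    std = statistics.pstdev(activations)
    if std == 0.0:
        # All values identical: nothing can be an outlier.
        return list(activations), [True] * len(activations)
    # Statistical normalization: distance from the mean in std units.
    keep = [abs(a - mean) / std <= z_threshold for a in activations]
    # Outliers are removed (set to 0); remaining values pass through
    # unchanged to the second neural network layer.
    outputs = [a if k else 0.0 for a, k in zip(activations, keep)]
    return outputs, keep
```

For example, given nine activation outputs between 0.1 and 0.2 and one output of 5.0, only the value 5.0 exceeds the cutoff and is removed.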

According to the deep neural network training apparatus and method for performing statistics-based regularization according to an embodiment of the present invention, instead of removing random neurons, neurons corresponding to outliers are removed using statistical information on the data in each layer. Because of the nature of deep neural network operations, outliers have a greater impact than other values on decision-making in the DNN training and inference process; by removing them, the efficiency of regularization relative to the number of removed neurons is increased, and meaningful information is better preserved in the training phase, so that training performance can be improved.

Effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram showing the configuration of a neural network device according to an embodiment of the present invention.

FIG. 2 is a diagram showing the structure of an artificial neural network model including a statistics-based regularization layer according to an embodiment of the present invention.

FIG. 3 is an exemplary diagram for explaining a statistics-based regularization layer according to an embodiment of the present invention.

FIG. 4 schematically illustrates the concept of an activation function and a transfer function in a perceptron.

FIG. 5 shows an example of a graph comparing performance between a statistics-based regularization layer and dropout according to an embodiment of the present invention.

FIG. 6 is a flowchart for explaining a method of training a statistics-based regularization layer according to an embodiment of the present invention.

Objects and effects of the present invention, and technical configurations for achieving them will become clear with reference to embodiments to be described later in detail in conjunction with the accompanying drawings. In describing the present invention, if it is determined that a detailed description of a known function or configuration may unnecessarily obscure the gist of the present invention, the detailed description will be omitted. In addition, the terms described later are terms defined in consideration of the structure, role, and function in the present invention, which may vary according to the intention or custom of a user or operator.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms. These embodiments are provided only to complete the disclosure of the present invention and to fully inform those skilled in the art of the scope of the invention, and the present invention is defined only by the scope of the claims. Therefore, the definition should be made based on the contents throughout this specification.

Throughout the specification, when a certain component is said to "include", it means that it may further include other components without excluding other components unless otherwise stated.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by the terms. The above terms are used only for the purpose of distinguishing one component from another.

Artificial intelligence (AI) is a field of computer science and information technology that studies how to enable computers to perform the thinking, learning, and self-development that human intelligence is capable of; that is, it means enabling computers to imitate intelligent human behavior.

Also, artificial intelligence does not exist by itself, but is directly or indirectly related to other fields of computer science. In particular, in modern times, attempts to introduce artificial intelligence elements in various fields of information technology and use them to solve problems in those fields are being actively made.

Machine learning is a branch of artificial intelligence, a field of study that gives computers the ability to learn without being explicitly programmed.

Specifically, machine learning is a technology that studies and builds a system that learns based on empirical data, makes predictions, and improves its own performance, as well as algorithms for it. Machine learning algorithms build specific models to make predictions or decisions based on input data, rather than executing rigidly defined, static program instructions.

In machine learning, many machine learning algorithms have been developed regarding how to classify data. Representative examples include decision trees, Bayesian networks, support vector machines (SVMs), and artificial neural networks (ANNs).

A decision tree is an analysis method that performs classification and prediction by charting decision-making rules in a tree structure.

A Bayesian network is a model that expresses probabilistic relationships (conditional independence) among multiple variables in a graph structure. Bayesian networks are suitable for data mining through unsupervised learning.

A support vector machine is a supervised learning model for pattern recognition and data analysis, and is mainly used for classification and regression analysis.

An artificial neural network is an information processing system in which a number of neurons, called nodes or processing elements, are connected in the form of a layer structure by modeling the operating principle of biological neurons and the connection relationship between neurons.

An artificial neural network is a model used in machine learning and cognitive science, and is a statistical learning algorithm inspired by biological neural networks (particularly the brain in the central nervous system of animals).

Specifically, an artificial neural network may refer to an overall model having a problem-solving ability by changing synaptic coupling strength through learning of artificial neurons (nodes) formed by synapse coupling.

The term artificial neural network may be used interchangeably with the term neural network.

An artificial neural network may include a plurality of layers, and each of the layers may include a plurality of neurons. In addition, the artificial neural network may include neurons and synapses connecting neurons.

An artificial neural network can generally be defined by the following three factors: (1) the connection pattern between neurons in different layers, (2) a learning process that updates the weights of the connections, and (3) an activation function that generates an output value from the weighted sum of the inputs received from the previous layer.

The artificial neural network may include network models such as a deep neural network (DNN), a recurrent neural network (RNN), a bidirectional recurrent deep neural network (BRDNN), a multilayer perceptron (MLP), and a convolutional neural network (CNN), but is not limited thereto.

Artificial neural networks are classified into single-layer neural networks and multi-layer neural networks according to the number of layers.

A typical single-layer neural network consists of an input layer and an output layer.

In addition, a general multilayer neural network is composed of an input layer, one or more hidden layers, and an output layer.

The input layer is a layer that accepts external data, and the number of neurons in the input layer is the same as the number of input variables. The hidden layer is located between the input layer and the output layer. The output layer receives a signal from the hidden layer and outputs an output value based on the received signal. The input signals between neurons are each multiplied by the corresponding connection strength (weight) and then summed; if this sum is greater than the neuron's threshold, the neuron is activated and outputs the output value obtained through the activation function.
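The behavior of a single neuron described above can be sketched as follows. The function name, the choice of `tanh` as the activation function, and the output of 0 for a non-activated neuron are illustrative assumptions.

```python
import math


def neuron_output(inputs, weights, threshold, activation=math.tanh):
    # Multiply each input signal by its connection strength and sum.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # The neuron is activated only if the sum exceeds its threshold.
    if weighted_sum > threshold:
        return activation(weighted_sum)
    # Assumption: a non-activated neuron outputs 0.
    return 0.0
```

In practice the threshold is often folded into a bias term added to the weighted sum, which corresponds to the bias parameter mentioned later in this description.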

Meanwhile, a deep neural network including a plurality of hidden layers between an input layer and an output layer may be a representative artificial neural network implementing deep learning, which is a type of machine learning technology.

The artificial neural network may be trained using training data. Here, learning may refer to a process of determining the parameters of an artificial neural network using training data in order to achieve a goal such as classification, regression, or clustering of input data. Representative examples of the parameters of an artificial neural network include a weight assigned to a synapse and a bias applied to a neuron.

The artificial neural network learned from the training data may classify or cluster input data according to patterns of the input data.

Meanwhile, an artificial neural network trained using training data may be referred to as a trained model in this specification.

Next, the learning method of the artificial neural network is explained.

Learning methods of artificial neural networks can be largely classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning is a method of machine learning to infer a function from training data.

Among the inferred functions, outputting a continuous value is called regression analysis, and predicting and outputting a class of an input vector is called classification.

In supervised learning, an artificial neural network is trained with labels for training data given.

Here, the label may mean a correct answer (or a result value) that the artificial neural network should infer when training data is input to the artificial neural network.

In this specification, when training data is input, an answer (or a result value) to be inferred by an artificial neural network is referred to as a label or labeling data.

Also, in this specification, setting labels on training data for learning of an artificial neural network is referred to as labeling the training data with labeling data.

In this case, training data and labels corresponding to the training data constitute one training set, and may be input to the artificial neural network in the form of a training set.

Meanwhile, the training data represents a plurality of features, and labeling the training data with a label may mean that a label is attached to a feature represented by the training data. In this case, the training data may represent the characteristics of the input object in the form of a vector.

The artificial neural network may infer a function for a relation between training data and labeling data using training data and labeling data. In addition, parameters of the artificial neural network may be determined (optimized) through evaluation of the function inferred from the artificial neural network.

Unsupervised learning is a type of machine learning in which labels are not given to the training data.

Specifically, unsupervised learning may be a learning method for learning an artificial neural network to find and classify a pattern in training data itself rather than an association between training data and a label corresponding to the training data.

Examples of unsupervised learning include clustering or independent component analysis.

Examples of artificial neural networks using unsupervised learning include a generative adversarial network (GAN) and an autoencoder (AE).

A generative adversarial network is a machine learning method in which two different artificial intelligences, a generator and a discriminator, compete to improve performance.

In this case, the generator is a model that creates new data and can generate new data based on original data.

In addition, the discriminator is a model that recognizes data patterns and can play a role in discriminating whether input data is original data or new data generated by a generator.

The generator learns by receiving the data that failed to deceive the discriminator, and the discriminator learns by receiving from the generator the data that deceived it. Accordingly, the generator can evolve to deceive the discriminator as well as possible, and the discriminator can evolve to distinguish well between the original data and the data generated by the generator.

An autoencoder is a neural network that aims to reproduce the input itself as an output.

An autoencoder includes an input layer, at least one hidden layer, and an output layer.

In this case, since the number of nodes in the hidden layer is smaller than the number of nodes in the input layer, the dimensionality of data is reduced, and compression or encoding is performed accordingly.

Also, the data output from the hidden layer goes into the output layer. In this case, since the number of nodes in the output layer is greater than the number of nodes in the hidden layer, the dimensionality of data increases, and accordingly, decompression or decoding is performed.

On the other hand, the autoencoder adjusts the connection strength of neurons through learning, so that input data is expressed as hidden layer data.

In the hidden layer, information is expressed with fewer neurons than in the input layer, and being able to reproduce input data as an output may mean that the hidden layer discovered and expressed a hidden pattern from the input data.
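The change in dimensionality through an autoencoder can be sketched as follows. The layer sizes, the random untrained weights, and the omission of activation functions and training are simplifying assumptions; the sketch shows only that the hidden layer holds fewer values than the input and that the output layer restores the original dimension.

```python
import random


def linear(weights, vector):
    # One fully connected layer without bias: one output per weight row.
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM = 6, 2  # hypothetical sizes: hidden < input

encoder_weights = random_matrix(HIDDEN_DIM, INPUT_DIM, rng)  # compress
decoder_weights = random_matrix(INPUT_DIM, HIDDEN_DIM, rng)  # decompress


def autoencode(x):
    code = linear(encoder_weights, x)                # encoding: 6 -> 2 values
    reconstruction = linear(decoder_weights, code)   # decoding: 2 -> 6 values
    return code, reconstruction
```

Training would adjust both weight matrices so that the reconstruction approaches the input.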

As one technique of semi-supervised learning, the label of unlabeled training data is inferred and learning is then performed using the inferred label. This technique can be useful when the cost of labeling is high.

Reinforcement learning is the theory that if an agent is given an environment in which it can judge what action to take every moment, it can find the best way through experience without data.

Reinforcement learning can be performed mainly by Markov Decision Process (MDP).

The Markov decision process can be described as follows: first, an environment is given that contains the information the agent needs to take its next action; second, how the agent will behave in that environment is defined; third, a reward is given if the agent does well; and fourth, the optimal policy is derived by repeating experience until the future reward reaches its highest point.

The structure of an artificial neural network is specified by the model configuration, activation function, loss function or cost function, learning algorithm, optimization algorithm, and the like. Hyperparameters are set in advance before learning, and model parameters are then set through learning, so that the contents of the network can be specified.

For example, factors determining the structure of an artificial neural network may include the number of hidden layers, the number of hidden nodes included in each hidden layer, an input feature vector, a target feature vector, and the like.

Hyperparameters include various parameters that must be initially set for learning, such as initial values of model parameters. And, the model parameters include several parameters to be determined through learning.

For example, hyperparameters may include an initial value of weight between nodes, an initial value of bias between nodes, the number of learning iterations, a learning rate, and the like. In addition, model parameters may include weights between nodes, biases between nodes, and the like.

The loss function may be used as an index (reference) for determining optimal model parameters in the learning process of an artificial neural network. Learning in an artificial neural network means a process of manipulating model parameters to reduce a loss function, and the purpose of learning can be seen as determining model parameters that minimize a loss function.

The loss function may mainly use mean squared error (MSE) or cross entropy error (CEE), but the present invention is not limited thereto.

Cross entropy error can be used when the correct answer label is one-hot encoded. One-hot encoding is an encoding method in which the correct answer label value is set to 1 only for neurons corresponding to the correct answer, and the correct answer label value is set to 0 for neurons with no correct answer.
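The two loss functions and one-hot encoding mentioned above can be sketched as follows. The function names and the small constant `eps`, added to avoid taking the logarithm of 0, are illustrative assumptions.

```python
import math


def one_hot(correct_index, num_classes):
    # 1 only for the neuron corresponding to the correct answer.
    return [1.0 if i == correct_index else 0.0 for i in range(num_classes)]


def mean_squared_error(predictions, targets):
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def cross_entropy_error(predictions, targets, eps=1e-12):
    # With a one-hot target only the correct-class term is nonzero,
    # so the result is -log of the probability given to the correct class.
    return -sum(t * math.log(p + eps) for p, t in zip(predictions, targets))
```

With a one-hot label, the cross entropy error depends only on the probability the network assigns to the correct class.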

In machine learning or deep learning, learning optimization algorithms can be used to minimize the loss function; these include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, NAG (Nesterov Accelerated Gradient), Adagrad, AdaDelta, RMSProp, Adam, and Nadam.

Gradient descent is a technique that adjusts model parameters in the direction that reduces the value of the loss function by considering the slope of the loss function at the current state.

The direction in which model parameters are adjusted is called a step direction, and the size to be adjusted is called a step size.

In this case, the step size may mean a learning rate.

In the gradient descent method, a gradient may be obtained by partially differentiating the loss function with respect to each model parameter, and the model parameters may be updated by changing them by the learning rate in the direction, indicated by the obtained gradient, that reduces the loss function.
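The update rule described above can be sketched as follows. The quadratic loss used in `example_gradient` is a hypothetical example chosen so that the minimum is known in advance; the function names are likewise illustrative.

```python
def gradient_descent(grad_fn, params, learning_rate, steps):
    params = list(params)
    for _ in range(steps):
        grads = grad_fn(params)
        # Step against the gradient, scaled by the learning rate (step size).
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


def example_gradient(params):
    # Partial derivatives of loss(w0, w1) = (w0 - 3)^2 + (w1 + 1)^2.
    w0, w1 = params
    return [2.0 * (w0 - 3.0), 2.0 * (w1 + 1.0)]
```

Calling `gradient_descent(example_gradient, [0.0, 0.0], 0.1, 200)` moves the parameters toward the minimum of the example loss at (3, -1).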

Stochastic gradient descent is a technique that increases the frequency of gradient descent by dividing training data into mini-batches and performing gradient descent for each mini-batch.
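Mini-batch updating as described above can be sketched as follows. The model (a single slope `w` fitted to data of the form y = w * x), the squared-error loss, and all parameter values are hypothetical choices made for illustration.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.01, batch_size=2, epochs=20, seed=0):
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batch split every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            # One parameter update per mini-batch, not per full data pass.
            w -= learning_rate * grad
    return w
```

With eight samples and a batch size of two, the parameter is updated four times per pass over the data instead of once.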

Adagrad, AdaDelta, and RMSProp are techniques that increase optimization accuracy by adjusting the step size in SGD. In SGD, momentum and NAG are techniques that increase optimization accuracy by adjusting the step direction. Adam is a technique that increases optimization accuracy by adjusting the step size and step direction by combining momentum and RMSProp. Nadam is a technique that increases optimization accuracy by adjusting the step size and step direction by combining NAG and RMSProp.
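The difference between adjusting the step direction and adjusting the step size can be sketched with single-parameter update rules for momentum and Adam. The formulas follow the commonly published forms of these optimizers; the default values and the example gradient function are illustrative assumptions.

```python
import math


def momentum_step(w, grad, velocity, learning_rate=0.05, momentum=0.9):
    # Step direction: a running velocity accumulates past gradients.
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


def adam_step(w, grad, m, v, t, learning_rate=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # m: momentum-style average of gradients (step direction).
    m = beta1 * m + (1.0 - beta1) * grad
    # v: RMSProp-style average of squared gradients (step size).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages; t starts at 1.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - learning_rate * m_hat / (math.sqrt(v_hat) + eps), m, v


def example_grad(w):
    # Gradient of the hypothetical loss (w - 3)^2.
    return 2.0 * (w - 3.0)
```

Both rules keep state between iterations (the velocity, or the two moving averages), which the caller passes back in at each step.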

The learning speed and accuracy of an artificial neural network are characterized by being largely dependent on the hyperparameters as well as the structure of the artificial neural network and the type of learning optimization algorithm. Therefore, in order to obtain a good learning model, it is important to set appropriate hyperparameters as well as to determine an appropriate artificial neural network structure and learning algorithm.

Typically, hyperparameters are experimentally set to various values to train the artificial neural network, and as a result of learning, the optimal values are set to provide stable learning speed and accuracy.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a neural network device 100 according to an embodiment of the present invention.

The neural network device 100 is a device that can perform machine learning using learning data, and may include a device that learns using a model composed of an artificial neural network.

That is, the neural network device 100 may be configured to receive, classify, store, and output information to be used for data mining, data analysis, intelligent decision making, and machine learning algorithms. Here, the machine learning algorithm may include a deep learning algorithm.

That is, the neural network device 100 can communicate with at least one external device (not shown) or terminal (not shown), and can derive results by analyzing or learning data on behalf of, or in support of, the external device. Here, supporting another device may mean distributing computing power through distributed processing.

The neural network device 100 may be any of a variety of devices for training an artificial neural network; it may generally mean a server, and may be referred to as a neural network learning device or a neural network learning server.

In particular, the neural network device 100 may be implemented as a single server, a plurality of server sets, a cloud server, or a combination thereof.

That is, a plurality of neural network devices 100 may be configured to form a neural network learning device set (or a cloud server), and at least one neural network device 100 included in the neural network learning device set may derive results by analyzing or learning data through distributed processing.

The neural network device 100 may transmit a model learned by machine learning or deep learning to an external device (not shown) periodically or upon request.

Referring to FIG. 1, the neural network device 100 may include a communication unit 110, an input unit 120, a memory 130, a learning processor 140, a power supply unit 150, and a processor 160.

The communication unit 110 may refer to a configuration including a wireless communication unit (not shown) and an interface unit (not shown). That is, the communication unit 110 may transmit/receive data with other devices through wired/wireless communication or an interface.

The input unit 120 may obtain training data for model learning or input data for obtaining an output using the learned model.

The input unit 120 may obtain raw input data. In this case, the learning processor 140 or the processor 160 may preprocess the acquired data to generate training data or preprocessed input data that can be input to model learning.

In this case, the pre-processing of the input data performed by the input unit 120 may mean extracting input features from the input data.

Also, the input unit 120 may acquire data by receiving data through the communication unit 110 .

The memory 130 may store a model learned by the learning processor 140 or the neural network device 100 .

At this time, the memory 130 may store the learned model by dividing it into a plurality of versions according to the learning time or learning progress, as needed.

In this case, the memory 130 may store input data obtained from the input unit 120, learning data (or training data) used for model learning, and a learning history of the model.

At this time, the input data stored in the memory 130 may be not only processed data suitable for model learning, but also unprocessed input data itself.

The memory 130 may include a model storage unit 131 and a database 132 and the like.

The model storage unit 131 stores a neural network model (or artificial neural network 131a) that is being trained or has been trained by the learning processor 140, and stores the updated model when the model is updated through learning. At this time, the model storage unit 131 may store the learned model by dividing it into a plurality of versions according to the learning time or learning progress, if necessary.

The artificial neural network 131a shown in FIG. 1 is only one example of an artificial neural network including a plurality of hidden layers, and the artificial neural network of the present invention is not limited thereto.

The artificial neural network 131a may be implemented as hardware, software, or a combination of hardware and software. When part or all of the artificial neural network 131a is implemented as software, one or more instructions constituting the artificial neural network 131a may be stored in the memory 130 .

The database 132 may store input data obtained from the input unit 120, learning data (or training data) used for model learning, and a learning history of the model.

The input data stored in the database 132 may be processed data suitable for model learning as well as unprocessed input data itself.

The learning processor 140 may train the artificial neural network 131a using training data or a training set.

The learning processor 140 may train the artificial neural network 131a by directly acquiring input data that the processor 160 has preprocessed after it was obtained through the input unit 120, or by acquiring preprocessed input data stored in the database 132.

Specifically, the learning processor 140 may determine optimized model parameters of the artificial neural network 131a by iteratively training the artificial neural network 131a using various learning techniques described above.

In the present specification, an artificial neural network whose parameters are determined by learning using learning data may be referred to as a learning model or a learned model.

At this time, the learning model may infer a result value while loaded in the neural network device 100, or may be transmitted to and installed in another device such as a terminal or an external device through the communication unit 110.

Also, when the learning model is updated, the updated learning model may be transmitted to and installed in another device such as a terminal or an external device through the communication unit 110 .

Also, the learning model may be used to infer result values for new input data other than learning data.

Learning processor 140 may be configured to receive, classify, store, and output information to be used for data mining, data analysis, intelligent decision making, and machine learning algorithms and techniques.

The learning processor 140 may include a memory integrated or implemented in the neural network device 100 . In some embodiments, learning processor 140 may be implemented using memory 130 .

Alternatively or additionally, the learning processor 140 may be implemented using memory maintained in a cloud computing environment, or other remote memory location accessible by the terminal via a communication scheme such as a network.

The learning processor 140 may typically be configured to store data in one or more databases in order to identify, index, categorize, manipulate, store, retrieve, and output data for use in supervised or unsupervised learning, data mining, predictive analytics, or other machines. Here, the database may be implemented using the memory 130, a memory maintained in a cloud computing environment, or another remote memory location accessible by the terminal through a communication method such as a network.

Information stored in learning processor 140 may be used by processor 160 using any of a variety of different types of data analysis algorithms and machine learning algorithms.

Examples of such algorithms include k-nearest neighbor systems, fuzzy logic, neural networks, Boltzmann machines, vector quantization, pulsed neural networks, support vector machines, maximum margin classifiers, hill climbing, inductive logic systems, Bayesian networks, Petri nets (e.g., finite state machines, Mealy machines, Moore finite state machines), classifier trees (e.g., perceptron trees, support vector trees, Markov trees, decision tree forests, random forests), interpretation models and systems, artificial fusion, sensor fusion, image fusion, reinforcement learning, augmented reality, pattern recognition, automated planning, and the like.

The processor 160 may determine or predict at least one executable operation of the neural network device 100 based on information determined or generated using data analysis and machine learning algorithms. To this end, the processor 160 may request, search, receive, or utilize data of the learning processor 140, and may control the neural network device 100 to execute a predicted operation, or an operation determined to be desirable, among the at least one executable operation.

The processor 160 may perform various functions that implement intelligent emulation (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems). These functions can be applied to various types of systems (e.g., fuzzy logic systems), including adaptive systems, machine learning systems, and artificial neural networks.

The processor 160 may also include sub-modules that enable operations involving speech and natural language processing, such as an I/O processing module, an environmental conditions module, a speech-to-text (STT) processing module, a natural language processing module, a workflow processing module, and a service processing module.

Each of these submodules may have access to one or more systems or data and models in the terminal, or a subset or superset thereof. In addition, each of these sub-modules may provide various functions including vocabulary index, user data, workflow model, service model and automatic speech recognition (ASR) system.

In other embodiments, processor 160 or other aspects of neural network device 100 may be implemented as sub-modules, systems, or data and models.

In some examples, based on data from learning processor 140, processor 160 may be configured to detect and sense a requirement based on a user's intent or a contextual condition expressed as user input or natural language input.

Processor 160 may actively derive and obtain information necessary to fully determine a requirement based on contextual conditions or user intent. For example, processor 160 may actively derive information needed to determine requirements by analyzing historical data including historical inputs and outputs, pattern matching, unambiguous words, input intent, and the like.

Processor 160 may determine a task flow for executing a function that responds to a request based on context conditions or user intent.

The processor 160 may be configured to collect, sense, extract, detect, and/or receive signals or data used in data analysis and machine learning tasks through one or more sensing components in the terminal, in order to collect information for processing and storage in the learning processor 140.

Information collection may include sensing information through a sensor, extracting information stored in the memory 130, or receiving information from an external terminal, entity, or external storage device through a communication means.

The processor 160 may collect usage history information from the neural network device 100 and store it in the memory 130 .

Processor 160 may use stored usage history information and predictive modeling to determine the best match for executing a particular function.

The processor 160 may receive image information (or a corresponding signal), audio information (or a corresponding signal), data, or user input information from the input unit.

The processor 160 may collect information in real time, process or classify the information (e.g., into a knowledge graph, command policy, personalized database, conversation engine, etc.), and store the processed information in the memory 130 or the learning processor 140.

When the operation of the neural network device 100 is determined based on data analysis and machine learning algorithms and techniques, the processor 160 may control components of the neural network device 100 to execute the determined operation. Also, the processor 160 may perform the determined operation by controlling the neural network device 100 according to the control command.

When a specific operation is performed, the processor 160 may analyze history information representing the execution of the specific operation through data analysis and machine learning algorithms and techniques, and may update previously learned information based on the analyzed information.

Accordingly, processor 160, in conjunction with learning processor 140, may improve the accuracy of data analysis and future performance of machine learning algorithms and techniques based on the updated information.

The power supply unit 150 receives external power and internal power under the control of the processor 160 and supplies power to each component included in the neural network device 100.

In addition, the power supply unit 150 includes a battery, and the battery may be a built-in battery or a replaceable battery.

In detail, a method for learning the statistics-based normalization layer of the artificial neural network model learned by the learning processor 140 will be described.

Referring to FIG. 2 , the artificial neural network model may include a first neural network layer 210 , a statistics-based normalization layer 220 and a second neural network layer 230 .

The first and second neural network layers 210 and 230 are the most basic layers of DNN, and nodes (neurons) may be connected to nodes of the next layer.

The first neural network layer 210 and the second neural network layer 230 may be dense layers or fully connected layers.

In a DNN in which a dense layer-activation layer-dropout layer-dense layer structure is stacked, a statistics-based normalization layer according to an embodiment of the present invention may be employed instead of at least one of an activation layer and a dropout layer.

In one embodiment, the deep neural network learning apparatus performing statistics-based normalization may replace an existing activation layer (function) and dropout layer with the statistics-based normalization layer according to an embodiment of the present invention, as shown in the drawings.

In one embodiment, the statistics-based normalization layer may be used only as an activation layer even without a dropout layer.

In practice, statistics-based normalization layers may be provided in at least some of the plurality of neural network layers. For convenience of description, however, the present invention is described using, as an example, the first part of a DNN in which a dense layer, a statistics-based normalization layer, and a dense layer are stacked in that order, namely the first neural network layer 210, the statistics-based normalization layer 220, and the second neural network layer 230.

The statistics-based normalization layer 220 may perform statistics-based normalization based on output values of the activation function of the first neural network layer 210 .

As an embodiment, the statistics-based normalization layer 220 may apply an activation function to the first neural network layer 210, find an outlier in the output values of the activation function, and perform an operation to remove it.

As an embodiment, when the statistics-based normalization layer 220 is included in at least a portion of the neural network layers, an operation for finding outliers in output values of the activation function for each layer and removing them may be performed.

To this end, the statistics-based normalization layer 220 determines neurons to be excluded from the first neural network layer 210 based on statistical information of the output values transmitted from the activation function of the first neural network layer 210, and may output the output values of the remaining neurons of the first neural network layer 210, other than the excluded neurons, to the second neural network layer 230.

Here, the statistics-based normalization layer 220 can be regarded as employing a modified version of a general activation function. For example, the statistics-based normalization layer 220 may employ a variant of ReLU.

Here, the activation function refers to a non-linear function that changes an output value according to a certain criterion when outputting a value received from the transfer function.

The ReLU function (Rectified Linear Unit function) is an activation function widely used in the field of artificial intelligence. It outputs 0 when the input (x) is negative and outputs x when it is positive.

Here, a nonlinear function means a function that expresses a relationship between data that cannot be represented by a straight line. Neurons in the brain are also thought to change their output sharply at certain critical points when transmitting signals from one neuron to another. Although an artificial neural network operates in the digital domain, it mimics the structure of the brain and therefore uses an activation function that sets a threshold and changes the output value accordingly. There are several types of activation functions, such as the sigmoid function, the ReLU function, the identity function, and the softmax function.

FIG. 4 schematically illustrates the concept of an activation function and a transfer function in a perceptron. The transfer function computes the weighted sum of a node's inputs, adds a bias to the weighted sum, and sends the result to the activation function. A weight is an element that scales each input by a different amount, and a bias is a constant added to the weighted sum of all values input to a single neuron; it plays a role in adjusting the output value.
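The relationship between the transfer function and the activation function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the present invention: the function names `transfer`, `relu`, and `perceptron` and the numeric values are hypothetical.

```python
def transfer(inputs, weights, bias):
    """Transfer function: weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def relu(z):
    """ReLU activation: 0 for a negative input, the input itself otherwise."""
    return z if z > 0 else 0.0


def perceptron(inputs, weights, bias):
    """A single neuron: the transfer function feeds the activation function."""
    return relu(transfer(inputs, weights, bias))


# Arbitrary example values (assumptions for illustration only).
print(perceptron([1.0, 2.0], [0.5, -1.0], 0.25))  # weighted sum -1.25 -> 0.0
print(perceptron([1.0, 2.0], [0.5, 1.0], 0.25))   # weighted sum  2.75 -> 2.75
```

The bias shifts the weighted sum before the threshold of the activation function is applied, which is how it adjusts the output value.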

In one embodiment, the statistics-based normalization layer 220 may calculate the average and standard deviation of the output values of the activation function of the first neural network layer 210, and may determine an output value that deviates from the average by a predetermined criterion or more to be an outlier.

In one embodiment, the statistics-based normalization layer 220 may determine a value that is farther from the average of the output values of the activation function than a predetermined multiple of the standard deviation to be an outlier, and may exclude the neuron (node or unit) corresponding to the outlier from the first neural network layer 210.
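The outlier rule described above can be sketched as follows. The sketch assumes that only values above the average are treated as outliers, that the statistics are taken over the activations of a single layer for one input, and that the population standard deviation is used; none of these choices is fixed by the description. The name `find_outliers` and the sample values are hypothetical.

```python
import statistics


def find_outliers(values, k):
    """Return indices of values more than k standard deviations above the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    threshold = mean + k * std
    return [i for i, v in enumerate(values) if v > threshold]


# Activation outputs of one layer (arbitrary example values).
activations = [0.0, 0.2, 0.1, 0.3, 0.0, 5.0]
print(find_outliers(activations, k=2.0))  # [5]: only the neuron with output 5.0
print(find_outliers(activations, k=3.0))  # []: a larger k flags fewer values
```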

The method for determining the outlier by the statistics-based normalization layer 220 is not limited thereto, and other methods using statistical distribution may be employed.

Here, the statistics-based normalization layer 220 may exclude neurons corresponding to outliers by at least one of the following methods: converting the output value of the corresponding activation function to 0; the same method by which dropout excludes some neurons during learning; and reducing the degree of activation of the corresponding neurons.
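The three exclusion methods can be contrasted with a short sketch. It rests on assumptions that the description does not fix: the dropout-style variant is interpreted here as inverted dropout (zeroing plus rescaling of the surviving outputs), and the attenuation factor of 0.5 is arbitrary. All function names are hypothetical.

```python
def exclude_by_zeroing(values, outlier_idx):
    """Method 1: convert the outlier outputs to 0."""
    drop = set(outlier_idx)
    return [0.0 if i in drop else v for i, v in enumerate(values)]


def exclude_dropout_style(values, outlier_idx):
    """Method 2 (assumed to mean inverted dropout): zero the outliers and
    rescale the survivors by 1 / keep_ratio."""
    drop = set(outlier_idx)
    keep_ratio = 1.0 - len(drop) / len(values)
    if keep_ratio == 0.0:
        return [0.0 for _ in values]
    return [0.0 if i in drop else v / keep_ratio for i, v in enumerate(values)]


def exclude_by_attenuation(values, outlier_idx, factor=0.5):
    """Method 3: reduce the activation of the outliers instead of zeroing it."""
    drop = set(outlier_idx)
    return [v * factor if i in drop else v for i, v in enumerate(values)]


values = [1.0, 2.0, 8.0, 9.0]
outliers = [2, 3]  # assumed to have been flagged already
print(exclude_by_zeroing(values, outliers))      # [1.0, 2.0, 0.0, 0.0]
print(exclude_dropout_style(values, outliers))   # [2.0, 4.0, 0.0, 0.0]
print(exclude_by_attenuation(values, outliers))  # [1.0, 2.0, 4.0, 4.5]
```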

Whereas the conventional dropout method selects the neurons to be dropped at random, the statistics-based normalization layer 220 according to an embodiment of the present invention differs in that it selects the neurons to be removed based on statistical characteristics.
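The difference between the two selection rules can be illustrated as follows. This is a sketch under assumptions: the dropout rate, the value of k, the one-sided threshold, and the function names are all hypothetical choices made for the example.

```python
import random
import statistics


def dropout_selection(num_neurons, rate, rng):
    """Conventional dropout: each neuron is picked for removal at random."""
    return [i for i in range(num_neurons) if rng.random() < rate]


def statistical_selection(activations, k):
    """Statistics-based rule: pick neurons whose output exceeds mean + k * std."""
    threshold = statistics.fmean(activations) + k * statistics.pstdev(activations)
    return [i for i, a in enumerate(activations) if a > threshold]


activations = [0.0, 0.2, 0.1, 0.3, 0.0, 5.0]

# Dropout ignores the activation values, so the picks vary with the seed.
for seed in (0, 1, 2):
    print("dropout:", dropout_selection(len(activations), 0.5, random.Random(seed)))

# The statistical rule depends only on the activations, so it is repeatable.
print("statistical:", statistical_selection(activations, k=2.0))
```

Random selection may or may not hit the neuron with the unusually large output, whereas the statistical rule targets exactly that neuron.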

That is, the deep neural network learning apparatus performing normalization according to an embodiment of the present invention removes neurons corresponding to outliers based on statistical information of the input data in each neural network layer. Because outliers have a greater impact than other values on decision-making in the DNN learning and inference process, removing them increases the efficiency of normalization relative to the number of neurons removed and improves learning performance by better preserving meaningful information in the learning stage.

On the other hand, the activation function of the deep neural network learning apparatus performing statistical-based normalization according to an embodiment of the present invention can be applied differently to two cases.

In one embodiment, the activation function may be classified according to whether the parameter k used to determine whether an output value of the activation function is an outlier is a fixed constant or a differentiable (learnable) parameter.

This parameter k may be determined in the design stage of the artificial intelligence model.

First, when the parameter k used to determine whether an output value of the activation function is an outlier is set to a constant, the corresponding first activation function is given by Equation 1 below.

[Equation 1]

$$g(x) = \begin{cases} f(x), & f(x) \le \mu_l + k\,\sigma_l \\ 0, & \text{otherwise} \end{cases}$$

Here, $\mu_l$ is the average of the output values of f(x) in the l-th layer of the DNN, $\sigma_l$ is the standard deviation of the output values of f(x) in the l-th layer, and k is a constant or hyperparameter. As the value of k increases, the proportion of input values treated as outliers decreases.

When the first activation function is employed, the statistics-based normalization layer 220 determines, among the output values of f(x), values larger than the average plus k times the standard deviation to be outliers. A node corresponding to a determined outlier is then determined to be a node to be removed.
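A fixed-k activation of this kind can be sketched as a ReLU followed by zeroing of the outputs that exceed the average plus k times the standard deviation. The sketch assumes ReLU as f(x), per-layer statistics over a single input, and the population standard deviation; the names `first_activation` and `removed_count` and the sample values are hypothetical.

```python
import statistics


def relu(x):
    return x if x > 0 else 0.0


def first_activation(pre_activations, k):
    """ReLU, then zero every output above mean + k * std of the layer outputs."""
    outputs = [relu(x) for x in pre_activations]
    threshold = statistics.fmean(outputs) + k * statistics.pstdev(outputs)
    return [0.0 if y > threshold else y for y in outputs]


def removed_count(pre_activations, k):
    """Number of outputs that the outlier rule changed (that is, zeroed)."""
    before = [relu(x) for x in pre_activations]
    after = first_activation(pre_activations, k)
    return sum(1 for b, a in zip(before, after) if b != a)


layer_input = [-1.0, 0.5, 1.0, 1.5, 2.0, 4.0, -0.5, 9.0]
for k in (0.5, 1.0, 2.5):
    print(k, removed_count(layer_input, k))  # 2, then 1, then 0 removed
```

The loop shows the stated effect of k: the larger the constant, the smaller the share of outputs that is treated as outliers.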

Second, when the parameter k used to determine whether an output value of the activation function is an outlier is set as a learnable parameter whose value is updated in each layer during backpropagation, the corresponding second activation function is given by Equation 2 below.

[Equation 2]

Here, k is a learnable parameter. Its initial value is set to the value of k that gave the highest accuracy on the validation data in experiments applying Equation 1. In Equation 2, the g(x) of Equation 1 is multiplied by a factor that makes the output value decrease as k increases.

The output value is set to decrease as k increases for the following reason: when k is updated in the increasing direction during backpropagation, the proportion of input data regarded as outliers decreases, and the normalization effect weakens accordingly. The reduced output value compensates for this.
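The compensating behaviour can be illustrated with a small sketch. It assumes, purely for illustration, a scaling factor of the hypothetical form k0 / k, where k0 stands for the initial value of k; any factor that decreases with k would show the same qualitative effect, and this form should not be read as the factor of Equation 2. The function name and sample values are likewise hypothetical.

```python
import statistics


def second_activation(outputs, k, k0):
    """Zero the outliers as in the fixed-k case, then scale the survivors by
    the HYPOTHETICAL factor k0 / k so that the output shrinks as k grows."""
    threshold = statistics.fmean(outputs) + k * statistics.pstdev(outputs)
    scale = k0 / k  # assumed form, chosen only for illustration
    return [0.0 if y > threshold else scale * y for y in outputs]


outputs = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 0.0, 9.0]
print(second_activation(outputs, k=1.0, k0=1.0))
# [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 0.0, 0.0]
print(second_activation(outputs, k=2.0, k0=1.0))
# [0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 0.0, 0.0]
```

In a framework with automatic differentiation, a factor of this kind would also give k a gradient through the surviving outputs, which the hard threshold alone does not provide.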

On the other hand, if k is a learnable parameter, two more activation functions are possible. One connects the interval in which outliers are removed and the interval in which they are not by a linear function with a negative slope; the other connects the two intervals by an exponential function with a negative slope.

FIG. 5 illustrates an example of a graph comparing the performance of the first activation function (ZeroLiers), the second activation function (ZeroLiers-L-K), and dropout according to an embodiment of the present invention. The comparison is based on experimental results for a multilayer perceptron (MLP).

In detail, FIG. 5 plots the accuracy on the validation data (validation set) measured at each iteration when a multilayer perceptron is trained on CIFAR-10 with the statistics-based normalization layer according to an embodiment of the present invention applied as a modified function of each ReLU.

The first activation function and the second activation function are cases in which the statistics-based normalization layer according to an embodiment of the present invention is applied. In the first activation function, k is a constant used to determine whether input data is an outlier in each layer; as k increases, the proportion of input data determined to be outliers decreases. In the second activation function, by contrast, k is not a constant but a learnable parameter whose value is updated in each layer during backpropagation. Referring to FIG. 5, when the statistics-based normalization layer according to an embodiment of the present invention is applied, the accuracy of inference on the validation data rises much faster in the early stage of learning than with the conventional dropout method, and a higher accuracy is ultimately achieved.

In the deep neural network learning method that performs statistics-based normalization according to an embodiment of the present invention, neurons corresponding to outliers are removed using statistical information of the data in each layer, instead of removing random neurons. Because, owing to the nature of the operations, outliers have a greater impact than other values on decision-making in the DNN learning and inference process, removing them increases the efficiency of normalization relative to the number of neurons removed and improves learning performance by better preserving meaningful information in the learning stage.

Referring to FIG. 6 , in step S110 , the artificial neural network model may receive learning data from the input unit 120 . Accordingly, the first neural network layer 210 may receive learning data from the input unit 120 or the previous neural network layer.

In step S120, the first neural network layer 210 may obtain output values of the activation function using the input learning data.

In step S130, the statistics-based normalization layer 220 may determine an outlier by performing statistics-based normalization in order to use the output values of the activation function as input values of the second neural network layer 230.

Here, the first neural network layer 210 and the second neural network layer 230 are the most basic layers of the DNN, and nodes (neurons) may be connected to nodes of the next layer.

In one embodiment, the deep neural network learning apparatus performing statistics-based normalization may replace an existing activation layer and dropout layer with the statistics-based normalization layer according to an embodiment of the present invention, as shown in the drawings.

As an embodiment, the statistics-based normalization layer 220 may perform statistics-based normalization based on output values of an activation function of the first neural network layer 210 .

As an embodiment, the statistics-based normalization layer 220 may apply an activation function to the first neural network layer 210 and detect an outlier among output values of the activation function.

As an embodiment, when the statistics-based normalization layer 220 is provided in at least some of the plurality of neural network layers, an outlier may be detected among output values of an activation function for each layer. In this case, it is obvious that the outliers of each activation function corresponding to each layer are different.

Specifically, the statistics-based normalization layer 220 may employ a modified activation function of a general activation function. In one embodiment, the statistics-based normalization layer 220 may employ a variant of ReLU.

In one embodiment, the statistics-based normalization layer 220 may calculate the average and standard deviation of the output values of the activation function of the first neural network layer 210, and may determine output values that deviate from the average by a predetermined criterion or more to be outliers.

Here, the activation function of the deep neural network learning apparatus performing statistics-based normalization according to an embodiment of the present invention can be applied differently to two cases.

First, when the parameter k used to determine whether an output value of the activation function is an outlier is set to a constant, the first activation function of Equation 1 described above is used, in which the average and the standard deviation are those of the output values of f(x) in the l-th layer of the DNN.

Second, when k is a learnable parameter, the second activation function of Equation 2 described above is used, in which the g(x) of Equation 1 is multiplied by a factor so that the output value decreases as the value of k increases.

In step S140, the output value determined as an outlier may be removed.

To this end, the statistics-based normalization layer 220 determines neurons to be excluded from the first neural network layer 210 based on statistical information of the input data transmitted from the first neural network layer 210, and may exclude the determined neurons from the first neural network layer 210.

That is, the deep neural network learning method that performs normalization according to an embodiment of the present invention removes neurons corresponding to outliers based on statistical information of the input data in each neural network layer. Because outliers have a greater impact than other values on decision-making in the DNN learning and inference process, removing them increases the efficiency of normalization relative to the number of neurons removed and improves learning performance by better preserving meaningful information in the learning stage.

In step S150, the remaining output values, other than the excluded output values, may be output to the second neural network layer 230.
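Steps S110 to S150 can be traced in a compact forward pass. The sketch below is an illustration under assumptions rather than the claimed method itself: it uses ReLU as the activation function, zeroing as the removal method, population statistics over a single sample, and hypothetical names (`dense`, `forward`) and weights.

```python
import statistics


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return x if x > 0 else 0.0


def forward(sample, w1, b1, w2, b2, k):
    # S110: receive the learning data (here, a single sample).
    # S120: first neural network layer followed by its activation function.
    activated = [relu(z) for z in dense(sample, w1, b1)]
    # S130: determine outliers from the mean and standard deviation.
    threshold = statistics.fmean(activated) + k * statistics.pstdev(activated)
    # S140: remove (zero) the output values determined to be outliers.
    kept = [0.0 if a > threshold else a for a in activated]
    # S150: output the remaining values to the second neural network layer.
    return dense(kept, w2, b2)


# Hypothetical weights: four hidden neurons, one output neuron.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [4.0, 4.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0, 1.0]]
b2 = [0.0]

print(forward([1.0, 2.0], w1, b1, w2, b2, k=1.0))   # [6.0]: output 12.0 removed
print(forward([1.0, 2.0], w1, b1, w2, b2, k=10.0))  # [18.0]: nothing removed
```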

Specific embodiments of the present invention have been described in detail above. However, the spirit and scope of the present invention are not limited to these specific embodiments, and those of ordinary skill in the art to which the present invention belongs will understand that various modifications and variations are possible without changing the gist of the present invention.

Therefore, the embodiments described above are provided to fully inform those skilled in the art of the scope of the invention, and should be understood as illustrative in all respects and not limiting. The invention is defined only by the scope of the claims.

Claims

A deep neural network training device that performs statistics-based regularization, the device comprising:

an input unit for inputting learning data for training a deep neural network (DNN) model; and

a learning processor for training the DNN model using the learning data,

wherein the DNN model includes a statistics-based normalization layer between a first neural network layer and a second neural network layer, and

wherein the statistics-based normalization layer determines neurons to be excluded from the first neural network layer based on statistical information of output values of an activation function of the first neural network layer, and outputs the output values of the activation function corresponding to the remaining neurons to the second neural network layer.

The deep neural network training device of claim 1,

wherein the statistics-based normalization layer calculates the average and standard deviation of the output values of the activation function, and determines an output value that deviates from the average by a predetermined criterion or more to be an outlier.

The deep neural network training device of claim 2,

wherein the statistics-based normalization layer determines a value that is farther from the average of the output values of the activation function than a predetermined multiple of the standard deviation to be the outlier, and excludes neurons corresponding to the outlier.

The deep neural network training device of claim 3,

wherein the statistics-based normalization layer excludes the neurons corresponding to the outlier by executing at least one of: a method of converting the output value of the corresponding activation function to 0; the same method by which dropout excludes some neurons during learning; and a method of reducing the degree of activation of the corresponding neurons.

The deep neural network training device of claim 1,

wherein, when the parameter k used to determine whether an output value of the activation function is an outlier is set to a constant, the activation function corresponds to the following first activation function:

$$g(x) = \begin{cases} f(x), & f(x) \le \mu_l + k\,\sigma_l \\ 0, & \text{otherwise} \end{cases}$$

where $\mu_l$ is the average of the output values of f(x) in the l-th layer of the DNN, and $\sigma_l$ is the standard deviation of the output values of f(x) in the l-th layer.

The deep neural network training device of claim 5,

wherein, when the parameter k is set as a learnable parameter whose value is updated in each layer during backpropagation, the activation function corresponds to a second activation function in which k is a learnable parameter having an initial value and the g(x) of the first activation function is multiplied by a factor.

A deep neural network training method that performs statistics-based regularization, the method comprising:

receiving learning data;

obtaining output values of an activation function applied to a first neural network layer;

determining outliers by performing statistics-based normalization based on the output values;

removing the output values determined to be the outliers; and

outputting the remaining output values, other than the outliers, to a second neural network layer.

The deep neural network training method of claim 7,

wherein the determining of the outliers comprises calculating the average and standard deviation of the output values of the activation function, and determining an output value that deviates from the average by a predetermined criterion or more to be an outlier.

The deep neural network training method of claim 8,

wherein the determining of the outliers comprises determining a value that is farther from the average of the output values of the activation function than a predetermined multiple of the standard deviation to be the outlier.

The deep neural network training method of claim 8,

wherein the removing of the output values comprises executing at least one of: a method of converting the output value determined to be the outlier to 0; the same method by which dropout excludes some neurons during learning; and a method of reducing the degree of activation of the neurons corresponding to the outlier.

The deep neural network training method of claim 7,

wherein, when the parameter k used to determine whether an output value of the activation function is an outlier is set to a constant, the activation function corresponds to the following first activation function:

$$g(x) = \begin{cases} f(x), & f(x) \le \mu_l + k\,\sigma_l \\ 0, & \text{otherwise} \end{cases}$$

where $\mu_l$ is the average of the output values of f(x) in the l-th layer of the DNN, and $\sigma_l$ is the standard deviation of the output values of f(x) in the l-th layer.

The deep neural network training method of claim 11,

wherein, when the parameter k is set as a learnable parameter whose value is updated in each layer during backpropagation, the activation function corresponds to a second activation function in which k is a learnable parameter having an initial value and the g(x) of the first activation function is multiplied by a factor.