US20230325673A1 - Neural network training utilizing loss functions reflecting neighbor token dependencies - Google Patents
- Publication number
- US20230325673A1 (application number US 18/209,337)
- Authority
- US
- United States
- Prior art keywords
- token
- value
- neural network
- tag
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Definitions
- the present disclosure is generally related to computer systems, and is more specifically related to systems and methods for neural network training utilizing specialized loss functions.
- “neural network” herein shall refer to a computational model vaguely inspired by the biological neural networks that constitute human brains.
- the neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such a system would “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules.
- an example method of neural network training utilizing loss functions reflecting neighbor token dependencies may comprise: receiving a training dataset comprising a plurality of labeled tokens; determining, by a neural network, a first tag associated with a current token processed by the neural network, a second tag associated with a previous token which has been processed by the neural network before processing the current token, and a third tag associated with a next token to be processed by the neural network after processing the current token; computing, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current token by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous token by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next token by the training dataset;
- another example method of neural network training utilizing loss functions reflecting neighbor token dependencies may comprise: receiving a training dataset comprising a plurality of labeled natural language words, wherein each label identifies a part of speech (POS) associated with a respective word; determining, by a neural network, a first tag identifying a first POS associated with a current word processed by the neural network, a second tag identifying a second POS associated with a previous word which has been processed by the neural network before processing the current word, and a third tag identifying a third POS associated with a next word to be processed by the neural network after processing the current word; computing, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current word by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous word by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next word by the training dataset;
- an example computer-readable non-transitory storage medium may comprise executable instructions that, when executed by a computer system, cause the computer system to: receive a training dataset comprising a plurality of labeled natural language words, wherein each label identifies a part of speech (POS) associated with a respective word; determine, by a neural network, a first tag identifying a first POS associated with a current word processed by the neural network, a second tag identifying a second POS associated with a previous word which has been processed by the neural network before processing the current word, and a third tag identifying a third POS associated with a next word to be processed by the neural network after processing the current word; compute, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current word by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous word by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next word by the training dataset.
- FIG. 1 schematically illustrates an example baseline neural network which may be utilized for performing sequence labeling tasks, e.g., part-of-speech (POS) tagging, in accordance with one or more aspects of the present disclosure
- FIG. 2 schematically illustrates an example neural network operating in accordance with one or more aspects of the present disclosure
- FIG. 3 depicts a flow diagram of an example method of neural network training utilizing loss functions reflecting neighbor token dependencies, in accordance with one or more aspects of the present disclosure
- FIG. 4 depicts a flow diagram of an example method of neural-network-based sequence labeling, in accordance with one or more aspects of the present disclosure.
- FIG. 5 depicts a component diagram of an example computer system which may be employed for implementing the methods described herein.
- Neural networks trained by the methods described herein may be utilized for sequence labeling, i.e., processing an input sequence of tokens and associating each token with a label of a predetermined set of labels.
- POS tagging refers to assigning, to each word of the natural language text, a tag identifying a set of grammatical, morphological, and/or semantic attributes.
- a neural network trained by the methods described herein is capable of resolving the homonymy, i.e., utilizing the context (relationships between words) for distinguishing between identical words having different meanings.
- the tags produced by neural networks trained by the methods described herein may identify various other grammatical and/or morphological attributes of words of natural language texts processed by the neural networks.
- Each tag may be represented by a tuple of grammatical and/or morphological attributes associated with a natural language word.
- These tags may be utilized for performing a wide range of natural language processing tasks, e.g., for performing syntactic and/or semantic analysis of natural language texts, machine translation, named entity recognition, etc.
- a neural network includes multiple connected nodes called “artificial neurons,” which loosely simulate the neurons in a human brain.
- Each connection, like a synapse in the human brain, can transmit a signal from one artificial neuron to another.
- An artificial neuron that receives a signal processes it and then transmits the transformed signal to other artificial neurons.
- the output of each artificial neuron is computed by a function of a linear combination of its inputs.
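The computation of a single artificial neuron's output, an activation function applied to a linear combination of its inputs, can be sketched as follows (the particular weights, bias, and the choice of a tanh activation are illustrative assumptions, not details from the disclosure):

```python
import math

def neuron_output(inputs, weights, bias):
    # Linear combination of the inputs, followed by a non-linear activation.
    linear = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(linear)

# A neuron with two inputs and arbitrary example weights.
y = neuron_output([1.0, -2.0], [0.5, 0.25], bias=0.1)
```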
- the connections between artificial neurons are called “edges.”
- Edge weights, which amplify or attenuate the signals transmitted through the respective edges, are defined at the network training stage based on a training dataset that includes a plurality of labeled inputs (i.e., inputs with known classification).
- all the edge weights are initialized to random or predetermined values.
- the neural network is activated.
- the observed output of the neural network is compared with the desired output specified by the training data set, and the error is propagated back to the previous layers of the neural network, in which the weights are adjusted accordingly. This process may be repeated until the output error falls below a predetermined threshold.
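The iterative training procedure described above (compare the observed output with the desired output, propagate the error back, adjust the weights, repeat until the error falls below a threshold) can be sketched with a deliberately tiny one-weight model; the model, learning rate, and training data here are illustrative assumptions:

```python
def train(samples, lr=0.1, threshold=1e-4, max_steps=10_000):
    """Fit y = w * x by gradient descent until the error is small enough."""
    w = 0.0  # weight initialized to a predetermined value
    error = float("inf")
    for _ in range(max_steps):
        # Observed vs. desired output: mean squared error over the dataset.
        error = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if error < threshold:
            break
        # Gradient of the error with respect to w, "propagated back".
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w, error

# Data generated by y = 2x; training should recover w close to 2.
w, err = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```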
- the methods described herein utilize recurrent neural networks, which are capable of maintaining a network state reflecting information about the inputs that have already been processed, thus allowing the network to use its internal state for processing subsequent inputs.
- common recurrent neural networks are susceptible to the gradient attenuation (vanishing gradient) effect, which renders a network practically incapable of processing long input sequences (such as input sequences of more than five tokens).
- the gradient attenuation effect may be avoided by utilizing long short-term memory (LSTM) layers, which utilize a gating mechanism allowing the network to choose, at each processing step, between its own state and the input. Since LSTM neural networks exhibit very low gradient attenuation, such networks are capable of processing longer input sequences (such as input sequences of tens of tokens).
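The gating mechanism can be sketched as a single LSTM cell step in NumPy. This is a minimal illustration of the standard LSTM equations, not the disclosure's implementation; the weight shapes are arbitrary and bias terms are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, params):
    """One LSTM step: the gates let the cell choose between its own
    state and the new input, mitigating gradient attenuation."""
    Wf, Wi, Wo, Wc = params  # each maps [h_prev; x] to the hidden size
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)  # forget gate: how much of the old state to keep
    i = sigmoid(Wi @ z)  # input gate: how much of the new input to admit
    o = sigmoid(Wo @ z)  # output gate
    c = f * c_prev + i * np.tanh(Wc @ z)  # updated cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(2)
HIDDEN, INPUT = 3, 2
params = [rng.normal(size=(HIDDEN, HIDDEN + INPUT)) for _ in range(4)]
h, c = lstm_cell(rng.normal(size=INPUT), np.zeros(HIDDEN), np.zeros(HIDDEN), params)
```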
- FIG. 1 schematically illustrates an example baseline neural network which may be utilized for performing sequence labeling tasks, e.g., POS tagging.
- the baseline neural network includes the feature extraction layer 110 , the BiLSTM layer 130 , and the prediction layer 140 .
- the feature extraction layer 110 is employed for producing feature vectors representing the input tokens 120 A- 120 N, which are sequentially fed to the feature extraction layer 110 .
- each feature vector may be represented by a corresponding embedding, i.e., a vector of real numbers which may be produced, e.g., by a neural network implementing a mathematical transformation from a space with one dimension per word to a continuous vector space with a much lower dimension.
- the feature extraction layer 110 may utilize a predetermined set of embeddings, which may be pre-built on a large corpus of natural language texts. Accordingly, word embeddings carry the semantic information and at least some morphological information about the words, such that words which are utilized in similar contexts, as well as synonyms, are assigned feature vectors located close to each other in the feature space.
- the BiLSTM layer 130 processes the feature vectors produced by the feature extraction layer 110 and yields a set of vectors, such that each vector encodes information about a corresponding input token and its context.
- the prediction layer 140 which may be implemented as a feed-forward network, processes the set of vectors produced by the BiLSTM layer 130 and for each vector yields a tag of a predetermined set of tags 150 A- 150 N (e.g., a tag indicative of a POS of the corresponding input token).
- the network training may involve processing, by the neural network, a training dataset that may include one or more input sequences with classification tags assigned to each token (e.g., a corpus of natural language texts with part of speeches assigned to each word).
- a value of a loss function may be computed based on the observed output of the neural network (i.e., the tag produced by the neural network for a given token) and the desired output specified by the training dataset for the same token.
- the error reflected by the loss function may be propagated back to the previous layers of the neural network, in which the edge weights and/or other network parameters may be adjusted accordingly in order to minimize the loss function. This process may be repeated until the value of the loss function stabilizes in the vicinity of a predetermined value or falls below a predetermined threshold.
- the baseline neural network 100 may be enhanced by adding two secondary outputs which would, in addition to the tag yielded by the prediction layer 140 for the current token, yield a tag associated with the previous token and a tag associated with the next token, as schematically illustrated by FIG. 2 .
- the three tags may be utilized at the network training stage for computing the loss function, thus forcing the network to recognize the relationships between neighboring tokens.
- the neural networks and training methods described herein represent significant improvements over various common systems and methods.
- employing loss functions that are specifically aimed at training the neural network to recognize neighbor token dependencies yields significant improvement of the overall quality and efficiency of sequence labeling methods.
- FIG. 2 schematically illustrates an example neural network operating in accordance with one or more aspects of the present disclosure.
- the example neural network 200 may be utilized for performing sequence labeling tasks, e.g., POS tagging.
- the example neural network 200 includes the feature extraction layer 210 , the BiLSTM layer 220 , and the prediction layer 230 .
- the baseline neural network 100 may process word embeddings, which are built in such a way that words which are utilized in similar contexts, as well as synonyms, are assigned feature vectors located close to each other in the feature space.
- a word embedding matrix in which every dictionary word is mapped to a vector in the feature space, while having an enormous size, would still be unable to produce an embedding corresponding to a word which is not found in the dictionary.
- the relatively large size of a word embedding vector is explained by the fact that the vector carries the semantic information about the initial word, while such information may not always be useful on the labeling task (e.g., POS tagging) to be performed by the neural network.
- the neural networks implemented in accordance with one or more aspects of the present disclosure are designed to process inputs which, in addition to word-level embeddings, may include character-level embeddings and grammeme-level embeddings.
- the character-level embeddings do not rely on a dictionary, but rather view each input token as a sequence of characters.
- a vector may be assigned to a given input token, e.g., by processing an input sequence of tokens (e.g., a natural language text represented by a sequence of words) by a neural network (such as an LSTM network and/or a fully-connected network).
- the input tokens may be truncated to a predetermined size (e.g., 12 characters).
- Character-level embeddings carry grammatical and/or morphological information about the input tokens.
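Preparing a character-level input along these lines can be sketched as follows: each token is viewed as a sequence of characters, mapped to integer indices and truncated to a predetermined size (12 characters, as in the example above). The alphabet and the zero-padding scheme are illustrative assumptions; a downstream LSTM or fully-connected network would then turn this index sequence into an embedding vector:

```python
MAX_CHARS = 12
PAD = 0  # index reserved for padding / unknown characters

def char_indices(token, alphabet):
    """Map a token to a fixed-length sequence of character indices."""
    char_to_idx = {c: i + 1 for i, c in enumerate(alphabet)}  # 0 is padding
    idx = [char_to_idx.get(c, PAD) for c in token[:MAX_CHARS]]  # truncate
    idx += [PAD] * (MAX_CHARS - len(idx))                       # pad
    return idx

# A long word is truncated to its first 12 characters.
seq = char_indices("internationalization", "abcdefghijklmnopqrstuvwxyz")
```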
- the grammeme-level embedding of a given word may be produced by a neural network that, for each input word, would construct a vector each element of which is related to a specific grammatical attribute of the word (e.g., reflects a probability of the input word to be associated with the specific grammatical attribute).
- the neural network may apply an additional dense layer to the intermediate representation of the word, such that the resulting vector produced by the neural network would represent not only individual grammatical attributes, but also certain interactions between them.
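A grammeme-level embedding of this kind can be sketched in NumPy: a vector with one probability per grammatical attribute is passed through an additional dense layer, so the resulting vector can also reflect interactions between attributes. The attribute set, output dimension, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Intermediate representation: one probability per grammatical attribute.
attributes = ["noun", "verb", "adjective", "plural", "past_tense"]
grammeme_probs = np.array([0.7, 0.1, 0.2, 0.9, 0.0])

# An additional dense layer mixes the individual attribute probabilities,
# so each output element can depend on several attributes at once.
W = rng.normal(size=(8, len(attributes)))  # 8-dimensional output, arbitrary
b = np.zeros(8)
grammeme_embedding = np.maximum(0.0, W @ grammeme_probs + b)  # ReLU
```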
- the neural network 200 is designed to process character-level embeddings and grammeme-level embeddings.
- the feature extraction layer 210 is employed for producing feature vectors representing the input tokens, which are sequentially fed to it.
- the state of the network reflects the “current” token 202 being processed by the network, as well as the “previous” token 204 which has already been processed by the network; the network attempts to predict certain features of the “next” token 206 to be processed.
- the feature extraction layer 210 produces the grammeme embeddings 212 (e.g., by processing the input tokens by an LSTM network and/or a fully-connected network) and character-level embeddings 214 (e.g., by processing the input tokens by another LSTM network and/or a fully-connected network).
- the grammeme embeddings 212 are then fed to the dense layer 216 , the output of which is concatenated with the character-level embeddings 214 and fed to the dense layer 218 .
- a dense layer performs a transformation in which every input is connected to every output by a linear transformation characterized by a weight value, which may be followed by a non-linear activation function (e.g., ReLU, Softmax, etc.).
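A dense layer of this kind can be sketched in NumPy: a linear transformation in which every input is connected to every output through a weight matrix, optionally followed by a non-linear activation. The input values and weights below are arbitrary illustrations:

```python
import numpy as np

def dense(x, W, b, activation=None):
    """Fully-connected transform: every input feeds every output via W."""
    z = W @ x + b
    if activation == "relu":
        return np.maximum(0.0, z)
    if activation == "softmax":
        e = np.exp(z - z.max())  # shifted for numerical stability
        return e / e.sum()
    return z

x = np.array([0.2, -1.0, 0.5])
W = np.ones((4, 3)) * 0.1  # uniform weights, so all 4 outputs are equal
probs = dense(x, W, np.zeros(4), activation="softmax")
```

With equal pre-activation scores, the softmax yields a uniform distribution over the four outputs.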
- the neural networks implementing the methods described herein may process input vectors representing any combinations (e.g., concatenations) of word-level embeddings, character-level embeddings, and/or grammeme-level embeddings.
- the output of the dense layer 218 is fed to the backward LSTM 224 and forward LSTM 226 of the LSTM layer 220 .
- the outputs of the LSTMs 224 - 226 are fed to the BiLSTM 228 , the output of which in the main operational mode is processed by the main prediction pipeline of the prediction layer 230 (i.e., the dense layer 234 optionally followed by the conditional random field (CRF) 238 ) in order to produce the tag 246 associated with the current token 202 .
- two auxiliary prediction pipelines of the prediction layer 230 may be utilized, such that the first auxiliary prediction pipeline that includes the dense layers 232 and 233 receives its input from the backward LSTM 224 and produces the tag 242 associated with the previous token 204 ; the second auxiliary prediction pipeline that includes the dense layers 236 and 237 receives its input from the forward LSTM 226 and produces the tag 248 associated with the next token 206 .
- the loss function may be computed which takes into account the differences between the respective predicted tags and the tags specified by the training dataset for the current, previous, and next tokens.
- the two auxiliary prediction pipelines of the prediction layer 230 are only utilized in the network training mode.
- the loss function may be represented as a weighted sum reflecting the differences between the respective predicted tags and the tags specified by the training dataset for the current, previous, and next tokens:
- Loss = w 1 ·d(T prev , T′ prev ) + w 2 ·d(T cur , T′ cur ) + w 3 ·d(T next , T′ next ), where:
- d is the distance metric in the tag space
- w 1 , w 2 , and w 3 are the weight coefficients
- T prev is the tag produced by the neural network for the previous token
- T′ prev is the tag associated with the previous token by the training dataset
- T cur is the tag produced by the neural network for the current token
- T′ cur is the tag associated with the current token by the training dataset
- T next is the tag produced by the neural network for the next token
- T′ next is the tag associated with the next token by the training dataset.
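The weighted sum over the three tags can be sketched in NumPy. Cross-entropy against one-hot labels is used here as one possible choice of the distance metric d in the tag space; the weight values and the three-tag set are illustrative assumptions:

```python
import numpy as np

def cross_entropy(predicted, true_label):
    """d(T, T'): distance between a predicted tag distribution and the
    index of the label specified by the training dataset."""
    return -np.log(predicted[true_label] + 1e-12)

def neighbor_loss(t_prev, t_cur, t_next, y_prev, y_cur, y_next,
                  w1=0.25, w2=0.5, w3=0.25):
    # Weighted sum over the previous, current, and next tokens.
    return (w1 * cross_entropy(t_prev, y_prev)
            + w2 * cross_entropy(t_cur, y_cur)
            + w3 * cross_entropy(t_next, y_next))

# Predicted tag distributions over a 3-tag set for the three tokens.
t_prev = np.array([0.8, 0.1, 0.1])
t_cur  = np.array([0.1, 0.7, 0.2])
t_next = np.array([0.2, 0.2, 0.6])
loss = neighbor_loss(t_prev, t_cur, t_next, y_prev=0, y_cur=1, y_next=2)
```

Minimizing this sum pushes the network to get not only the current token's tag right, but also its predictions for the neighboring tokens.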
- utilizing the loss function based on the three tags would force the neural network to recognize neighbor token dependencies (e.g., relationships between the neighboring tokens of the input sequences) and would thus yield a significant improvement of the overall quality and efficiency of sequence labeling methods.
- FIG. 3 depicts a flow diagram of an example method 300 of neural network training utilizing loss functions reflecting neighbor token dependencies, in accordance with one or more aspects of the present disclosure.
- Method 300 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., example computer system 500 of FIG. 5 ) executing the method.
- method 300 may be performed by a single processing thread.
- method 300 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method.
- the processing threads implementing method 300 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 300 may be executed asynchronously with respect to each other. Therefore, while FIG. 3 and the associated description list the operations of method 300 in a certain order, various implementations of the method may perform at least some of the described operations in parallel and/or in arbitrarily selected orders.
- a computer system implementing the method may receive a training dataset comprising a plurality of labeled tokens (e.g., a natural language text in which each word is labeled by a tag identifying a grammatical attribute of the word, such as a POS associated with the word).
- the computer system may determine, by a neural network, the first tag associated with the current token processed by the neural network, the second tag associated with the previous token which has been processed by the neural network before processing the current token, and the third tag associated with the next token to be processed by the neural network after processing the current token.
- the tags may represent respective grammatical attributes (such as POS) associated with the tokens.
- the neural network may include a feature extraction layer, a bi-directional long short-term memory (BiLSTM) layer, and a prediction layer, such that the BiLSTM layer further includes a BiLSTM, a backward LSTM, and a forward LSTM, and the outputs of the backward LSTM and the forward LSTM are fed to the BiLSTM, as described in more detail herein above.
- the computer system may compute, for the training dataset, a value of a loss function reflecting the differences between the respective computed tags and corresponding labels specified by the training dataset.
- the loss function may be represented by a weighted sum of the difference of the computed tag for the current token and the label associated with the current token by the training dataset, the difference of the computed tag for the previous token and the label associated with the previous token by the training dataset, and the difference of the computed tag for the next token and the label associated with the next token by the training dataset, as described in more detail herein above.
- the computer system may adjust, based on the computed value of the loss function, one or more parameters of the neural network which undergoes the training.
- the error reflected by the loss function value is back-propagated starting from the last layer of the neural network, and the weights and/or other network parameters are adjusted in order to minimize the loss function.
- the process described by blocks 320 - 340 may be repeated until the value of the loss function stabilizes in the vicinity of a certain value or falls below a predetermined threshold.
- the computer system may employ the trained neural network for performing a sequence labeling task, such as a natural language processing task (e.g., POS tagging) of one or more input natural language texts, and the method may terminate.
- FIG. 4 depicts a flow diagram of an example method 400 of neural-network-based sequence labeling, in accordance with one or more aspects of the present disclosure.
- Method 400 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., example computer system 500 of FIG. 5 ) executing the method.
- method 400 may be performed by a single processing thread.
- method 400 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method.
- the processing threads implementing method 400 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 400 may be executed asynchronously with respect to each other. Therefore, while FIG. 4 and the associated description list the operations of method 400 in a certain order, various implementations of the method may perform at least some of the described operations in parallel and/or in arbitrarily selected orders.
- a computer system implementing the method may receive an input dataset comprising a plurality of tokens (e.g., a natural language text comprising a plurality of words).
- the computer system may employ a neural network (e.g., a neural network having the architecture of the neural network 200 of FIG. 2 ) to compute the feature vectors representing the respective tokens.
- Each feature vector may be represented by a combination of a word embedding, a character-level embedding, and/or a grammeme embedding representing the corresponding token, as described in more detail herein above.
- the computer system may process the feature vectors produced by the feature extraction layer and yield a set of vectors, such that each vector encodes information about a corresponding input token and its context, as described in more detail herein above.
- the computer system may process the set of information-encoding vectors and for each vector may yield a tag of a predetermined set of tags (e.g., a tag indicative of a grammatical attribute of the corresponding input token).
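The per-vector tagging step can be sketched as scoring each tag of the predetermined set and selecting the highest-scoring one. The tag set, the linear scoring function, and the random inputs below are illustrative assumptions standing in for the trained prediction layer:

```python
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ"]  # illustrative predetermined tag set

def predict_tags(vectors, W, b):
    """For each information-encoding vector, score every tag and pick
    the highest-scoring one."""
    tags = []
    for v in vectors:
        scores = W @ v + b
        tags.append(TAGS[int(np.argmax(scores))])
    return tags

rng = np.random.default_rng(1)
vectors = [rng.normal(size=4) for _ in range(3)]  # stand-ins for BiLSTM outputs
W = rng.normal(size=(len(TAGS), 4))               # stand-in for trained weights
tags = predict_tags(vectors, W, np.zeros(len(TAGS)))
```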
- FIG. 5 depicts a component diagram of an example computer system which may be employed for implementing the methods described herein.
- the computer system 500 may be connected to other computer systems in a LAN, an intranet, an extranet, or the Internet.
- the computer system 500 may operate in the capacity of a server or a client computer system in client-server network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment.
- the computer system 500 may be provided by a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, or any computer system capable of executing a set of instructions (sequential or otherwise) that specify operations to be performed by that computer system.
- Exemplary computer system 500 includes a processor 502 , a main memory 504 (e.g., read-only memory (ROM) or dynamic random access memory (DRAM)), and a data storage device 518 , which communicate with each other via a bus 530 .
- Processor 502 may be represented by one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processor 502 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. Processor 502 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. Processor 502 is configured to execute instructions 526 for performing the methods described herein.
- Computer system 500 may further include a network interface device 522 , a video display unit 510 , a character input device 512 (e.g., a keyboard), and a touch screen input device 514 .
- Data storage device 518 may include a computer-readable storage medium 524 on which is stored one or more sets of instructions 526 embodying any one or more of the methods or functions described herein. Instructions 526 may also reside, completely or at least partially, within main memory 504 and/or within processor 502 during execution thereof by computer system 500 , main memory 504 and processor 502 also constituting computer-readable storage media. Instructions 526 may further be transmitted or received over network 516 via network interface device 522 .
- instructions 526 may include instructions of method 300 of neural network training utilizing loss functions reflecting neighbor token dependencies, implemented in accordance with one or more aspects of the present disclosure.
- instructions 526 may include instructions of method 400 of neural-network-based sequence labeling, implemented in accordance with one or more aspects of the present disclosure.
- While the computer-readable storage medium 524 is shown in the example of FIG. 5 to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
- The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methods of the present disclosure.
- The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
- the methods, components, and features described herein may be implemented by discrete hardware components or may be integrated in the functionality of other hardware components such as ASICs, FPGAs, DSPs or similar devices.
- the methods, components, and features may be implemented by firmware modules or functional circuitry within hardware devices.
- the methods, components, and features may be implemented in any combination of hardware devices and software components, or only in software.
- the present disclosure also relates to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions.
Description
- This application is a continuation of U.S. patent application Ser. No. 16/236,382, filed Dec. 29, 2018, which claims priority under 35 USC § 119 to Russian patent application No. 2018146352 filed Dec. 25, 2018. Both above-referenced applications are incorporated by reference herein.
- The present disclosure is generally related to computer systems, and is more specifically related to systems and methods for neural network training utilizing specialized loss functions.
- “Neural network” herein shall refer to a computational model vaguely inspired by the biological neural networks that constitute human brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such a system would “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules.
- In accordance with one or more aspects of the present disclosure, an example method of neural network training utilizing loss functions reflecting neighbor token dependencies may comprise: receiving a training dataset comprising a plurality of labeled tokens; determining, by a neural network, a first tag associated with a current token processed by the neural network, a second tag associated with a previous token which has been processed by the neural network before processing the current token, and a third tag associated with a next token to be processed by the neural network after processing the current token; computing, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current token by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous token by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next token by the training dataset; and adjusting a parameter of the neural network based on the value of the loss function.
- In accordance with one or more aspects of the present disclosure, another example method of neural network training utilizing loss functions reflecting neighbor token dependencies may comprise: receiving a training dataset comprising a plurality of labeled natural language words, wherein each label identifies a part of speech (POS) associated with a respective word; determining, by a neural network, a first tag identifying a first POS associated with a current word processed by the neural network, a second tag identifying a second POS associated with a previous word which has been processed by the neural network before processing the current word, and a third tag identifying a third POS associated with a next word to be processed by the neural network after processing the current word; computing, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current word by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous word by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next word by the training dataset; and adjusting a parameter of the neural network based on the value of the loss function.
- In accordance with one or more aspects of the present disclosure, an example computer-readable non-transitory storage medium may comprise executable instructions that, when executed by a computer system, cause the computer system to: receive a training dataset comprising a plurality of labeled natural language words, wherein each label identifies a part of speech (POS) associated with a respective word; determine, by a neural network, a first tag identifying a first POS associated with a current word processed by the neural network, a second tag identifying a second POS associated with a previous word which has been processed by the neural network before processing the current word, and a third tag identifying a third POS associated with a next word to be processed by the neural network after processing the current word; compute, for the training dataset, a value of a loss function reflecting a first loss value, a second loss value, and a third loss value, wherein the first loss value is represented by a first difference of the first tag and a first label associated with the current word by the training dataset, wherein the second loss value is represented by a second difference of the second tag and a second label associated with the previous word by the training dataset, and wherein the third loss value is represented by a third difference of the third tag and a third label associated with the next word by the training dataset; and adjust a parameter of the neural network based on the value of the loss function.
- The present disclosure is illustrated by way of examples, and not by way of limitation, and may be more fully understood with references to the following detailed description when considered in connection with the figures, in which:
-
FIG. 1 schematically illustrates an example baseline neural network which may be utilized for performing sequence labeling tasks, e.g., part-of-speech (POS) tagging, in accordance with one or more aspects of the present disclosure; -
FIG. 2 schematically illustrates an example neural network operating in accordance with one or more aspects of the present disclosure; -
FIG. 3 depicts a flow diagram of an example method of neural network training utilizing loss functions reflecting neighbor token dependencies, in accordance with one or more aspects of the present disclosure; -
FIG. 4 depicts a flow diagram of an example method of neural-network-based sequence labeling, in accordance with one or more aspects of the present disclosure; and -
FIG. 5 depicts a component diagram of an example computer system which may be employed for implementing the methods described herein. - Described herein are methods and systems for neural network training utilizing loss functions reflecting neighbor token dependencies (e.g., relationships between the tokens of the input sequence being processed by the neural network). Neural networks trained by the methods described herein may be utilized for sequence labeling, i.e., processing an input sequence of tokens and associating each token with a label of a predetermined set of labels. The sequence labeling task may be defined as follows: producing, for an input sequence of tokens w1, . . . , wn =: w1^n, a corresponding sequence of tags t1, . . . , tn =: t1^n; ti ∈ T, where T denotes a set of possible tags.
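- The token-to-tag mapping defined above may be sketched as follows. This is a hypothetical, dictionary-based stand-in for the trained network described later; the tag set and lookup table are illustrative assumptions only:

```python
# Hypothetical sketch of the sequence labeling interface: an input
# sequence of tokens w1..wn is mapped to a tag sequence t1..tn, with
# every tag drawn from a fixed tag set T. The lookup table stands in
# for the neural network described in the remainder of the disclosure.
TAG_SET = {"NOUN", "VERB", "DET", "ADJ", "UNKNOWN"}

TOY_MODEL = {"the": "DET", "cat": "NOUN", "sat": "VERB", "quick": "ADJ"}

def label_sequence(tokens):
    """Produce exactly one tag per input token; tags come from TAG_SET."""
    return [TOY_MODEL.get(token.lower(), "UNKNOWN") for token in tokens]

print(label_sequence(["The", "quick", "cat", "sat"]))
# ['DET', 'ADJ', 'NOUN', 'VERB']
```

Unlike this context-free lookup, the networks described below assign tags using the surrounding tokens as well.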
- An example of the sequence labeling task is part-of-speech (POS) tagging, such that a neural network would process a natural language text and assign a POS-identifying tag to each word of the natural language text. “Part of speech” herein shall refer to a category of words. Words that are assigned to the same part of speech generally exhibit similar morphological attributes (e.g., similar inflection patterns). Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and article. In certain implementations, POS tagging refers to assigning, to each word of the natural language text, a tag identifying a set of grammatical, morphological, and/or semantic attributes.
- Accordingly, the POS labeling task may be defined as follows: producing, for an input natural language text represented by a sequence of words w1, . . . , wn =: w1^n, a corresponding sequence of tags t1, . . . , tn =: t1^n; ti ∈ T, where T denotes a set of defined parts of speech. In particular, a neural network trained by the methods described herein is capable of resolving homonymy, i.e., utilizing the context (relationships between words) for distinguishing between identical words having different meanings.
- In other examples, the tags produced by neural networks trained by the methods described herein may identify various other grammatical and/or morphological attributes of words of natural language texts processed by the neural networks. Each tag may be represented by a tuple of grammatical and/or morphological attributes associated with a natural language word. These tags may be utilized for performing a wide range of natural language processing tasks, e.g., for performing syntactic and/or semantic analysis of natural language texts, machine translation, named entity recognition, etc.
- A neural network includes multiple connected nodes called “artificial neurons,” which loosely simulate the neurons in a human brain. Each connection, like the synapses in the human brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal would process it and then transmit the transformed signal to other artificial neurons. In common neural network implementations, the output of each artificial neuron is computed by a function of a linear combination of its inputs. The connections between artificial neurons are called “edges.” Edge weights, which amplify or attenuate the signals being transmitted through respective edges, are defined at the network training stage based on a training dataset that includes a plurality of labeled inputs (i.e., inputs with known classification). In an illustrative example, all the edge weights are initialized to random or predetermined values. For every input in the training dataset, the neural network is activated. The observed output of the neural network is compared with the desired output specified by the training dataset, and the error is propagated back to the previous layers of the neural network, in which the weights are adjusted accordingly. This process may be repeated until the output error falls below a predetermined threshold.
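- The training procedure outlined above (activate the network, compare the observed output with the desired output, adjust the weights, repeat until the error is small) may be illustrated by the following hypothetical sketch, in which a single linear neuron stands in for a full multi-layer network:

```python
# Minimal, hypothetical illustration of the training loop: weights start
# at arbitrary values, the observed output is compared with the desired
# output, and the weights are nudged to reduce the error, repeating until
# the error falls below a threshold. Real networks use backpropagation
# through many layers; a single linear "neuron" suffices to show the idea.
def train_neuron(samples, lr=0.1, threshold=1e-4, max_epochs=1000):
    w, b = 0.0, 0.0  # predetermined initial weights
    for _ in range(max_epochs):
        total_error = 0.0
        for x, desired in samples:
            observed = w * x + b          # activate the "network"
            error = observed - desired    # compare with the desired output
            w -= lr * error * x           # adjust weights to reduce error
            b -= lr * error
            total_error += error ** 2
        if total_error < threshold:       # stop once the error is small
            break
    return w, b

# Learn the mapping y = 2x + 1 from three labeled examples.
w, b = train_neuron([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
print(w, b)  # close to 2.0 and 1.0
```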
- The methods described herein utilize recurrent neural networks, which are capable of maintaining a network state reflecting the information about the inputs which have been processed by the network, thus allowing the network to use its internal state for processing subsequent inputs. However, common recurrent neural networks are susceptible to the gradient attenuation effect, which renders a network practically incapable of processing long input sequences (such as input sequences of more than five tokens).
- The gradient attenuation effect may be avoided by utilizing long short-term memory (LSTM) layers, which utilize a gating mechanism allowing the network to choose, at each processing step, between its own state and the input. Since LSTM neural networks exhibit very low gradient attenuation, such networks are capable of processing longer input sequences (such as input sequences of tens of tokens).
- However, a common LSTM neural network would only yield information about one of the two generally available contexts (left or right) of a given word. Accordingly, the systems and methods of the present disclosure utilize bi-directional LSTM networks (BiLSTM). A BiLSTM outputs a concatenation of the forward and backward passes of an ordinary LSTM.
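- The forward/backward concatenation performed by a BiLSTM may be sketched as follows; a toy running-state recurrence stands in for a real LSTM cell, so the code is illustrative only:

```python
# Hypothetical sketch of the bi-directional idea: one pass over the
# sequence left-to-right, one over the reversed sequence, and a
# per-position concatenation of both hidden states, so every position
# sees both its left and its right context.
def run_recurrent(xs):
    state, states = 0.0, []
    for x in xs:
        state = 0.5 * state + x   # toy recurrence in place of an LSTM cell
        states.append(state)
    return states

def bilstm_outputs(xs):
    forward = run_recurrent(xs)
    backward = run_recurrent(xs[::-1])[::-1]  # re-align to input order
    return list(zip(forward, backward))       # concatenate both directions

print(bilstm_outputs([1.0, 2.0, 3.0]))
# [(1.0, 2.75), (2.5, 3.5), (4.25, 3.0)]
```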
-
FIG. 1 schematically illustrates an example baseline neural network which may be utilized for performing sequence labeling tasks, e.g., POS tagging. As schematically illustrated by FIG. 1 , the baseline neural network includes the feature extraction layer 110, the BiLSTM layer 130, and the prediction layer 140. - The feature extraction layer 110 is employed for producing feature vectors representing the input tokens 120A-120N, which are sequentially fed to the feature extraction layer 110. In certain implementations, each feature vector may be represented by a corresponding embedding, i.e., a vector of real numbers which may be produced, e.g., by a neural network implementing a mathematical transformation from a space with one dimension per word to a continuous vector space with a much lower dimension. In an illustrative example, the feature extraction layer 110 may utilize a predetermined set of embeddings, which may be pre-built on a large corpus of natural language texts. Accordingly, word embeddings carry the semantic information and at least some morphological information about the words, such that the words which are utilized in similar contexts, as well as synonyms, would be assigned feature vectors which are located close to each other in the feature space. - The BiLSTM layer 130 processes the feature vectors produced by the feature extraction layer 110 and yields a set of vectors, such that each vector encodes information about a corresponding input token and its context. The prediction layer 140, which may be implemented as a feed-forward network, processes the set of vectors produced by the BiLSTM layer 130 and for each vector yields a tag of a predetermined set of tags 150A-150N (e.g., a tag indicative of a POS of the corresponding input token). - The network training may involve processing, by the neural network, a training dataset that may include one or more input sequences with classification tags assigned to each token (e.g., a corpus of natural language texts with parts of speech assigned to each word). A value of a loss function may be computed based on the observed output of the neural network (i.e., the tag produced by the neural network for a given token) and the desired output specified by the training dataset for the same token. The error reflected by the loss function may be propagated back to the previous layers of the neural network, in which the edge weights and/or other network parameters may be adjusted accordingly in order to minimize the loss function. This process may be repeated until the value of the loss function stabilizes in the vicinity of a predetermined value or falls below a predetermined threshold.
- Accordingly, the baseline neural network 100 may be enhanced by adding two secondary outputs which would, in addition to the tag yielded by the prediction layer 140 for the current token, yield a tag associated with the previous token and a tag associated with the next token, as schematically illustrated by FIG. 2 . The three tags may be utilized at the network training stage for computing the loss function, thus forcing the network to recognize the relationships between neighboring tokens. - Therefore, the neural networks and training methods described herein represent significant improvements over various common systems and methods. In particular, employing loss functions that are specifically aimed at training the neural network to recognize neighbor token dependencies (e.g., relationships between the neighboring tokens of the input sequences) yields significant improvement of the overall quality and efficiency of sequence labeling methods. Various aspects of the above referenced methods and systems are described in detail herein below by way of examples, rather than by way of limitation.
-
FIG. 2 schematically illustrates an example neural network operating in accordance with one or more aspects of the present disclosure. The example neural network 200 may be utilized for performing sequence labeling tasks, e.g., POS tagging. As schematically illustrated by FIG. 2 , the example neural network 200 includes the feature extraction layer 210, the BiLSTM layer 220, and the prediction layer 230. - As noted herein above, the baseline neural network 100 may process word embeddings, which are built in such a way that the words which are utilized in similar contexts, as well as synonyms, would be assigned feature vectors which are located close to each other in the feature space. However, a word embedding matrix, in which every dictionary word is mapped to a vector in the feature space, while having an enormous size, would still be unable to produce an embedding corresponding to a word which is not found in the dictionary. Furthermore, the relatively large size of a word embedding vector is explained by the fact that the vector carries the semantic information about the initial word, while such information may not always be useful for the labeling task (e.g., POS tagging) to be performed by the neural network. Accordingly, the neural networks implemented in accordance with one or more aspects of the present disclosure are designed to process inputs which, in addition to word-level embeddings, may include character-level embeddings and grammeme-level embeddings. - The character-level embeddings do not rely on a dictionary, but rather view each input token as a sequence of characters. A vector may be assigned to a given input token, e.g., by processing an input sequence of tokens (e.g., a natural language text represented by a sequence of words) by a neural network (such as an LSTM network and/or a fully-connected network). In certain implementations, the input tokens may be truncated to a predetermined size (e.g., 12 characters). Character-level embeddings carry grammatical and/or morphological information about the input tokens.
- The grammeme-level embedding of a given word may be produced by a neural network that, for each input word, would construct a vector each element of which is related to a specific grammatical attribute of the word (e.g., reflects a probability of the input word to be associated with the specific grammatical attribute). The neural network may apply an additional dense layer to the intermediate representation of the word, such that the resulting vector produced by the neural network would represent not only individual grammatical attributes, but also certain interactions between them.
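- A grammeme-level vector of per-attribute probabilities may be illustrated as follows; the attribute list and the probability values are invented for the example, and a real implementation would produce them with a neural network rather than a lookup:

```python
# Hypothetical illustration of a grammeme-level representation: each
# element of the vector corresponds to one grammatical attribute and
# holds the probability that the word carries that attribute. The
# hand-assigned values below merely show the shape of the output.
ATTRIBUTES = ["noun", "verb", "plural", "past_tense"]

def grammeme_vector(word):
    probs = {
        "books": {"noun": 0.9, "verb": 0.1, "plural": 0.95, "past_tense": 0.0},
        "walked": {"noun": 0.0, "verb": 0.98, "plural": 0.0, "past_tense": 0.97},
    }.get(word, {})
    return [probs.get(attr, 0.0) for attr in ATTRIBUTES]

print(grammeme_vector("books"))   # [0.9, 0.1, 0.95, 0.0]
print(grammeme_vector("walked"))  # [0.0, 0.98, 0.0, 0.97]
```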
- In the illustrative example of
FIG. 2 , the neural network 200 is designed to process character-level embeddings and grammeme-level embeddings. The feature extraction layer 210 is employed for producing feature vectors representing the input tokens, which are sequentially fed to it. Thus, at any given moment in time, the state of the network reflects the “current” token 202 being processed by the network, as well as the “previous” token 204 which has already been processed by the network, and attempts to predict certain features of the “next” token 206 to be processed by the network. - The feature extraction layer 210 produces the grammeme embeddings 212 (e.g., by processing the input tokens by an LSTM network and/or a fully-connected network) and character-level embeddings 214 (e.g., by processing the input tokens by another LSTM network and/or a fully-connected network). The grammeme embeddings 212 are then fed to the dense layer 216, the output of which is concatenated with the character-level embeddings 214 and is fed to the dense layer 218. A dense layer performs a transformation in which every input is connected to every output by a linear transformation characterized by a weight value, which may be followed by a non-linear activation function (e.g., ReLU, Softmax, etc.).
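- The dense-layer transformation just described (every input connected to every output through a weight, plus a bias, optionally followed by a non-linear activation such as ReLU) may be sketched as:

```python
# Minimal sketch of a dense layer: each output unit computes a linear
# combination of all inputs plus a bias, then applies an activation.
# The weights and biases are hard-coded for the example; a trained
# layer would learn them.
def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation=relu):
    outputs = []
    for out_idx in range(len(biases)):
        # Linear combination of every input for this output unit.
        z = sum(w * x for w, x in zip(weights[out_idx], inputs)) + biases[out_idx]
        outputs.append(activation(z))
    return outputs

# Two inputs, two output units.
y = dense([1.0, -2.0],
          weights=[[0.5, 0.25], [1.0, 1.0]],
          biases=[0.1, 0.0])
print(y)  # [0.1, 0.0] -- the second unit's pre-activation (-1.0) is clipped by ReLU
```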
- Referring again to
FIG. 2 , the output of the dense layer 218 is fed to the backward LSTM 224 and forward LSTM 226 of the LSTM layer 220. The outputs of the LSTMs 224-226 are fed to the BiLSTM 228, the output of which in the main operational mode is processed by the main prediction pipeline of the prediction layer 230 (i.e., the dense layer 234 optionally followed by the conditional random field (CRF) 238) in order to produce the tag 246 associated with the current token 202. - Conversely, in the training mode, two auxiliary prediction pipelines of the prediction layer 230 may be utilized, such that the first auxiliary prediction pipeline, which includes its respective dense layers and is fed by the backward LSTM 224, produces the tag 242 associated with the previous token 204, while the second auxiliary prediction pipeline, which includes its respective dense layers and is fed by the forward LSTM 226, produces the tag 248 associated with the next token 206. A loss function may then be computed which takes into account the differences between the respective predicted tags and the tags specified by the training dataset for the current, previous, and next tokens. Thus, the two auxiliary prediction pipelines of the prediction layer 230 are only utilized in the network training mode.
-
L=w 1 d(T prev ,T′ prev)+w 2 d(T cur ,T′ cur)+w 3 d(T next ,T′ next) - where L is the value of the loss function,
- d is the distance metric in the tag space,
- w1, w2, and w3 are the weight coefficients,
- Tprev is the tag produced by the neural network for the previous token,
- T′prev is the tag associated with the previous token by the training dataset,
- Tcur is the tag produced by the neural network for the current token,
- T′cur is the tag associated with the current token by the training dataset,
- Tnext is the tag produced by the neural network for the next token, and
- T′next is the tag associated with the next token by the training dataset.
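- The weighted-sum loss above may be rendered directly in code; the choice of the distance metric d (here, the squared Euclidean distance between tag vectors such as one-hot labels or class probabilities) and the weight values are assumptions of this sketch:

```python
# Hypothetical rendering of L = w1*d(Tprev, T'prev) + w2*d(Tcur, T'cur)
# + w3*d(Tnext, T'next). The distance metric d and the weights are
# illustrative assumptions, not values fixed by the disclosure.
def tag_distance(predicted, reference):
    # Squared Euclidean distance between two tag vectors.
    return sum((p - r) ** 2 for p, r in zip(predicted, reference))

def neighbor_aware_loss(pred_prev, ref_prev, pred_cur, ref_cur,
                        pred_next, ref_next, w1=0.25, w2=0.5, w3=0.25):
    return (w1 * tag_distance(pred_prev, ref_prev)
            + w2 * tag_distance(pred_cur, ref_cur)
            + w3 * tag_distance(pred_next, ref_next))

# Perfect predictions for the current and next tokens, wrong previous tag.
loss = neighbor_aware_loss(
    pred_prev=[0.0, 1.0], ref_prev=[1.0, 0.0],   # Tprev vs T'prev
    pred_cur=[1.0, 0.0], ref_cur=[1.0, 0.0],     # Tcur vs T'cur
    pred_next=[0.0, 1.0], ref_next=[0.0, 1.0],   # Tnext vs T'next
)
print(loss)  # 0.5 -- only the previous-token term contributes
```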
- The network training may involve processing, by the neural network, a training dataset that may include one or more input sequences with classification tags assigned to each token (e.g., a corpus of natural language texts with parts of speech assigned to each word). A value of a loss function may be computed based on the observed output of the neural network (i.e., the tag produced by the neural network for a given token) and the desired output specified by the training dataset for the same token. The error reflected by the loss function may be propagated back to the previous layers of the neural network, in which the edge weights and/or other network parameters may be adjusted accordingly in order to minimize the loss function. This process may be repeated until the value of the loss function stabilizes in the vicinity of a predetermined value or falls below a predetermined threshold.
- As noted herein above, utilizing the loss function based on the three tags would force the neural network to recognize neighbor token dependencies (e.g., relationships between the neighboring tokens of the input sequences) and would thus yield a significant improvement of the overall quality and efficiency of sequence labeling methods.
-
FIG. 3 depicts a flow diagram of an example method 300 of neural network training utilizing loss functions reflecting neighbor token dependencies, in accordance with one or more aspects of the present disclosure. Method 300 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., example computer system 500 of FIG. 5 ) executing the method. In certain implementations, method 300 may be performed by a single processing thread. Alternatively, method 300 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method. In an illustrative example, the processing threads implementing method 300 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 300 may be executed asynchronously with respect to each other. Therefore, while FIG. 3 and the associated description list the operations of method 300 in a certain order, various implementations of the method may perform at least some of the described operations in parallel and/or in arbitrary selected orders. - At
block 310, a computer system implementing the method may receive a training dataset comprising a plurality of labeled tokens (e.g., a natural language text in which each word is labeled by a tag identifying a grammatical attribute of the word, such as a POS associated with the word). - At block 320, the computer system may determine, by a neural network, the first tag associated with the current token processed by the neural network, the second tag associated with the previous token which has been processed by the neural network before processing the current token, and the third tag associated with the next token to be processed by the neural network after processing the current token. In an illustrative example, the tags may represent respective grammatical attributes (such as POS) associated with the tokens. The neural network may include a feature extraction layer, a bi-directional long-short term memory (BiLSTM) layer, and a prediction layer, such that the BiLSTM layer further includes a BiLSTM, a backward LSTM and a forward LSTM, and the outputs of the backward LSTM and the forward LSTM are fed to the BiLSTM, as described in more detail herein above.
- At
block 330, the computer system may compute, for the training dataset, a value of a loss function reflecting the differences between the respective computed tags and corresponding labels specified by the training dataset. In an illustrative example, the loss function may be represented by a weighted sum of the difference of the computed tag for the current token and the label associated with the current token by the training dataset, the difference of the computed tag for the previous token and the label associated with the previous token by the training dataset, and the difference of the computed tag for the next token and the label associated with the next token by the training dataset, as described in more detail herein above. - At block 340, the computer system may adjust, based on the computed value of the loss function, one or more parameters of the neural network which undergoes the training. In an illustrative example, the error reflected by the loss function value is back-propagated starting from the last layer of the neural network, and the weights and/or other network parameters are adjusted in order to minimize the loss function.
- The process described by blocks 320-340 may be repeated until the value of the loss function would stabilize in a vicinity of a certain value or fall below a predetermined threshold or fall below a predetermined threshold.
- At
block 350, the computer system may employ the trained neural network for performing a sequence labeling task, such as a natural language processing task (e.g., POS tagging) of one or more input natural language texts, and the method may terminate. -
FIG. 4 depicts a flow diagram of an example method 400 of neural-network-based sequence labeling, in accordance with one or more aspects of the present disclosure. Method 400 and/or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of the computer system (e.g., example computer system 500 of FIG. 5) executing the method. In certain implementations, method 400 may be performed by a single processing thread. Alternatively, method 400 may be performed by two or more processing threads, each thread executing one or more individual functions, routines, subroutines, or operations of the method. In an illustrative example, the processing threads implementing method 400 may be synchronized (e.g., using semaphores, critical sections, and/or other thread synchronization mechanisms). Alternatively, the processing threads implementing method 400 may be executed asynchronously with respect to each other. Therefore, while FIG. 4 and the associated description list the operations of method 400 in a certain order, various implementations of the method may perform at least some of the described operations in parallel and/or in arbitrarily selected orders. - At
block 410, a computer system implementing the method may receive an input dataset comprising a plurality of tokens (e.g., a natural language text comprising a plurality of words). - At
block 420, the computer system may employ a neural network (e.g., a neural network having the architecture of the neural network 200 of FIG. 2) to compute the feature vectors representing the respective tokens. Each feature vector may be represented by a combination of a word embedding, a character-level embedding, and/or a grammeme embedding representing the corresponding token, as described in more detail herein above. - At
block 430, the computer system may process the feature vectors produced by the feature extraction layer and yield a set of vectors, such that each vector encodes information about a corresponding input token and its context, as described in more detail herein above. - At
block 440, the computer system may process the set of information encoding vectors and for each vector may yield a tag of a predetermined set of tags (e.g., a tag indicative of a grammatical attribute of the corresponding input token). Upon completing the operations of method 400, the method may terminate. -
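Taken together, blocks 420-440 can be sketched as a toy pipeline; the pure-Python stand-ins below for the embedding, BiLSTM, and prediction layers, and all weights, dimensions, and tag names, are illustrative assumptions, not the patent's actual layers or parameters:

```python
def feature_vector(word_emb, char_emb, grammeme_emb):
    """Block 420: combine the per-token embeddings (here, by concatenation)."""
    return word_emb + char_emb + grammeme_emb

def bidirectional_encode(features):
    """Block 430: toy stand-in for the BiLSTM layer -- each output vector
    pairs a left-context summary (running sum up to the token) with a
    right-context summary (running sum from the token to the end), so every
    position encodes its token and both contexts."""
    fwd, s = [], 0.0
    for v in features:
        s += sum(v)
        fwd.append(s)
    bwd, s = [0.0] * len(features), 0.0
    for i in range(len(features) - 1, -1, -1):
        s += sum(features[i])
        bwd[i] = s
    return [[f, b] for f, b in zip(fwd, bwd)]

def predict_tag(vec, tag_weights):
    """Block 440: score each candidate tag with a linear head, return the argmax."""
    scores = {tag: sum(w * x for w, x in zip(ws, vec))
              for tag, ws in tag_weights.items()}
    return max(scores, key=scores.get)

# Illustrative 1-dim embeddings for three tokens and a 2-dim, 2-tag head
features = [feature_vector([0.1], [0.2], [0.0]),
            feature_vector([0.5], [0.1], [0.3]),
            feature_vector([0.2], [0.0], [0.1])]
tag_weights = {"NOUN": [1.0, -0.5], "VERB": [-0.5, 1.1]}
tags = [predict_tag(v, tag_weights) for v in bidirectional_encode(features)]
```

In the described architecture the encoder would be an actual BiLSTM and the prediction layer would emit one tag per token over the full tag set; the sketch only shows how the three stages compose.
-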
FIG. 5 depicts a component diagram of an example computer system which may be employed for implementing the methods described herein. The computer system 500 may be connected to other computer systems in a LAN, an intranet, an extranet, or the Internet. The computer system 500 may operate in the capacity of a server or a client computer system in a client-server network environment, or as a peer computer system in a peer-to-peer (or distributed) network environment. The computer system 500 may be provided by a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, or any computer system capable of executing a set of instructions (sequential or otherwise) that specify operations to be performed by that computer system. Further, while only a single computer system is illustrated, the term “computer system” shall also be taken to include any collection of computer systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods described herein. -
Exemplary computer system 500 includes a processor 502, a main memory 504 (e.g., read-only memory (ROM) or dynamic random access memory (DRAM)), and a data storage device 518, which communicate with each other via a bus 530. -
Processor 502 may be represented by one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processor 502 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. Processor 502 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. Processor 502 is configured to execute instructions 526 for performing the methods described herein. -
Computer system 500 may further include a network interface device 522, a video display unit 510, a character input device 512 (e.g., a keyboard), and a touchscreen input device 514. -
Data storage device 518 may include a computer-readable storage medium 524 on which is stored one or more sets of instructions 526 embodying any one or more of the methods or functions described herein. Instructions 526 may also reside, completely or at least partially, within main memory 504 and/or within processor 502 during execution thereof by computer system 500, main memory 504 and processor 502 also constituting computer-readable storage media. Instructions 526 may further be transmitted or received over network 516 via network interface device 522. - In an illustrative example,
instructions 526 may include instructions of method 300 of neural network training utilizing loss functions reflecting neighbor token dependencies, implemented in accordance with one or more aspects of the present disclosure. In another illustrative example, instructions 526 may include instructions of method 400 of neural-network-based sequence labeling, implemented in accordance with one or more aspects of the present disclosure. While computer-readable storage medium 524 is shown in the example of FIG. 5 to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methods of the present disclosure. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media. - The methods, components, and features described herein may be implemented by discrete hardware components or may be integrated in the functionality of other hardware components such as ASICs, FPGAs, DSPs or similar devices. In addition, the methods, components, and features may be implemented by firmware modules or functional circuitry within hardware devices. Further, the methods, components, and features may be implemented in any combination of hardware devices and software components, or only in software.
- In the foregoing description, numerous details are set forth. It will be apparent, however, to one of ordinary skill in the art having the benefit of this disclosure, that the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.
- Some portions of the detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, graphemes, characters, terms, numbers, or the like.
- It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “determining”, “computing”, “calculating”, “obtaining”, “identifying,” “modifying” or the like, refer to the actions and processes of a computer system, or similar electronic computer system, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
- The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions.
- It is to be understood that the above description is intended to be illustrative, and not restrictive. Various other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/209,337 US20230325673A1 (en) | 2018-12-25 | 2023-06-13 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2018146352 | 2018-12-25 | ||
RU2018146352A RU2721190C1 (en) | 2018-12-25 | 2018-12-25 | Training neural networks using loss functions reflecting relationships between neighbouring tokens |
US16/236,382 US11715008B2 (en) | 2018-12-25 | 2018-12-29 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
US18/209,337 US20230325673A1 (en) | 2018-12-25 | 2023-06-13 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/236,382 Continuation US11715008B2 (en) | 2018-12-25 | 2018-12-29 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230325673A1 true US20230325673A1 (en) | 2023-10-12 |
Family
ID=70735124
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/236,382 Active 2042-05-16 US11715008B2 (en) | 2018-12-25 | 2018-12-29 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
US18/209,337 Pending US20230325673A1 (en) | 2018-12-25 | 2023-06-13 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/236,382 Active 2042-05-16 US11715008B2 (en) | 2018-12-25 | 2018-12-29 | Neural network training utilizing loss functions reflecting neighbor token dependencies |
Country Status (2)
Country | Link |
---|---|
US (2) | US11715008B2 (en) |
RU (1) | RU2721190C1 (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7116309B2 (en) * | 2018-10-10 | 2022-08-10 | 富士通株式会社 | Context information generation method, context information generation device and context information generation program |
CN111967268B (en) * | 2020-06-30 | 2024-03-19 | 北京百度网讯科技有限公司 | Event extraction method and device in text, electronic equipment and storage medium |
CN111916143B (en) * | 2020-07-27 | 2023-07-28 | 西安电子科技大学 | Molecular activity prediction method based on multi-substructural feature fusion |
WO2022141864A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Conversation intent recognition model training method, apparatus, computer device, and medium |
CN117321602A (en) * | 2021-05-28 | 2023-12-29 | 谷歌有限责任公司 | Character level attention neural network |
CN113298179B (en) * | 2021-06-15 | 2024-05-28 | 南京大学 | Customs commodity abnormal price detection method and device |
CN113408300B (en) * | 2021-07-09 | 2024-02-20 | 北京百度网讯科技有限公司 | Model training method, brand word recognition device and electronic equipment |
CN116384237A (en) * | 2023-03-29 | 2023-07-04 | 大连海事大学 | Thermal infrared atmospheric parameter inversion method and device and electronic equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180121788A1 (en) * | 2016-11-03 | 2018-05-03 | Salesforce.Com, Inc. | Deep Neural Network Model for Processing Data Through Mutliple Linguistic Task Hiearchies |
US20180225281A1 (en) * | 2017-02-06 | 2018-08-09 | Thomson Reuters Global Resources Unlimited Company | Systems and Methods for Automatic Semantic Token Tagging |
US20180260379A1 (en) * | 2017-03-09 | 2018-09-13 | Samsung Electronics Co., Ltd. | Electronic apparatus for compressing language model, electronic apparatus for providing recommendation word and operation methods thereof |
US20180268023A1 (en) * | 2017-03-16 | 2018-09-20 | Massachusetts lnstitute of Technology | System and Method for Semantic Mapping of Natural Language Input to Database Entries via Convolutional Neural Networks |
US20190073351A1 (en) * | 2016-03-18 | 2019-03-07 | Gogle Llc | Generating dependency parses of text segments using neural networks |
US20190130218A1 (en) * | 2017-11-01 | 2019-05-02 | Salesforce.Com, Inc. | Training a neural network using augmented training datasets |
US20190228073A1 (en) * | 2018-01-23 | 2019-07-25 | Wipro Limited | Method and system for identifying places of interest in a natural language input |
US20190228099A1 (en) * | 2018-01-21 | 2019-07-25 | Microsoft Technology Licensing, Llc. | Question and answer pair generation using machine learning |
US20200285951A1 (en) * | 2019-03-07 | 2020-09-10 | Adobe Inc. | Figure captioning system and related methods |
US11321538B1 (en) * | 2021-10-15 | 2022-05-03 | Dovel Technologies, Llc | Ensemble natural language processing model with compliance verification |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5146405A (en) | 1988-02-05 | 1992-09-08 | At&T Bell Laboratories | Methods for part-of-speech determination and usage |
US11062206B2 (en) * | 2015-11-12 | 2021-07-13 | Deepmind Technologies Limited | Training neural networks using normalized target outputs |
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
US10019438B2 (en) | 2016-03-18 | 2018-07-10 | International Business Machines Corporation | External word embedding neural network language models |
US20170308790A1 (en) * | 2016-04-21 | 2017-10-26 | International Business Machines Corporation | Text classification by ranking with convolutional neural networks |
RU2665273C2 (en) * | 2016-06-03 | 2018-08-28 | Автономная некоммерческая образовательная организация высшего образования "Сколковский институт науки и технологий" | Trained visual markers and the method of their production |
RU2630427C2 (en) * | 2016-08-12 | 2017-09-07 | Дмитрий Владимирович Мительков | Method and system of semantic processing text documents |
RU2641447C1 (en) * | 2016-12-27 | 2018-01-17 | Общество с ограниченной ответственностью "ВижнЛабс" | Method of training deep neural networks based on distributions of pairwise similarity measures |
CN106845530B (en) * | 2016-12-30 | 2018-09-11 | 百度在线网络技术(北京)有限公司 | character detection method and device |
-
2018
- 2018-12-25 RU RU2018146352A patent/RU2721190C1/en active
- 2018-12-29 US US16/236,382 patent/US11715008B2/en active Active
-
2023
- 2023-06-13 US US18/209,337 patent/US20230325673A1/en active Pending
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190073351A1 (en) * | 2016-03-18 | 2019-03-07 | Gogle Llc | Generating dependency parses of text segments using neural networks |
US20180121788A1 (en) * | 2016-11-03 | 2018-05-03 | Salesforce.Com, Inc. | Deep Neural Network Model for Processing Data Through Mutliple Linguistic Task Hiearchies |
US20180121799A1 (en) * | 2016-11-03 | 2018-05-03 | Salesforce.Com, Inc. | Training a Joint Many-Task Neural Network Model using Successive Regularization |
US20210279551A1 (en) * | 2016-11-03 | 2021-09-09 | Salesforce.Com, Inc. | Training a joint many-task neural network model using successive regularization |
US20180225281A1 (en) * | 2017-02-06 | 2018-08-09 | Thomson Reuters Global Resources Unlimited Company | Systems and Methods for Automatic Semantic Token Tagging |
US20180260379A1 (en) * | 2017-03-09 | 2018-09-13 | Samsung Electronics Co., Ltd. | Electronic apparatus for compressing language model, electronic apparatus for providing recommendation word and operation methods thereof |
US20180268023A1 (en) * | 2017-03-16 | 2018-09-20 | Massachusetts lnstitute of Technology | System and Method for Semantic Mapping of Natural Language Input to Database Entries via Convolutional Neural Networks |
US20190130218A1 (en) * | 2017-11-01 | 2019-05-02 | Salesforce.Com, Inc. | Training a neural network using augmented training datasets |
US20190228099A1 (en) * | 2018-01-21 | 2019-07-25 | Microsoft Technology Licensing, Llc. | Question and answer pair generation using machine learning |
US20190228073A1 (en) * | 2018-01-23 | 2019-07-25 | Wipro Limited | Method and system for identifying places of interest in a natural language input |
US20200285951A1 (en) * | 2019-03-07 | 2020-09-10 | Adobe Inc. | Figure captioning system and related methods |
US11321538B1 (en) * | 2021-10-15 | 2022-05-03 | Dovel Technologies, Llc | Ensemble natural language processing model with compliance verification |
Also Published As
Publication number | Publication date |
---|---|
US20200202211A1 (en) | 2020-06-25 |
RU2721190C1 (en) | 2020-05-18 |
US11715008B2 (en) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230325673A1 (en) | Neural network training utilizing loss functions reflecting neighbor token dependencies | |
CN109753566B (en) | Model training method for cross-domain emotion analysis based on convolutional neural network | |
US11132512B2 (en) | Multi-perspective, multi-task neural network model for matching text to program code | |
Radford et al. | Improving language understanding by generative pre-training | |
CN108733792B (en) | Entity relation extraction method | |
Yao et al. | Bi-directional LSTM recurrent neural network for Chinese word segmentation | |
Xu et al. | Investigation on the Chinese text sentiment analysis based on convolutional neural networks in deep learning. | |
JP2020500366A (en) | Simultaneous multi-task neural network model for multiple natural language processing (NLP) tasks | |
CN110678881A (en) | Natural language processing using context-specific word vectors | |
Beysolow | Applied natural language processing with python | |
US20240013059A1 (en) | Extreme Language Model Compression with Optimal Sub-Words and Shared Projections | |
US11544457B2 (en) | Machine learning based abbreviation expansion | |
EP3850530A1 (en) | Minimization of computational demands in model agnostic cross-lingual transfer with neural task representations as weak supervision | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
Verma et al. | Semantic similarity between short paragraphs using Deep Learning | |
Yao | Attention-based BiLSTM neural networks for sentiment classification of short texts | |
Yang et al. | Text classification based on convolutional neural network and attention model | |
Seilsepour et al. | Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer | |
US20230367978A1 (en) | Cross-lingual apparatus and method | |
Yang et al. | Unitabe: Pretraining a unified tabular encoder for heterogeneous tabular data | |
WO2023091226A1 (en) | Language-model pretraining with gradient-disentangled embedding sharing | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
Kandi | Language Modelling for Handling Out-of-Vocabulary Words in Natural Language Processing | |
Saravani et al. | Persian language modeling using recurrent neural networks | |
Liang et al. | Named Entity Recognition Method Based on BERT-whitening and Dynamic Fusion Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: ABBYY DEVELOPMENT INC., NORTH CAROLINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ABBYY PRODUCTION LLC;REEL/FRAME:063951/0099 Effective date: 20211231 Owner name: ABBYY PRODUCTION LLC, RUSSIAN FEDERATION Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:INDENBOM, EUGENE;ANASTASIEV, DANIIL;SIGNING DATES FROM 20190114 TO 20190123;REEL/FRAME:063951/0090 |
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |