EP3848856A1 - Method and apparatus for generating network representation of neural network, storage medium, and device


Info

Publication number
EP3848856A1
Authority
EP
European Patent Office
Prior art keywords
sequence
locally strengthened
vector
representation
elements
Prior art date
Legal status
Pending
Application number
EP19857335.4A
Other languages
German (de)
French (fr)
Other versions
EP3848856A4 (en)
Inventor
Zhaopeng Tu
Baosong YANG
Tong Zhang
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Publication of EP3848856A1 publication Critical patent/EP3848856A1/en
Publication of EP3848856A4 publication Critical patent/EP3848856A4/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Definitions

  • the present disclosure relates to the field of computer technologies, and in particular to a network representation generating method and apparatus for a neural network, a storage medium, and a device.
  • An attention mechanism is a method of modeling the dependence between hidden states of an encoder and a decoder in a neural network.
  • the attention mechanism is widely applied to tasks of natural language processing (NLP) based on deep learning.
  • a self-attention network is a neural network model based on a self-attention mechanism, which is one type of attention model.
  • an attention weight can be calculated for each element pair in an input sequence, so that a long-distance dependence can be captured, and network representations corresponding to elements are not affected by distances between the elements.
  • all the elements in the input sequence are fully considered.
  • attention weights between each element and all the elements are required to be calculated, which disperses the distribution of the weights to some extent and in turn weakens the association between the elements.
  • a network representation generating method for a neural network is provided.
  • the method is applied to a computer device.
  • the method includes:
  • a network representation generating apparatus for a neural network.
  • the apparatus includes: an obtaining module, a linear transformation module, a logical similarity degree calculation module, a locally strengthened matrix construction module, an attention weight distribution determining module and a fusion module, where the obtaining module is configured to obtain a source-side vector representation sequence corresponding to an input sequence; the linear transformation module is configured to perform a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence; the logical similarity degree calculation module is configured to calculate a logical similarity degree between the request vector sequence and the key vector sequence; the locally strengthened matrix construction module is configured to construct a locally strengthened matrix according to the request vector sequence; the attention weight distribution determining module is configured to perform a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and the fusion module is configured to fuse value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • a computer-readable storage medium stores a computer program.
  • the computer program when executed by a processor, causes the processor to perform the steps of the network representation generating method for a neural network described above.
  • In another aspect, a computer device is provided, which includes a memory and a processor.
  • the memory stores a computer program.
  • the computer program when executed by the processor, causes the processor to perform the steps of the network representation generating method for a neural network described above.
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • In the obtained network representation sequence, not only is the local information strengthened, but associations between elements far away from each other in the input sequence are also retained.
  • FIG. 1 is a diagram showing an application environment of a network representation generating method for a neural network according to an embodiment.
  • the network representation generating method for a neural network is applied to a network representation generating system for a neural network.
  • the network representation generating system for a neural network includes a terminal 110 and a computer device 120.
  • the terminal 110 and the computer device 120 are connected to each other through Bluetooth, a universal serial bus (USB) or a network.
  • the terminal 110 may transmit a to-be-processed input sequence to the computer device 120 in real time or non-real time.
  • the computer device 120 is used to receive the input sequence, and perform transformation on the input sequence to output a corresponding network representation sequence.
  • the terminal 110 may be a desktop terminal or a mobile terminal.
  • the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, or the like.
  • the computer device 120 may be an independent server or terminal, or may be a server cluster formed by multiple servers, or may be a cloud server providing basic cloud computing services such as a cloud server service, a cloud database service, a cloud storage service, and a CDN service.
  • the computer device 120 may directly obtain the input sequence without the terminal 110.
  • the mobile phone may directly obtain the input sequence (for example, a sequence formed by words in an instant text message), perform transformation on the input sequence by using a network representation generating apparatus for a neural network configured on the mobile phone, and output a network representation sequence corresponding to the input sequence.
  • a network representation generating method for a neural network is provided according to an embodiment.
  • description is made mainly by using an example in which the method is applied to the computer device 120 in FIG. 1 .
  • the network representation generating method for a neural network may include the following steps S202 to S212.
  • the input sequence is a sequence to be transformed to obtain a corresponding network representation sequence.
  • the input sequence may be a word sequence corresponding to a to-be-translated text, and elements in the input sequence are respectively words in the word sequence.
  • the word sequence may be a sequence formed by performing word segmentation on the to-be-translated text and arraying obtained words according to a word order.
  • the word sequence is a sequence formed by arraying words according to a word order. For example, if the to-be-translated text is "Bush held a talk with Sharon", a corresponding input sequence X is ⁇ Bush, held, a, talk, with, Sharon ⁇ .
  • the source-side vector representation sequence is a sequence formed by source-side vector representations of the elements in the input sequence.
  • the vector representations in the source-side vector representation sequence are in a one-to-one correspondence with the elements in the input sequence.
  • the computer device may convert each element in the input sequence into a vector having a fixed length (i.e., perform word embedding).
  • the network representation generating method for a neural network is applied to a neural network model.
  • the computer device may convert each element in the input sequence into a corresponding vector through a first layer of the neural network model.
  • the computer device converts an i-th element x_i in the input sequence into a d-dimensional column vector, i.e., z_i.
  • the computer device combines vectors corresponding to the elements in the input sequence, to obtain the source-side vector representation sequence corresponding to the input sequence, that is, a vector sequence formed by I d-dimensional column vectors, where d is a positive integer.
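As a minimal illustration of this step (not part of the disclosure), the sketch below builds a source-side vector representation sequence Z by looking up each element of the example input sequence in a word-embedding table; the vocabulary, the dimension d = 8, and the random initialization are illustrative assumptions.

```python
import numpy as np

# Hypothetical vocabulary and embedding table; the disclosure only requires
# that each element be mapped to a fixed-length (d-dimensional) vector.
vocab = {"Bush": 0, "held": 1, "a": 2, "talk": 3, "with": 4, "Sharon": 5}
d = 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(vocab), d))

def source_side_representations(input_sequence):
    """Map each element x_i to a d-dimensional vector z_i and stack the
    vectors into an I x d matrix Z (rows here stand for the column
    vectors z_1, ..., z_I described above)."""
    return np.stack([embedding_table[vocab[w]] for w in input_sequence])

Z = source_side_representations(["Bush", "held", "a", "talk", "with", "Sharon"])
print(Z.shape)  # (6, 8): I = 6 elements, each a d = 8 dimensional vector
```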
  • the computer device may also receive a source-side vector representation sequence corresponding to an input sequence transmitted by another device.
  • z_i and the column vectors mentioned below may alternatively be row vectors. For ease of describing the calculation process, the description herein is given by means of column vectors.
  • a linear transformation is performed on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence.
  • the linear transformation may be used to map a vector in a vector space to another vector space.
  • the vector space is a set formed by multiple vectors having the same dimension.
  • the computer device may perform the linear transformation on the source-side vector representation sequence by three different learnable parameter matrices, so that the source-side vector representation sequence is mapped to three different vector spaces, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence.
  • the network representation generating method for a neural network is applied to a model based on a self-attention neural network (SAN).
  • each of the request vector sequence, the key vector sequence and the value vector sequence is obtained by performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence at a source side.
  • the network representation generating method for a neural network is applied to a neural network model including an Encoder-Decoder structure.
  • the key vector sequence and the value vector sequence are obtained by encoding the source-side vector representation sequence corresponding to the input sequence by an encoder. That is, the key vector sequence and the value vector sequence are outputs of the encoder.
  • the request vector sequence is an input of a decoder, for example, may be a target-side vector representation sequence, where the target-side vector representation sequence may be formed by vector representations corresponding to elements in an output sequence outputted by the decoder.
  • the computer device may perform the linear transformation on the source-side vector representation sequence Z by three different learnable parameter matrices W_Q, W_K, and W_V, to obtain a request vector sequence Q, a key vector sequence K and a value vector sequence V according to the following formulas: Q = Z · W_Q, K = Z · W_K, V = Z · W_V.
  • the learnable parameter matrices W_Q, W_K, and W_V each are a d × d matrix.
  • the request vector sequence Q, the key vector sequence K and the value vector sequence V each are formed by I d-dimensional vectors.
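A minimal numpy sketch of this linear transformation, under the assumption that Z is stored as an I × d matrix (one vector per row) and that the learnable matrices are random stand-ins for trained parameters:

```python
import numpy as np

I, d = 6, 8                       # sequence length and vector dimension
rng = np.random.default_rng(0)
Z = rng.standard_normal((I, d))   # source-side vector representation sequence

# Three different learnable d x d parameter matrices (random stand-ins).
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
W_V = rng.standard_normal((d, d))

# Map Z into three different vector spaces.
Q = Z @ W_Q   # request (query) vector sequence, I x d
K = Z @ W_K   # key vector sequence, I x d
V = Z @ W_V   # value vector sequence, I x d
```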
  • the logical similarity degree is used for measuring a similarity between one element in the input sequence and another element in the input sequence.
  • a corresponding attention weight may be assigned, based on the similarity, to a value vector corresponding to the another element in the input sequence.
  • the network representation corresponding to the element is obtained in the case of taking the association between the element and the another element into consideration, so that the generated network representation can more accurately present features of the element and contain more abundant information.
  • the network representation generating method for a neural network is applied to a neural network model including an Encoder-Decoder structure.
  • the request vector sequence is a target-side vector representation sequence
  • the calculated logical similarity degree is used for indicating a similarity between the target-side vector representation sequence and the key vector sequence corresponding to the input sequence.
  • a corresponding attention weight is assigned, based on the similarity, to the value vector sequence corresponding to the input sequence, so that the network representation of each element outputted by the source side is obtained in the case of taking the effect of the target-side vector representation sequence inputted by a target side into consideration.
  • the logical similarity degree matrix E may be calculated as E = Q · K^T / √d, where K^T represents a transposed matrix of the key vector sequence K.
  • d denotes a dimension of a source-side vector representation z_i into which each element x_i in the input sequence is converted, and also denotes a dimension of the network representation corresponding to x_i, that is, a dimension of a network hidden state vector.
  • an i-th column vector e_i of E implies an association between two elements in each of I element pairs formed by the i-th element x_i and all elements x_1, x_2, ..., x_j, ..., x_I in the input sequence.
  • the logical similarity degree matrix E is an I × I matrix.
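Under the same I × d layout, the scaled dot product just described can be sketched as follows; the 1/√d scaling matches the dimension d defined above.

```python
import numpy as np

def logical_similarity(Q, K):
    """E = Q K^T / sqrt(d): E[i, j] is the logical similarity degree
    between the i-th request vector and the j-th key vector, so E is an
    I x I matrix (the i-th row here plays the role of the text's i-th
    column vector e_i)."""
    d = Q.shape[-1]
    return (Q @ K.T) / np.sqrt(d)
```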
  • a locally strengthened matrix is constructed according to the request vector sequence.
  • Each element of a column vector in the locally strengthened matrix represents an association degree between two elements in the input sequence.
  • the effect of another element in the input sequence having a large association degree with the current element on the network representation may be strengthened by the locally strengthened matrix, and the effect of an element having a small association degree with the current element on the network representation is relatively weakened.
  • a considered scope is limited to local elements rather than all the elements in the input sequence when considering the effect of another element on the network representation of the current element. In this way, in the attention weight assignment, the attention weights are biased to be assigned in the local elements.
  • a magnitude of the attention weight assigned to a value vector corresponding to an element among the local elements is related to an association degree between the element and the current element. That is, a large attention weight is assigned to a value vector corresponding to an element having a large association degree with the current element.
  • the attention weights may be assigned in a locally strengthened range.
  • in the process of generating the network representation of the element "Bush", a relatively high attention weight is assigned to a value vector corresponding to the element "held". Similar to "held", "a talk" among the local elements that falls within the locally strengthened range corresponding to the element "Bush" is also noted and is assigned a relatively high attention weight.
  • the computer device is required to determine a locally strengthened range corresponding to the current element, so that the assignment of the attention weights corresponding to the current element is limited in the locally strengthened range.
  • the locally strengthened range may be determined according to two variables including a center point of the locally strengthened range and a window size of the locally strengthened range.
  • the center point refers to a position of an element assigned with the highest attention weight in the process of generating of the network representation of the current element in the input sequence.
  • the window size refers to a length of the locally strengthened range, which determines how many elements are centralizedly assigned with the attention weights.
  • the locally strengthened range is defined by elements falling in a range with the center point as a center and with the window size as a span. Since the locally strengthened range corresponding to each element is related to the element itself rather than being fixed to a specific range, abundant context information may be flexibly captured by the generated network representation of the element.
  • the computer device may determine the locally strengthened range corresponding to each element according to the center point and the window size.
  • the process may be performed by: determining the center point as a mean of a Gaussian distribution and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance.
  • the computer device may calculate an association degree between two elements based on the determined locally strengthened range, to obtain the locally strengthened matrix.
  • the association degree may be calculated according to the following formula (2): G_ij = -2(j - P_i)² / D_i², where G_ij represents an association degree between a j-th element and a center point P_i corresponding to an i-th element in the input sequence, and G_ij is a value of a j-th element of an i-th column vector in a locally strengthened matrix G.
  • P_i represents a center point of a locally strengthened range corresponding to the i-th element.
  • D_i represents a window size of the locally strengthened range corresponding to the i-th element.
  • the locally strengthened matrix G is an I ⁇ I matrix, including I column vectors, where a dimension of each column vector is I .
  • a value of each element in the i-th column vector of the locally strengthened matrix G is determined based on the locally strengthened range corresponding to the i-th element in the input sequence.
  • the formula (2) is a function that is symmetric about the center point P_i.
  • the numerator in the formula represents a distance between the j-th element and the center point P_i corresponding to the i-th element in the input sequence. A close distance corresponds to a large G_ij, indicating a large association degree between the j-th element and the i-th element.
  • a far distance corresponds to a small G_ij, indicating a small association degree between the j-th element and the i-th element. That is, in the process of generating a network representation corresponding to the i-th element, the attention weights are centralizedly assigned among elements close to the center point P_i.
  • calculating G_ij according to the formula (2) modified based on the Gaussian distribution is merely an example.
  • the center point may be used as a mean
  • the window size may be used as a variance
  • a value of G_ij is calculated through another distribution having the mean and the variance, such as a Poisson distribution or a binomial distribution, to obtain the locally strengthened matrix G.
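A sketch of formula (2), assuming the per-element center points P and window sizes D have already been predicted (their computation is described below); replacing the last line with a Poisson- or binomial-shaped score would give the variants just mentioned.

```python
import numpy as np

def locally_strengthened_matrix(P, D):
    """Formula (2): G_ij = -2 * (j - P_i)^2 / D_i^2, the association
    degree between the j-th element and the center point P[i] of the
    i-th element's locally strengthened range. Rows index i and columns
    index j in this sketch; positions are 1-based to match the text."""
    I = P.shape[0]
    j = np.arange(1, I + 1)[None, :]                       # element positions
    return -2.0 * (j - P[:, None]) ** 2 / (D[:, None] ** 2)
```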
  • a nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements.
  • the logical similarity degree indicates a similarity between two elements in each element pair in the input sequence
  • the locally strengthened matrix indicates an association between the two elements in each element pair in the input sequence.
  • the locally strengthened attention weight distribution may be calculated by a combination of the logical similarity degree and the locally strengthened matrix.
  • the performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to the elements may include: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • the computer device may correct the logical similarity degree through the association degree, to obtain the locally strengthened logical similarity degree.
  • the logical similarity degree matrix E including logical similarity degrees respectively corresponding to all element pairs may be added to the locally strengthened matrix G including association degrees respectively corresponding to all the element pairs, to correct (which is also referred to as offset) the logical similarity degree matrix, and normalization is performed on logical similarity degree vectors in the corrected logical similarity degree matrix, to obtain the locally strengthened attention weight distribution.
  • the normalization on the logical similarity degree vectors in the corrected logical similarity degree matrix is performed in a unit of a column vector e_i′. That is, a value of each element in the column vector e_i′ is in a range of (0, 1), and a sum of all elements in the column vector e_i′ is 1.
  • a maximum value in the column vector can be highlighted, and other components far lower than the maximum value can be suppressed, and thus the locally strengthened attention weight distribution corresponding to the i-th element in the input sequence can be obtained.
  • A = {α_1, α_2, α_3, ..., α_I}, where A includes I I-dimensional column vectors, and an i-th element α_i in A represents an attention weight distribution corresponding to an i-th element x_i in the input sequence.
  • value vectors in the value vector sequence are fused according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • the network representation sequence is a sequence formed by multiple network representations (vector representations).
  • the input sequence may be inputted to the neural network model, and the network representation sequence corresponding to the input sequence may be outputted through linear transformation or nonlinear transformation on a model parameter in a hidden layer of the neural network model.
  • the computer device obtains an attention weight distribution α_i corresponding to the element from the locally strengthened attention weight distribution matrix, and calculates a weighted sum of the value vectors in the value vector sequence with each element in the attention weight distribution α_i corresponding to the element as a weight coefficient, to obtain a network representation o_i corresponding to the current element x_i.
  • since the attention weight distribution corresponding to the current element is a locally strengthened attention weight distribution obtained by correcting the original logical similarity, the value vectors corresponding to all the elements in the input sequence are not considered equally in the weighted sum process; instead, value vectors corresponding to elements falling in the locally strengthened range are emphatically considered. In this way, the outputted network representation of the current element contains local information associated with the current element.
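The correction, normalization, and fusion steps can be sketched as follows; softmax stands in for the nonlinear normalization, and the row-wise orientation is an assumption of this sketch (the text's column vectors become rows).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_values(E, G, V):
    """Correct the logical similarity E with the locally strengthened
    matrix G, normalize each element's scores into an attention weight
    distribution, and fuse the value vectors by a weighted sum."""
    A = softmax(E + G)   # locally strengthened weights; each row sums to 1
    return A @ V         # network representations o_1, ..., o_I (I x d)
```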
  • the term "element” used in the present disclosure may be used for describing a basic component unit of a vector (including a column vector or a matrix vector) in this specification.
  • “elements in an input sequence” refer to inputs in the input sequence
  • “elements in a matrix” refer to column vectors that constitute the matrix
  • “elements in a column vector” refer to values in the column vector. That is, the “element” refers to a basic component unit that constitutes a sequence, a vector, or a matrix.
  • FIG. 3 is a schematic diagram showing a process of calculating a network representation sequence corresponding to an input sequence according to an embodiment.
  • Z is linearly transformed into a request vector sequence Q , a key vector sequence K and a value vector sequence V through three different learnable parameter matrices.
  • a logical similarity degree between each request-key vector pair is calculated through a dot product operation, to obtain a logical similarity degree matrix E.
  • a locally strengthened matrix G is constructed according to Q or K , and E is corrected by G , to obtain a locally strengthened logical similarity degree matrix E' .
  • normalization is performed on E' by using the softmax function, to obtain a locally strengthened attention weight distribution matrix A.
  • a dot product operation is performed on A and the value vector sequence V , to output a network representation sequence O .
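Putting the steps of FIG. 3 together, a compact end-to-end sketch (same assumptions as the snippets above: row-wise layout, random stand-in parameters, and P and D supplied by the prediction steps described below):

```python
import numpy as np

def network_representation_sequence(Z, W_Q, W_K, W_V, P, D):
    """FIG. 3 flow: Z -> (Q, K, V) -> E -> E' = E + G -> A = softmax(E')
    -> O = A V."""
    I, d = Z.shape
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V            # linear transformations
    E = (Q @ K.T) / np.sqrt(d)                     # logical similarity degrees
    j = np.arange(1, I + 1)[None, :]
    G = -2.0 * (j - P[:, None]) ** 2 / (D[:, None] ** 2)  # formula (2)
    E_corrected = E + G                            # locally strengthened E'
    A = np.exp(E_corrected - E_corrected.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # softmax normalization
    return A @ V                                   # network representation sequence O
```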
  • FIG. 4 is a diagram showing a system architecture in which an SAN attention weight distribution is corrected by a Gaussian distribution according to an embodiment.
  • the description is given below by taking the input sequence being "Bush held a talk with Sharon" and the current element being “Bush” as an example.
  • a basic model is constructed by an original SAN, to obtain a logical similarity degree between each pair of elements (formed by two elements in the input sequence), and an attention weight distribution corresponding to "Bush” is calculated based on the logical similarity degree, which considers all words.
  • the word "held" is assigned the highest attention weight (where a column height represents a magnitude of an attention weight), and the remaining words are assigned lower attention weights.
  • referring to the middle of FIG. 4, a position of a center point of a locally strengthened range corresponding to the current element "Bush" calculated by using the Gaussian distribution is approximately equal to 4, which corresponds to the word "talk" in the input sequence, and a window size of the locally strengthened range is approximately equal to 3. That is, the locally strengthened range corresponding to the current element "Bush" includes positions corresponding to three words centered on the word "talk".
  • a locally strengthened matrix is calculated based on the determined locally strengthened range, and the logical similarity degree obtained from the left side of FIG. 4 is corrected by using the locally strengthened matrix, so that the corrected attention weights are centralizedly assigned among the three words, and the word "talk" is assigned with the highest attention weight.
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • In the obtained network representation sequence, not only is the local information strengthened, but associations between elements far away from each other in the input sequence are also retained.
  • the process of constructing the locally strengthened matrix according to the request vector sequence may be implemented by performing the following steps S502 to S508.
  • a center point of a locally strengthened range corresponding to each of the elements is determined.
  • the locally strengthened range corresponding to each element in the input sequence is determined by the center point and the window size corresponding to the element.
  • the center point corresponding to the element depends on the request vector corresponding to the element. Therefore, the center point of the locally strengthened range corresponding to the element may be determined according to the request vector.
  • the process of determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each element may be implemented by performing the following steps including: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  • the computer device may determine, according to the request vector sequence obtained in step S204, a center point of a locally strengthened range corresponding to each element. Taking the i-th element x_i in the input sequence as an example, a center point of a locally strengthened range corresponding to the i-th element x_i may be obtained by performing the following steps 1) and 2).
  • the computer device maps a request vector q_i corresponding to the i-th element into a hidden state by a first feedforward neural network, and performs a linear transformation on the hidden state by U_P^T, to obtain a first scalar p_i corresponding to the i-th element in the input sequence.
  • the vector is mapped into the hidden state by the feedforward neural network.
  • the method for mapping the vector through the feedforward neural network is not limited thereto, and the feedforward neural network may be replaced with other neural network models, such as a long short-term memory (LSTM) model and variations thereof, a gated unit and variations thereof, or by performing simple linear transformation.
  • the computer device converts the first scalar p_i into a scalar whose value range is (0, 1) by a nonlinear transformation function, and multiplies the scalar by a length I of the input sequence, to obtain a center point position P_i whose value range is (0, I).
  • P_i is a center point of a locally strengthened range corresponding to the i-th element, and P_i is proportional to the length I of the input sequence.
  • the method of converting the scalar using sigmoid herein and in the following may be replaced with another method for mapping any real number into a range (0,1), which is not limited in the present disclosure.
  • the computer device determines the calculated P_i as the center point of the locally strengthened range corresponding to the i-th element x_i in the input sequence. For example, if the length I of the input sequence is equal to 10, and the calculated P_i is equal to 5, the center point of the locally strengthened range corresponding to x_i is a fifth element in the input sequence. In the process of generating a network representation corresponding to x_i, a value vector of the fifth element in the input sequence is assigned the highest attention weight.
  • the computer device may repeat the foregoing steps until center points of locally strengthened ranges respectively corresponding to all the elements are each obtained according to a corresponding request vector in the request vector sequence.
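A sketch of steps 1) and 2), assuming the first feedforward network is a single tanh layer with weight matrix W_P (the disclosure leaves its exact form open; an LSTM or gated unit could replace it, as noted above) and U_P is the linear-transformation vector:

```python
import numpy as np

def predict_center_point(q_i, W_P, U_P, I):
    """Step 1: map the request vector q_i (shape (d,)) into a hidden state
    and reduce it to the first scalar p_i; step 2: squash p_i into (0, 1)
    with sigmoid and scale by the sequence length I, giving a center
    point P_i in (0, I)."""
    hidden = np.tanh(W_P @ q_i)        # first feedforward network (assumed tanh)
    p_i = U_P @ hidden                 # first scalar
    return I / (1.0 + np.exp(-p_i))    # P_i = I * sigmoid(p_i)
```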
  • a window size of the locally strengthened range corresponding to the element is determined.
  • a corresponding window size may be predicted for each element.
  • the computer device may determine, according to each request vector in the request vector sequence, a window size of a locally strengthened range corresponding to each element. That is, each request vector corresponds to one window size.
  • the process of determining, according to the request vector sequence, a window size of a locally strengthened range corresponding to each element may be implemented by performing the following steps including: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • the computer device may determine, according to the request vector sequence obtained in step S204, a window size of a locally strengthened range corresponding to each element. Taking the i-th element x_i in the input sequence as an example, a window size of a locally strengthened range corresponding to the i-th element x_i may be obtained by performing the following steps 1) and 2).
  • the computer device maps a request vector q_i corresponding to the i-th element into a hidden state by a second feedforward neural network, and performs a linear transformation on the hidden state by U_D^T, to obtain a third scalar z_i corresponding to the i-th element in the input sequence.
  • the computer device converts the third scalar z_i into a scalar whose value range is (0, 1) by a nonlinear transformation function, and multiplies the scalar by the length I of the input sequence, to obtain a window size D_i whose value range is (0, I).
  • D_i is a window size of a locally strengthened range corresponding to the i-th element, and D_i is proportional to the length I of the input sequence.
  • the computer device determines the calculated D_i as the window size of the locally strengthened range corresponding to the i-th element x_i in the input sequence. For example, if the length I of the input sequence is equal to 10, and the calculated D_i is equal to 7, the window size of the locally strengthened range corresponding to x_i is seven elements centered on a center point. In the process of generating a network representation corresponding to x_i, attention weights are centralizedly assigned among the seven elements.
  • the computer device may repeat the foregoing steps until window sizes of locally strengthened ranges respectively corresponding to all the elements are each obtained according to a corresponding request vector in the request vector sequence.
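The window-size prediction mirrors the center-point sketch above with its own (assumed) parameters W_D and U_D:

```python
import numpy as np

def predict_window_size(q_i, W_D, U_D, I):
    """Third scalar z_i = U_D . tanh(W_D q_i), then D_i = I * sigmoid(z_i),
    a per-element window size in (0, I)."""
    z_i = U_D @ np.tanh(W_D @ q_i)
    return I / (1.0 + np.exp(-z_i))
```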
  • the locally strengthened range corresponding to the element is determined according to the center point and the window size.
  • In step S502 and step S504, since request vectors respectively corresponding to the elements in the input sequence are different from each other, center points and window sizes respectively corresponding to the elements are different from each other.
  • locally strengthened ranges respectively corresponding to the elements are different from each other.
  • the locally strengthened range is selected according to characteristics of each element itself, which is more flexible.
  • association degrees between every two of the elements are calculated based on the locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • the computer device may calculate the association degrees between every two of the elements based on the determined locally strengthened ranges, to obtain the locally strengthened matrix.
  • FIG. 6 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence according to an embodiment.
  • the request vector sequence is firstly mapped into a hidden state by a feedforward neural network.
  • the hidden state is mapped to a scalar in a real number space by a linear transformation.
  • the scalar is converted into a scalar whose value range is (0, 1) by a nonlinear transformation function sigmoid and is multiplied by the length I of the input sequence, to obtain a center point and a window size, so as to determine a locally strengthened range.
  • a locally strengthened matrix is calculated based on the locally strengthened range.
  • a corresponding locally strengthened range can be flexibly determined for each element, rather than fixing a locally strengthened range for the entire input sequence, so that the dependence between elements relatively far away from each other in the input sequence can be effectively captured.
  • the process of constructing a locally strengthened matrix according to the request vector sequence may be implemented by performing the following steps including: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • the process of determining, according to the request vector sequence, a locally strengthened range corresponding to each element is similar to that in the foregoing, which is not repeated herein, except that global context information is considered when determining the window size.
  • the window sizes of the locally strengthened ranges respectively corresponding to all the elements in the input sequence are determined by a uniform window size. In this case, the information of all the elements in the input sequence is required to be fused when determining the window size.
  • the process of determining a uniform window size of the locally strengthened ranges according to the key vector sequence may be implemented by performing the following steps including: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  • the computer device may determine the uniform window size of the locally strengthened ranges according to the key vector sequence obtained in step S204. That is, the window sizes of the locally strengthened ranges respectively corresponding to the elements are the same.
  • the uniform window size may be obtained by performing the following steps 1) to 3).
  • U_D^T is the same parameter matrix as that used in calculating the window size from the hidden state as previously described
  • W D is a trainable linear transformation matrix
  • the computer device converts the fifth scalar Z into a scalar whose value range is (0,1) by a nonlinear transformation function, and multiplies the scalar by the length I of the input sequence, to obtain a window size D whose value range is (0, I ) .
  • D is a uniform window size of locally strengthened ranges, and D is proportional to the length I of the input sequence.
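A sketch of steps 1) to 3), with the tanh hidden layer assumed by analogy with the per-element case; K is the I × d key vector sequence:

```python
import numpy as np

def predict_uniform_window(K, W_D, U_D):
    """Fuse the information of all elements by averaging the key vectors,
    reduce the average to the fifth scalar, and scale its sigmoid by the
    sequence length I, giving one window size D shared by all elements."""
    I = K.shape[0]
    k_bar = K.mean(axis=0)                  # average pooling over the keys
    z = U_D @ np.tanh(W_D @ k_bar)          # fifth scalar (tanh assumed)
    return I / (1.0 + np.exp(-z))           # uniform window size D in (0, I)
```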
  • the computer device may calculate the association degrees between every two of the elements based on the determined locally strengthened ranges, to obtain the locally strengthened matrix.
  • FIG. 7 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence and a key vector sequence according to an embodiment.
  • the request vector sequence is mapped into a hidden state by a feedforward neural network, and an average of the key vector sequence is calculated by average pooling.
  • the hidden state is mapped to a scalar in a real number space by a linear transformation, and the average is mapped to a scalar in the real number space by the linear transformation.
  • the obtained scalars each are converted into a scalar whose value range is (0, 1) by a nonlinear transformation function sigmoid, and the scalar is multiplied by the length I of the input sequence, to obtain a center point and a window size, so as to determine a locally strengthened range.
  • all the context information is considered when determining the uniform window size, so that abundant context information can be captured by the locally strengthened range corresponding to each element determined based on the uniform window size.
  • the process of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence may be implemented by performing the following steps including: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence.
  • the method further includes: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
  • a stacked multi-head neural network may be used for processing the source-side vector representation sequence corresponding to the input sequence.
  • the source-side vector representation sequence may be divided, to obtain multiple (also called multi-head) low-dimensional source-side vector representation subsequences.
  • the source-side vector representation sequence includes five elements, and each element is a 512-dimensional column vector.
  • the source-side vector representation sequence is divided into eight parts. That is, eight 5 × 64 source-side vector representation subsequences are obtained.
  • the eight source-side vector representation subsequences, as input vectors, are transformed respectively in different subspaces, to output eight 5 × 64 network representation subsequences.
  • the eight network representation subsequences are spliced and a linear transformation is performed, to output a 5 × 512 network representation sequence.
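Using the dimensions of this example (I = 5, d = 512, eight subspaces of 64 dimensions each), the divide-transform-splice flow can be sketched as follows; the per-subspace transformation is a placeholder where the full locally strengthened attention would run:

```python
import numpy as np

I, d, H = 5, 512, 8
rng = np.random.default_rng(0)
Z = rng.standard_normal((I, d))

# Divide the sequence into H = 8 low-dimensional (5 x 64) subsequences.
subsequences = np.split(Z, H, axis=-1)

# Each subspace transforms its subsequence with its own parameters; a
# random linear map stands in for the attention computation here.
W_h = [rng.standard_normal((d // H, d // H)) for _ in range(H)]
outputs = [sub @ W for sub, W in zip(subsequences, W_h)]   # eight 5 x 64 outputs

# Splice the subsequence outputs and apply a final linear transformation.
W_O = rng.standard_normal((d, d))
O = np.concatenate(outputs, axis=-1) @ W_O   # 5 x 512 network representation sequence
```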
  • the stacked multi-head neural network includes H subspaces.
  • the source-side vector representation subsequences are transformed respectively in the subspaces.
  • a locally strengthened matrix G h corresponding to the h-th subspace is constructed according to the request vector sequence Q h or the key vector sequence K h .
  • a center point P hi of a locally strengthened range corresponding to an i-th element is determined according to Q h
  • a window size D hi of the locally strengthened range corresponding to the i-th element is determined according to Q h or K h
  • G hi,hj is a value of a j-th element of an i-th column vector in the locally strengthened matrix G h
  • G hi,hj represents an association degree between a j-th element and the center point P hi corresponding to the i-th element in the input sequence expressed in the h-th subspace.
  • the method further includes: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  • the neural network may stack multiple layers of calculation. Whether in a one-layer neural network or in a stacked multi-head neural network, the calculation may be repeatedly performed for multiple times. In the calculation of each layer, an output of a previous layer is used as an input of a current layer, and the step of performing linear transformation, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence is repeatedly performed until an output of the current layer, i.e., a network representation sequence of the current layer, is obtained. Considering efficiency and performance, the number of times of repetitions may be 6, and network parameters of the neural network at a layer are different from those at another layer. It may be understood that, a process of repeating for 6 times is actually a process of updating a source-side vector representation sequence of an original input sequence for 6 times by the network parameters at each layer.
  • an output of a first layer is O_L1.
  • O_L1 is used as an input, and transformation is performed on O_L1 by network parameters of the second layer, to obtain an output O_L2 of the second layer, and so on, until the number of times of repetitions is reached, and the output obtained by the sixth repetition is used as a final output, that is, O_L6 is used as the network representation sequence corresponding to the input sequence.
  • FIG. 8 is a schematic structural diagram of a stacked multi-head self-attention neural network having multiple layers according to an embodiment.
  • for each layer, the inputs of the multiple subspaces are the same, namely an output of the previous layer.
  • the input is divided into multiple sub-inputs, the same transformation is performed on the sub-inputs by respective network parameters of multiple sub-spaces (also called multiple heads), to obtain outputs of the subspaces.
  • the outputs are spliced to obtain an output of a current layer.
  • the output of the current layer is used as an input of a next layer, and the process is repeated for multiple times.
  • An output of a last layer is used as a final output.
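A sketch of the stacking loop, with stand-in layers whose parameters differ from layer to layer; in the network each layer would be the full (multi-head) locally strengthened attention computation:

```python
import numpy as np

rng = np.random.default_rng(0)
I, d, num_layers = 6, 8, 6            # six repetitions, per the text

def make_layer():
    """One layer with its own parameters (stand-in transformation)."""
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda X: np.tanh(X @ W)

layers = [make_layer() for _ in range(num_layers)]

output = rng.standard_normal((I, d))  # source-side vector representation sequence
for layer in layers:                  # output of layer l is the input of layer l+1
    output = layer(output)            # O_L1, O_L2, ..., O_L6
# `output` (O_L6) is the final network representation sequence.
```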
  • the input sequence may be a to-be-translated text sequence
  • a network representation sequence that is outputted includes feature vectors corresponding to words in a translated text. Therefore, a translated sentence may be determined according to the outputted network representation sequence. According to the embodiments of the present disclosure, significant improvements in translation quality are achieved for longer phrases and longer sentences.
  • FIG. 9 is a schematic flowchart of a network representation generating method for a neural network according to an embodiment.
  • the method includes the following steps S902 to S914, S9161 to S9167, and S918 to S930.
  • the source-side vector representation sequence is divided into multiple source-side vector representation subsequences having a low dimension.
  • a transformation is performed on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element.
  • a nonlinear transformation is performed on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence.
  • the second scalar is determined as a center point of a locally strengthened range corresponding to the element.
  • a linear transformation is performed on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element.
  • a nonlinear transformation is performed on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence.
  • the fourth scalar is determined as a window size of the locally strengthened range corresponding to the element.
  • a nonlinear transformation is performed on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence.
  • the sixth scalar is determined as a uniform window size of locally strengthened ranges respectively corresponding to the elements.
  • the locally strengthened range corresponding to the element is determined according to the center point and the window size.
  • association degrees between every two of the elements are calculated based on the locally strengthened ranges, to obtain a locally strengthened matrix.
  • the logical similarity degree is corrected according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree.
  • value vectors in the value vector sequence are fused according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • in step S930, with the outputted network representation sequence as a new source-side vector representation sequence, the method returns to step S904 until a final network representation sequence is obtained.
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • In the obtained network representation sequence, not only is the local information strengthened, but associations between elements far away from each other in the input sequence are also retained.
  • steps in the flowchart in FIG. 9 are sequentially presented as indicated by arrows, but the steps are not necessarily sequentially performed in the order indicated by the arrows. Unless explicitly specified in the present disclosure, the steps are performed without any strict sequence limitation, and may be performed in another order. In addition, at least some of the steps in FIG. 9 may include multiple substeps or multiple stages. The substeps or the stages are not necessarily performed at the same time instant, but may be performed at different time instants. The substeps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with another step or at least some of substeps or stages of the another step.
  • a network representation generating apparatus 1000 for a neural network includes an obtaining module 1002, a linear transformation module 1004, a logical similarity degree calculation module 1006, a locally strengthened matrix construction module 1008, an attention weight distribution determining module 1010, and a fusion module 1012.
  • the obtaining module 1002 is configured to obtain a source-side vector representation sequence corresponding to an input sequence.
  • the linear transformation module 1004 is configured to perform linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence.
  • the logical similarity degree calculation module 1006 is configured to calculate a logical similarity degree between the request vector sequence and the key vector sequence.
  • the locally strengthened matrix construction module 1008 is configured to construct a locally strengthened matrix according to the request vector sequence.
  • the attention weight distribution determining module 1010 is configured to perform a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions corresponding to the elements.
  • the fusion module 1012 is configured to fuse value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • the locally strengthened matrix construction module 1008 is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • the locally strengthened matrix construction module 1008 is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • the locally strengthened matrix construction module 1008 is further configured to: perform, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; perform a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determine the second scalar as the center point of the locally strengthened range corresponding to the element.
  • the locally strengthened matrix construction module 1008 is further configured to: perform, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; perform a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determine the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • the locally strengthened matrix construction module 1008 is further configured to: obtain key vectors in the key vector sequence; calculate an average of the key vectors; perform a linear transformation on the average to obtain a fifth scalar; perform a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determine the sixth scalar as the uniform window size of the locally strengthened ranges.
  • the locally strengthened matrix construction module 1008 is further configured to: determine the center point as a mean of a Gaussian distribution, and determine the window size as a variance of the Gaussian distribution; determine the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance; and sequentially array the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix.
  • The association degrees between every two of the elements are calculated according to the formula

    $$G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}$$

  • $G_{ij}$ represents an association degree between the j-th element and the center point $P_i$ corresponding to the i-th element in the input sequence
  • $G_{ij}$ is the value of the j-th element of the i-th column vector in the locally strengthened matrix G
  • $P_i$ represents the center point of the locally strengthened range corresponding to the i-th element
  • $D_i$ represents the window size of the locally strengthened range corresponding to the i-th element.
  • the attention weight distribution determining module 1010 is further configured to: correct the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and perform normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • the linear transformation module 1004 is further configured to: divide the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and perform different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence.
  • the apparatus further includes a splicing module, configured to: splice network representation subsequences respectively corresponding to the source-side vector representation subsequences, and perform a linear transformation, to obtain a network representation sequence to be outputted, as illustrated in the sketch following this item.
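  • The following self-contained numpy sketch illustrates the two preceding items: the source-side vector representation sequence is divided into low-dimensional subsequences, attention is computed per subsequence with distinct parameter matrices, and the resulting network representation subsequences are spliced and linearly transformed. All dimensions, variable names, and the random parameter values are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

I_len, d, H = 6, 8, 2                    # sequence length, model width, number of heads
rng = np.random.default_rng(0)
Z = rng.normal(size=(I_len, d))          # source-side vector representation sequence

d_h = d // H                             # low dimension of each subsequence
subsequence_outputs = []
for h in range(H):
    Z_h = Z[:, h * d_h:(h + 1) * d_h]    # one source-side vector representation subsequence
    # each subsequence has its own learnable parameter matrices
    W_Q, W_K, W_V = (rng.normal(size=(d_h, d_h)) for _ in range(3))
    Q, K, V = Z_h @ W_Q, Z_h @ W_K, Z_h @ W_V
    E = Q @ K.T / np.sqrt(d_h)           # logical similarity degrees for this subsequence
    A = np.exp(E); A /= A.sum(-1, keepdims=True)   # attention weight distributions
    subsequence_outputs.append(A @ V)    # network representation subsequence

W_O = rng.normal(size=(d, d))            # final linear transformation after splicing
output = np.concatenate(subsequence_outputs, axis=-1) @ W_O
```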
  • the apparatus 1000 further includes: a loop module.
  • the loop module is configured to: after the network representation sequence corresponding to the input sequence is obtained, determine the network representation sequence as a new source-side vector representation sequence, and return to the step of performing a linear transformation on the source-side vector representation sequence to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence, so as to perform the operations again until a loop stop condition is met, and then output a final network representation sequence. A sketch of this loop follows.
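  • A minimal, self-contained sketch of this loop, assuming the loop stop condition is a fixed number of stacked layers (an assumption; the disclosure does not fix the condition) and omitting the locally strengthened matrix for brevity:

```python
import numpy as np

def self_attention_pass(Z, rng):
    """One pass: linear transformations, similarity, attention weights, fusion."""
    d = Z.shape[-1]
    W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = Z @ W_Q, Z @ W_K, Z @ W_V
    E = Q @ K.T / np.sqrt(d)                       # logical similarity degrees
    A = np.exp(E); A /= A.sum(-1, keepdims=True)   # attention weight distributions
    return A @ V                                   # network representation sequence

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))        # initial source-side vector representation sequence
num_layers = 6                     # assumed loop stop condition: a fixed layer count
for _ in range(num_layers):
    Z = self_attention_pass(Z, rng)   # the output becomes the new source-side sequence
final_network_representation_sequence = Z
```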
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • by performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • in the obtained network representation sequence, not only can the local information be strengthened, but an association between elements in the input sequence far away from each other can also be retained.
  • FIG. 11 is a diagram showing an internal structure of a computer device 120 according to an embodiment.
  • the computer device includes a processor, memories, and a network interface that are connected to each other via a system bus.
  • the memories include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium in the computer device stores an operating system, and may further store a computer program.
  • the computer program when executed by the processor, may cause the processor to implement the network representation generating method for a neural network.
  • the internal memory may also store a computer program.
  • the computer program when executed by the processor, may cause the processor to perform the network representation generating method for a neural network.
  • FIG. 11 is merely a block diagram of a partial structure related to the solution in the present disclosure, and does not constitute a limitation to the computer device to which the solution of the present disclosure is applied.
  • the computer device may include more or fewer components than those shown in the figure, or some components may be combined, or a different component deployment may be used.
  • the network representation generating apparatus 1000 for a neural network may be implemented in a form of a computer program.
  • the computer program may run on the computer device shown in FIG. 11 .
  • Program modules forming the network representation generating apparatus 1000 for a neural network, for example, the obtaining module 1002, the linear transformation module 1004, the logical similarity degree calculation module 1006, the locally strengthened matrix construction module 1008, the attention weight distribution determining module 1010, and the fusion module 1012 in FIG. 10, may be stored in the memories of the computer device.
  • the computer program formed by the program modules causes the processor to perform the steps in the network representation generating method for a neural network according to the embodiments of the present disclosure described in this specification.
  • the computer device shown in FIG. 11 may perform step S202 by the obtaining module 1002 in the network representation generating apparatus for a neural network shown in FIG. 10 .
  • the computer device may perform step S204 by the linear transformation module 1004.
  • the computer device may perform step S206 by the logical similarity degree calculation module 1006.
  • the computer device may perform step S208 by the locally strengthened matrix construction module 1008.
  • the computer device may perform step S210 by the attention weight distribution determining module 1010.
  • the computer device may perform step S212 by the fusion module 1012.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program.
  • the computer program when executed by the processor, causes the processor to perform the following steps of: obtaining a source-side vector representation sequence corresponding to an input sequence; performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence; calculating a logical similarity degree between the request vector sequence and the key vector sequence; constructing a locally strengthened matrix according to the request vector sequence; performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • the computer program when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating an association degree between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • the computer program when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • the computer program when executed by the processor to perform the step of determining, according to the request vector sequence, the center point of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  • the computer program when executed by the processor to perform the step of determining, according to the request vector sequence, the window size of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • the computer program when executed by the processor to perform the step of determining, according to the key vector sequence, the uniform window size of the locally strengthened ranges, causes the processor to perform the following steps of: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  • the computer program when executed by the processor to perform the step of determining the locally strengthened range corresponding to the element according to the center point and the window size, causes the processor to perform the following steps of: determining the center point as a mean of a Gaussian distribution, and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance.
  • the computer program when executed by the processor to perform the step of calculating the association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix, causes the processor to perform the following steps of: sequentially arraying the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix.
  • The association degrees are calculated according to the formula

    $$G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}$$

  • $G_{ij}$ represents an association degree between the j-th element and the center point $P_i$ corresponding to the i-th element in the input sequence
  • $G_{ij}$ is the value of the j-th element of the i-th column vector in the locally strengthened matrix G
  • $P_i$ represents the center point of the locally strengthened range corresponding to the i-th element
  • $D_i$ represents the window size of the locally strengthened range corresponding to the i-th element.
  • the computer program when executed by the processor to perform the step of performing the nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements, causes the processor to perform the following steps of: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • the computer program when executed by the processor to perform the step of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence, causes the processor to perform the following steps of: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence.
  • the computer program when executed by the processor, causes the processor to further perform the following steps of: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
  • the computer program when executed by the processor, causes the processor to further perform the following steps of: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence, so as to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • by performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • in the obtained network representation sequence, not only can the local information be strengthened, but an association between elements in the input sequence far away from each other can also be retained.
  • a computer-readable storage medium stores a computer program.
  • the computer program when executed by a processor, causes the processor to perform the following steps of: obtaining a source-side vector representation sequence corresponding to an input sequence; performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence; calculating a logical similarity degree between the request vector sequence and the key vector sequence; constructing a locally strengthened matrix according to the request vector sequence; performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • the computer program when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • the computer program when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • the computer program when executed by the processor to perform the step of determining, according to the request vector sequence, the center point of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  • the computer program when executed by the processor to perform the step of determining, according to the request vector sequence, the window size of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • the computer program when executed by the processor to perform the step of determining, according to the key vector sequence, the uniform window size of the locally strengthened ranges, causes the processor to perform the following steps of: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  • the computer program when executed by the processor to perform the step of determining the locally strengthened range corresponding to the element according to the center point and the window size, causes the processor to perform the following steps of: determining the center point as a mean of a Gaussian distribution, and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance.
  • the computer program when executed by the processor to perform the step of calculating the association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix, causes the processor to perform the following steps of: sequentially arraying the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix.
  • The association degrees are calculated according to the formula

    $$G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}$$

  • $G_{ij}$ represents an association degree between the j-th element and the center point $P_i$ corresponding to the i-th element in the input sequence
  • $G_{ij}$ is the value of the j-th element of the i-th column vector in the locally strengthened matrix G
  • $P_i$ represents the center point of the locally strengthened range corresponding to the i-th element
  • $D_i$ represents the window size of the locally strengthened range corresponding to the i-th element.
  • the computer program when executed by the processor to perform the step of performing the nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements, causes the processor to perform the following steps of: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • the computer program when executed by the processor to perform the step of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence, causes the processor to perform the following steps of: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence.
  • the computer program when executed by the processor, causes the processor to further perform the following steps of: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing linear transformation, to obtain a network representation sequence to be outputted.
  • the computer program when executed by the processor, causes the processor to further perform the following steps of: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence, so as to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  • the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information.
  • the logical similarity degree is obtained according to the request vector sequence and the key vector sequence.
  • the nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights.
  • a weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained.
  • in the obtained network representation sequence, not only can the local information be strengthened, but an association between elements in the input sequence far away from each other can also be retained.
  • the non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, or the like.
  • the volatile memory may include a random access memory (RAM) or an external cache.
  • the RAM may be implemented in various forms, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDRSDRAM), an enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), a rambus direct RAM (RDRAM), a direct rambus dynamic RAM (DRDRAM), and a rambus dynamic RAM (RDRAM).


Abstract

The present application relates to a method and apparatus for generating network representation of a neural network, a storage medium, and a device. The method comprises: obtaining a source vector representation sequence corresponding to an input sequence; performing linear conversion on the source vector representation sequence to separately obtain a request vector sequence, a key vector sequence, and a value vector sequence corresponding to the source vector representation sequence; calculating the logic similarity between the request vector sequence and the key vector sequence; constructing a local enhanced matrix according to the request vector sequence; performing non-linear conversion on the basis of the logic similarity and the local enhanced matrix to obtain a locally enhanced attention weight distribution corresponding to elements; and fusing value vectors in the value vector sequence according to the attention weight distribution to obtain a network representation sequence corresponding to the input sequence. According to the network representation sequence generated in the solution provided by the present application, not only local information can be enhanced, but also the connection between elements in the input sequence that are far away from each other can be preserved.

Description

  • This application claims priority to Chinese Patent Application No. 201811027795.X , entitled "METHOD AND APPARATUS FOR GENERATING NETWORK REPRESENTATION OF NEURAL NETWORK, STORAGE MEDIUM, AND DEVICE" and filed on September 4, 2018, which is incorporated herein by reference in its entirety.
  • FIELD OF THE TECHNOLOGY
  • The present disclosure relates to the field of computer technologies, and in particular to a network representation generating method and apparatus for a neural network, a storage medium, and a device.
  • BACKGROUND OF THE DISCLOSURE
  • An attention mechanism is a method of establishing a model for a dependence between hidden states of an encoder and a decoder in a neural network. The attention mechanism is widely applied to tasks of natural language processing (NLP) based on deep learning.
  • A self-attention network (SAN) is a neural network model based on a self-attention mechanism, which belongs to one type of attention models. By the SAN, an attention weight can be calculated for each element pair in an input sequence, so that a long-distance dependence can be captured, and network representations corresponding to elements are not affected by distances between the elements. However, in the SAN, all the elements in the input sequence are fully considered. In this case, attention weights between each element and all the elements are required to be calculated, which disperses the distribution of the weights to some extent, and further weakens the association between the elements.
  • SUMMARY
  • In view of this, it is desired to provide a network representation generating method and apparatus for a neural network, a storage medium, and a device, to resolve the existing technical problem that an association between elements is weakened due to considering attention weights between each element and all the elements in a self-attention neural network.
  • In an aspect, a network representation generating method for a neural network is provided. The method is applied to a computer device. The method includes:
    • obtaining a source-side vector representation sequence corresponding to an input sequence;
    • performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence;
    • calculating a logical similarity degree between the request vector sequence and the key vector sequence;
    • constructing a locally strengthened matrix according to the request vector sequence;
    • performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and
    • fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • In another aspect, a network representation generating apparatus for a neural network is provided. The apparatus includes: an obtaining module, a linear transformation module, a logical similarity degree calculation module, a locally strengthened matrix construction module, an attention weight distribution determining module and a fusion module, where
    the obtaining module is configured to obtain a source-side vector representation sequence corresponding to an input sequence;
    the linear transformation module is configured to perform a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence;
    the logical similarity degree calculation module is configured to calculate a logical similarity degree between the request vector sequence and the key vector sequence;
    the locally strengthened matrix construction module is configured to construct a locally strengthened matrix according to the request vector sequence;
    the attention weight distribution determining module is configured to perform a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and
    the fusion module is configured to fuse value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • In another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of the network representation generating method for a neural network described above.
  • In another aspect, a computer device is provided. The computer device includes a memory and a processor. The memory stores a computer program. The computer program, when executed by the processor, causes the processor to perform the steps of the network representation generating method for a neural network described above.
  • According to the network representation generating method and apparatus for a neural network, the storage medium, and the device, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also an association between elements in the input sequence far away from each other can be retained.
  • BRIEF DESCRIPTION OF THE DRAWINGS
    • FIG. 1 is a diagram showing an application environment of a network representation generating method for a neural network according to an embodiment;
    • FIG. 2 is a schematic flowchart of a network representation generating method for a neural network according to an embodiment;
    • FIG. 3 is a schematic diagram showing a process of calculating a network representation sequence corresponding to an input sequence according to an embodiment;
    • FIG. 4 is a diagram showing a system architecture in which an SAN attention weight distribution is corrected by a Gaussian distribution according to an embodiment;
    • FIG. 5 is a schematic flowchart showing a process of constructing a locally strengthened matrix according to a request vector sequence according to an embodiment;
    • FIG. 6 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence according to an embodiment;
    • FIG. 7 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence and a key vector sequence according to an embodiment;
    • FIG. 8 is a schematic structural diagram of a stacked multi-head self-attention neural network having multiple layers according to an embodiment;
    • FIG. 9 is a schematic flowchart of a network representation generating method for a neural network according to an embodiment;
    • FIG. 10 is a structural block diagram of a network representation generating apparatus for a neural network according to an embodiment; and
    • FIG. 11 is a structural block diagram of a computer device according to an embodiment.
    DESCRIPTION OF EMBODIMENTS
  • To make the objectives, technical solutions, and advantages of the present disclosure clearer and more understandable, the present disclosure is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that, the embodiments described herein are merely used for explaining the present disclosure, but are not intended to limit the present disclosure.
  • FIG. 1 is a diagram showing an application environment of a network representation generating method for a neural network according to an embodiment. Referring to FIG. 1, the network representation generating method for a neural network is applied to a network representation generating system for a neural network. The network representation generating system for a neural network includes a terminal 110 and a computer device 120. The terminal 110 and the computer device 120 are connected to each other through Bluetooth, a universal serial bus (USB) or a network. The terminal 110 may transmit a to-be-processed input sequence to the computer device 120 in real time or non-real time. The computer device 120 is used to receive the input sequence, and perform transformation on the input sequence to output a corresponding network representation sequence. The terminal 110 may be a desktop terminal or a mobile terminal. The mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, or the like. The computer device 120 may be an independent server or terminal, or may be a server cluster formed by multiple servers, or may be a cloud server providing basic cloud computing services such as a cloud server service, a cloud database service, a cloud storage service, and a CDN service.
  • It should be noted that, the foregoing application environment is merely an example. In some embodiments, the computer device 120 may directly obtain the input sequence without the terminal 110. For example, in a case that the computer device is implemented by a mobile phone, the mobile phone may directly obtain the input sequence (for example, a sequence formed by words in an instant text message), perform transformation on the input sequence by using a network representation generating apparatus for a neural network configured on the mobile phone, and output a network representation sequence corresponding to the input sequence.
  • As shown in FIG. 2, a network representation generating method for a neural network is provided according to an embodiment. In this embodiment, description is made mainly by using an example in which the method is applied to the computer device 120 in FIG. 1. Referring to FIG. 2, the network representation generating method for a neural network may include the following steps S202 to S212.
  • In S202, a source-side vector representation sequence corresponding to an input sequence is obtained.
  • The input sequence is a sequence to be transformed to obtain a corresponding network representation sequence. The input sequence includes a set of ordered elements. Taking an input sequence including I elements as an example, the input sequence may be represented by $X = \{x_1, x_2, x_3, \ldots, x_I\}$, where the length of the input sequence is denoted by I, and I is a positive integer.
  • In a scenario that the input sequence is required to be translated, the input sequence may be a word sequence corresponding to a to-be-translated text, and elements in the input sequence are respectively words in the word sequence. If the to-be-translated text is a Chinese text, the word sequence may be a sequence formed by performing word segmentation on the to-be-translated text and arraying obtained words according to a word order. If the to-be-translated text is an English text, the word sequence is a sequence formed by arraying words according to a word order. For example, if the to-be-translated text is "Bush held a talk with Sharon", a corresponding input sequence X is {Bush, held, a, talk, with, Sharon}.
  • The source-side vector representation sequence is a sequence formed by source-side vector representations of the elements in the input sequence. The vector representations in the source-side vector representation sequence are in a one-to-one correspondence with the elements in the input sequence. The source-side vector representation sequence may be represented by $Z = \{z_1, z_2, z_3, \ldots, z_I\}$.
  • The computer device may convert each element in the input sequence into a vector having a fixed length (i.e., perform word embedding). In an embodiment, the network representation generating method for a neural network is applied to a neural network model. In this case, the computer device may convert each element in the input sequence into a corresponding vector through a first layer of the neural network model. For example, the computer device converts an i-th element $x_i$ in the input sequence into a d-dimensional column vector $z_i$. The computer device combines the vectors corresponding to the elements in the input sequence, to obtain the source-side vector representation sequence corresponding to the input sequence, that is, a vector sequence formed by I d-dimensional column vectors, where d is a positive integer. The computer device may also receive a source-side vector representation sequence corresponding to an input sequence transmitted by another device. Note that $z_i$ and the column vectors mentioned below may alternatively be row vectors; for ease of describing the calculation process, the description herein is given in terms of column vectors.
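  • As a minimal sketch of S202, the following toy example builds a source-side vector representation sequence for the sample sentence used in this disclosure; the vocabulary and the randomly initialized embedding table are illustrative assumptions, standing in for a trained first layer.

```python
import numpy as np

vocab = {"Bush": 0, "held": 1, "a": 2, "talk": 3, "with": 4, "Sharon": 5}
d = 8                                     # dimension of each vector representation
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(vocab), d))  # one d-dimensional vector per word (toy values)

tokens = ["Bush", "held", "a", "talk", "with", "Sharon"]
# Source-side vector representation sequence Z: an I x d matrix,
# one row per element of the input sequence.
Z = np.stack([embed[vocab[t]] for t in tokens])
print(Z.shape)  # (6, 8) -> I = 6 elements, each a d = 8 dimensional vector
```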
  • In S204, a linear transformation is performed on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence.
  • The linear transformation may be used to map a vector in a vector space to another vector space. The vector space is a set formed by multiple vectors having the same dimension. In an embodiment, the computer device may perform the linear transformation on the source-side vector representation sequence by three different learnable parameter matrices, so that the source-side vector representation sequence is mapped to three different vector spaces, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence.
  • In an embodiment, the network representation generating method for a neural network is applied to a model based on a self-attention neural network (SAN). In this case, each of the request vector sequence, the key vector sequence and the value vector sequence is obtained by performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence at a source side. In another embodiment, the network representation generating method for a neural network is applied to a neural network model including an Encoder-Decoder structure. In this case, the key vector sequence and the value vector sequence are obtained by encoding the source-side vector representation sequence corresponding to the input sequence by an encoder. That is, the key vector sequence and the value vector sequence are outputs of the encoder. The request vector sequence is an input of a decoder, for example, may be a target-side vector representation sequence, where the target-side vector representation sequence may be formed by vector representations corresponding to elements in an output sequence outputted by the decoder.
  • In an embodiment, the computer device may perform the linear transformation on the source-side vector representation sequence Z by three different learnable parameter matrices $W^Q$, $W^K$, and $W^V$, to obtain a request vector sequence Q, a key vector sequence K and a value vector sequence V according to the following formulas:

    $$Q = Z \cdot W^Q$$
    $$K = Z \cdot W^K$$
    $$V = Z \cdot W^V$$

    where the input sequence $X = \{x_1, x_2, x_3, \ldots, x_I\}$ includes I elements, and each element in the source-side vector representation sequence $Z = \{z_1, z_2, z_3, \ldots, z_I\}$ is a d-dimensional column vector; that is, Z is a vector sequence formed by I d-dimensional column vectors, which may be denoted as an I × d matrix. Further, the learnable parameter matrices $W^Q$, $W^K$, and $W^V$ each are a d × d matrix, and the request vector sequence Q, the key vector sequence K, and the value vector sequence V each are an I × d matrix.
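  • Continuing the toy example above (Z, d, rng as defined there), a minimal sketch of S204 follows; the random matrices stand in for learned parameters.

```python
# Three learnable d x d parameter matrices map Z into three different vector spaces.
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = Z @ W_Q   # request vector sequence Q = Z * W^Q, an I x d matrix
K = Z @ W_K   # key vector sequence     K = Z * W^K, an I x d matrix
V = Z @ W_V   # value vector sequence   V = Z * W^V, an I x d matrix
```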
  • In S206, a logical similarity degree between the request vector sequence and the key vector sequence is calculated.
  • The logical similarity degree is used for measuring a similarity between one element in the input sequence and another element in the input sequence. In the process of generating a network representation corresponding to the element, a corresponding attention weight may be assigned, based on the similarity, to a value vector corresponding to the another element in the input sequence. In this way, the network representation corresponding to the element is obtained in the case of taking the association between the element and the another element into consideration, so that the generated network representation can more accurately present features of the element and contain more abundant information.
  • In an embodiment, the network representation generating method for a neural network is applied to a neural network model including an Encoder-Decoder structure. In this case, the request vector sequence is a target-side vector representation sequence, and the calculated logical similarity degree is used for indicating a similarity between the target-side vector representation sequence and the key vector sequence corresponding to the input sequence. A corresponding attention weight is assigned, based on the similarity, to the value vector sequence corresponding to the input sequence, so that the network representation of each element outputted by the source side is obtained in the case of taking the effect of the target-side vector representation sequence inputted by a target side into consideration.
  • In an embodiment, the computer device may calculate a logical similarity degree matrix E between the request vector sequence Q and the key vector sequence K according to a cosine similarity formula, that is,

    $$E = \frac{Q K^{T}}{\sqrt{d}}$$

    where $K^T$ represents a transposed matrix of the key vector sequence K, and d denotes the dimension of the source-side vector representation $z_i$ into which each element $x_i$ in the input sequence is converted; d is also the dimension of the network representation corresponding to $x_i$ and the dimension of the network hidden state vector. In the foregoing formula, dividing by $\sqrt{d}$ decreases the inner product, which speeds up the calculation.
  • The process of calculating the logical similarity degree matrix E is described in the following.
  • $Q = (q_1, q_2, \ldots, q_i, \ldots, q_I)$ and $K = (k_1, k_2, \ldots, k_i, \ldots, k_I)$, where $q_i$ and $k_i$ are d-dimensional column vectors and are respectively the request vector and the key vector corresponding to the source-side vector representation $z_i$. In the logical similarity degree matrix $E = (e_1, e_2, \ldots, e_i, \ldots, e_I)$, each element of $e_i$ represents a logical similarity degree between the request vector $q_i$ corresponding to the source-side vector representation $z_i$ and one of the key vectors $k_1, k_2, \ldots, k_I$ respectively corresponding to all elements in the input sequence, where $e_i$ is the i-th column of E, and $e_i$ is an I-dimensional column vector calculated according to the formula

    $$e_i = \frac{1}{\sqrt{d}}\left(q_i k_1^T,\; q_i k_2^T,\; q_i k_3^T,\; \ldots,\; q_i k_I^T\right)$$

    In essence, $e_i$ implies an association between the two elements in each of the I element pairs formed by the i-th element $x_i$ and the elements $x_1, x_2, \ldots, x_I$ in the input sequence. The logical similarity degree matrix E is an I × I matrix:

    $$E = \frac{1}{\sqrt{d}}\begin{pmatrix} q_1 k_1^T & q_2 k_1^T & q_3 k_1^T & \cdots & q_I k_1^T \\ q_1 k_2^T & q_2 k_2^T & q_3 k_2^T & \cdots & q_I k_2^T \\ q_1 k_3^T & q_2 k_3^T & q_3 k_3^T & \cdots & q_I k_3^T \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_1 k_I^T & q_2 k_I^T & q_3 k_I^T & \cdots & q_I k_I^T \end{pmatrix}$$
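  • As a sketch of this calculation, continuing the toy example above (Q, K, d as defined there); note that the sketch indexes rows by the request i, whereas the matrix displayed above arranges each $e_i$ as a column, so the layout is transposed relative to the patent's notation while the content is the same.

```python
# Logical similarity degree matrix: one scaled dot product per element pair.
E = Q @ K.T / np.sqrt(d)   # I x I; E[i, j] = q_i . k_j / sqrt(d)
```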
  • In S208, a locally strengthened matrix is constructed according to the request vector sequence.
  • Each element of a column vector in the locally strengthened matrix represents an association degree between two elements in the input sequence. In the process of generating the network representation corresponding to each element in the input sequence, the effect of another element in the input sequence having a large association degree with the current element on the network representation may be strengthened by the locally strengthened matrix, and the effect of an element having a small association degree with the current element on the network representation is relatively weakened. By means of the locally strengthened matrix, a considered scope is limited to local elements rather than all the elements in the input sequence when considering the effect of another element on the network representation of the current element. In this way, in the attention weight assignment, the attention weights are biased to be assigned in the local elements. A magnitude of the attention weight assigned to a value vector corresponding to an element among the local elements is related to an association degree between the element and the current element. That is, a large attention weight is assigned to a value vector corresponding to an element having a large association degree with the current element.
  • The description is given below by taking the input sequence "Bush held a talk with Sharon" as an example. In the SAN model, in a process of outputting a network representation corresponding to an element "Bush", value vectors respectively corresponding to all elements "Bush", "held", "a", "talk", "with", and "Sharon" in the input sequence are completely considered, and the value vectors respectively corresponding to all the elements are assigned with corresponding attention weights, which disperses a distribution of the attention weights to some extent, and further weakens an association between the element "Bush" and an adjacent element.
  • In the network representation generating method for a neural network in this embodiment, in the process of outputting the network representation corresponding to the element "Bush", the attention weights may be assigned within a locally strengthened range. In this process, if a large association exists between the element "Bush" and the element "held", a relatively high attention weight is assigned to the value vector corresponding to the element "held". Similarly, the elements "a" and "talk", which also fall within the locally strengthened range corresponding to the element "Bush", are noted and assigned relatively high attention weights. In this way, the information (value vectors) corresponding to the words in the phrase "held a talk" is captured and associated with the element "Bush", so that the outputted network representation of the element "Bush" can not only indicate local information, but also retain a dependence on farther elements.
  • Therefore, in the process of generating the network representation corresponding to each element, the computer device is required to determine a locally strengthened range corresponding to the current element, so that the assignment of the attention weights corresponding to the current element is limited in the locally strengthened range.
  • In an embodiment, the locally strengthened range may be determined according to two variables including a center point of the locally strengthened range and a window size of the locally strengthened range. The center point refers to a position of an element assigned with the highest attention weight in the process of generating of the network representation of the current element in the input sequence. The window size refers to a length of the locally strengthened range, which determines how many elements are centralizedly assigned with the attention weights. In this case, the locally strengthened range is defined by elements falling in a range with the center point as a center and with the window size as a span. Since a locally strengthened range corresponding to each element is related to the element itself, which corresponds to the element and is not fixed in a specific range, abundant context information may be flexibly captured by means of the generated network representation of the element.
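  • For intuition, the following continues the numeric sketch above (Q, d, rng as defined there) and predicts a center point and a window size per element. The tanh/sigmoid feedforward form is an assumption for illustration; the disclosure only requires a feedforward transformation that yields a scalar, followed by a nonlinear transformation whose output is proportional to the length of the input sequence.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

I_len = Q.shape[0]                        # length I of the input sequence
W_p = rng.normal(size=(d, d)); u_p = rng.normal(size=d)   # first feedforward network
W_d = rng.normal(size=(d, d)); u_d = rng.normal(size=d)   # second feedforward network

# Feedforward transformation of each request vector to a scalar, then a
# nonlinear transformation scaled to be proportional to the length I.
P = I_len * sigmoid(np.tanh(Q @ W_p) @ u_p)   # center point per element
D = I_len * sigmoid(np.tanh(Q @ W_d) @ u_d)   # window size per element
```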
  • In an embodiment, the computer device may determine the locally strengthened range corresponding to each element according to the center point and the window size. The process may be performed by: determining the center point as a mean of a Gaussian distribution and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance. The computer device may calculate an association degree between two elements based on the determined locally strengthened range, to obtain the locally strengthened matrix. The association degree between two elements is calculated according to the following formula:

    $$G_{ij} = -\frac{2(j - P_i)^2}{D_i^2} \qquad (2)$$

    where $G_{ij}$ represents an association degree between the j-th element and the center point $P_i$ corresponding to the i-th element in the input sequence, and $G_{ij}$ is the value of the j-th element of the i-th column vector in the locally strengthened matrix G; $P_i$ represents the center point of the locally strengthened range corresponding to the i-th element; and $D_i$ represents the window size of the locally strengthened range corresponding to the i-th element.
  • It can be seen from formula (2) that the locally strengthened matrix G is an I × I matrix including I column vectors, where the dimension of each column vector is I. The value of each element in the i-th column vector of G is determined based on the locally strengthened range corresponding to the i-th element in the input sequence. Formula (2) is symmetric about the center point P_i, and its numerator depends on the distance between the j-th element in the input sequence and the center point P_i corresponding to the i-th element. A close distance corresponds to a large G_ij, indicating a large association degree between the j-th element and the i-th element; a far distance corresponds to a small G_ij, indicating a small association degree between the two. That is, in the process of generating the network representation corresponding to the i-th element, the attention weights are concentrated on elements close to the center point P_i.
  • It should be noted that calculating G_ij according to formula (2), which is derived from the Gaussian distribution, is merely an example. In some embodiments, after the center point and the window size of the locally strengthened range are determined, the center point may be used as a mean, the window size may be used as a variance, and the value of G_ij may be calculated through another distribution having this mean and variance, such as a Poisson distribution or a binomial distribution, to obtain the locally strengthened matrix G.
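As an illustration of formula (2), the following NumPy sketch builds the locally strengthened matrix from given center points and window sizes. The function name, the 0-based position indexing, and the small epsilon guarding against a zero window size are assumptions made for this example, not part of the embodiments.

```python
import numpy as np

def locally_strengthened_matrix(P, D, I):
    """Build G per formula (2): G[i, j] = -2 * (j - P[i])**2 / D[i]**2.

    P : (I,) center points of the locally strengthened ranges
    D : (I,) window sizes of the locally strengthened ranges
    I : length of the input sequence
    """
    j = np.arange(I, dtype=float)                        # positions of the j-th elements
    diff = j[None, :] - P[:, None]                       # (I, I) distances j - P_i
    return -2.0 * diff ** 2 / (D[:, None] ** 2 + 1e-9)   # peaks at 0 where j == P_i
```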
  • In S210, a nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements.
  • The logical similarity degree indicates a similarity between two elements in each element pair in the input sequence, and the locally strengthened matrix indicates an association between the two elements in each element pair in the input sequence. The locally strengthened attention weight distribution may be calculated by a combination of the logical similarity degree and the locally strengthened matrix.
  • In an embodiment, the performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to the elements may include: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • After obtaining the logical similarity degree and the association degree between the two elements in each element pair in the input sequence, the computer device may correct the logical similarity degree through the association degree, to obtain the locally strengthened logical similarity degree. In an embodiment, the logical similarity degree matrix E including the logical similarity degrees of all element pairs may be added to the locally strengthened matrix G including the association degrees of all the element pairs, to correct (also referred to as offset) the logical similarity degree matrix, and normalization is performed on the logical similarity degree vectors in the corrected logical similarity degree matrix, to obtain the locally strengthened attention weight distribution.
  • The normalization on the logical similarity degree vectors in the corrected logical similarity degree matrix is performed per column vector e′_i. That is, the value of each element in the column vector e′_i is mapped into the range (0, 1), and the sum of all elements in the column vector e′_i is 1. Normalizing the column vector e′_i highlights the maximum value in the column vector and suppresses the other components that are far lower than the maximum value, and thus the locally strengthened attention weight distribution corresponding to the i-th element in the input sequence is obtained.
  • In an embodiment, the locally strengthened attention weight distribution A may be calculated according to the following formula:

    A = \mathrm{softmax}(E + G)

    where the softmax function is a normalization function, and A is a matrix including the attention weight distribution corresponding to each element in the input sequence. A = {α_1, α_2, α_3, ..., α_I} includes I I-dimensional column vectors, and the i-th element α_i in A represents the attention weight distribution corresponding to the i-th element x_i in the input sequence.
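A minimal sketch of this correction-and-normalization step, continuing the NumPy example above. Here each row of E is assumed to hold the logical similarity degrees of one element, so the normalization runs along rows; the document describes the same operation in terms of column vectors.

```python
def locally_strengthened_attention(E, G):
    """A = softmax(E + G): offset the logical similarity degrees by the
    locally strengthened matrix, then normalize each distribution to sum to 1."""
    S = E + G                                # locally strengthened logical similarity
    S = S - S.max(axis=-1, keepdims=True)    # subtract the row max for numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=-1, keepdims=True)
```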
  • In S212, value vectors in the value vector sequence are fused according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • The network representation sequence is a sequence formed by multiple network representations (vector representations). In this embodiment, the input sequence may be inputted to the neural network model, and the network representation sequence corresponding to the input sequence may be outputted through linear transformation or nonlinear transformation on a model parameter in a hidden layer of the neural network model.
  • In the process of outputting a network representation corresponding to the current element x_i, the computer device obtains the attention weight distribution α_i corresponding to the element from the locally strengthened attention weight distribution matrix, and calculates a weighted sum of the value vectors in the value vector sequence with the elements of α_i as weight coefficients, to obtain a network representation o_i corresponding to the current element x_i. In this case, the network representation sequence O corresponding to the input sequence is formed by multiple network representations, for example, O = {o_1, o_2, o_3, ..., o_I}.
  • The i-th element o_i in the network representation sequence O corresponding to the input sequence may be calculated according to the following formula:

    o_i = \sum_{j=1}^{I} \alpha_{ij} v_j

    Since α_ij is a scalar and v_j is a d-dimensional column vector, o_i is also a d-dimensional column vector. That is, in a case that the attention weight distribution corresponding to the i-th element x_i in the input sequence is expressed as α_i = {α_i1, α_i2, α_i3, ..., α_iI}, and the value vector sequence corresponding to the input sequence is expressed as V = {v_1, v_2, v_3, ..., v_I}, the network representation o_i corresponding to x_i may be calculated according to the following formula:

    o_i = \alpha_{i1} v_1 + \alpha_{i2} v_2 + \alpha_{i3} v_3 + \cdots + \alpha_{iI} v_I
  • Since the attention weight distribution corresponding to the current element is a locally strengthened attention weight distribution obtained by correcting the original logical similarity degrees, the weighted sum does not consider the value vectors corresponding to all the elements in the input sequence equally, but emphasizes the value vectors corresponding to the elements falling in the locally strengthened range. In this way, the outputted network representation of the current element contains local information associated with the current element.
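The weighted fusion itself reduces to a single matrix product; a sketch under the same row-wise convention as the earlier examples:

```python
def fuse_values(A, V):
    """o_i = sum_j alpha_ij * v_j for every i, computed as one matrix product.

    A : (I, I) locally strengthened attention weight distributions
    V : (I, d) value vector sequence
    Returns the (I, d) network representation sequence O.
    """
    return A @ V
```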
  • It should be noted that the term "element" in the present disclosure describes a basic component unit of a sequence, a vector, or a matrix. For example, "elements in an input sequence" refer to the inputs in the input sequence, "elements in a matrix" refer to the column vectors that constitute the matrix, and "elements in a column vector" refer to the values in the column vector.
  • FIG. 3 is a schematic diagram showing a process of calculating a network representation sequence corresponding to an input sequence according to an embodiment. Referring to FIG. 3, after a vectorized representation Z corresponding to an input sequence X is obtained, Z is linearly transformed into a request vector sequence Q, a key vector sequence K and a value vector sequence V through three different learnable parameter matrices. Next, a logical similarity degree between each request vector and each key vector is calculated through a dot product operation, to obtain a logical similarity degree matrix E. Next, a locally strengthened matrix G is constructed according to Q or K, and E is corrected by G, to obtain a locally strengthened logical similarity degree matrix E′. Next, normalization is performed on E′ by using the softmax function, to obtain a locally strengthened attention weight distribution matrix A. Finally, a dot product operation is performed on A and the value vector sequence V, to output a network representation sequence O.
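Chaining the steps of FIG. 3 end to end gives the following sketch. The parameter names (Wq, Wk, Wv, Wp, Up, Ud) are assumptions for illustration, and the helpers are the ones sketched above; a production implementation would train these parameters rather than treat them as given.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def network_representation(Z, Wq, Wk, Wv, Wp, Up, Ud):
    """One pass of the FIG. 3 pipeline: Z -> Q, K, V -> E -> G -> A -> O."""
    I, d = Z.shape
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv            # request, key and value sequences
    E = Q @ K.T / np.sqrt(d)                    # logical similarity degree matrix
    hidden = np.tanh(Q @ Wp)                    # feedforward hidden states
    P = I * sigmoid(hidden @ Up)                # per-element center points
    D = I * sigmoid(hidden @ Ud)                # per-element window sizes
    G = locally_strengthened_matrix(P, D, I)    # formula (2)
    A = locally_strengthened_attention(E, G)    # softmax(E + G)
    return fuse_values(A, V)                    # network representation sequence O
```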
  • FIG. 4 is a diagram showing a system architecture in which an SAN attention weight distribution is corrected by a Gaussian distribution according to an embodiment. The description is given below by taking the input sequence being "Bush held a talk with Sharon" and the current element being "Bush" as an example. On the left side of FIG. 4, a basic model is constructed by an original SAN, to obtain a logical similarity degree between each pair of elements (formed by two elements in the input sequence), and an attention weight distribution corresponding to "Bush" is calculated based on the logical similarity degree, which considers all the words. The word "held" is assigned the highest attention weight (where a column height represents the magnitude of an attention weight), and the remaining words are assigned lower attention weights. Referring to the middle of FIG. 4, the position of the center point of the locally strengthened range corresponding to the current element "Bush", calculated by using the Gaussian distribution, is approximately equal to 4, which corresponds to the word "talk" in the input sequence, and the window size of the locally strengthened range is approximately equal to 3. That is, the locally strengthened range corresponding to the current element "Bush" covers the positions of three words centered on the word "talk". A locally strengthened matrix is calculated based on the determined locally strengthened range, and the logical similarity degree obtained on the left side of FIG. 4 is corrected by using the locally strengthened matrix, so that the corrected attention weights are concentrated on the three words, and the word "talk" is assigned the highest attention weight. Combining the left side and the middle of FIG. 4, the corrected attention weight distribution corresponding to the current element "Bush" shown on the right side of FIG. 4 is obtained. That is, the phrase "held a talk" is assigned most of the attention weights. In the process of calculating the network representation corresponding to the word "Bush", the value vectors corresponding to the three words "held a talk" are considered emphatically. In this way, the information of the phrase "held a talk" is captured and associated with the word "Bush".
  • According to the network representation generating method for a neural network, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, thereby correcting the original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also an association between elements in the input sequence far away from each other can be retained.
  • As shown in FIG. 5, in an embodiment, the process of constructing the locally strengthened matrix according to the request vector sequence may be implemented by performing the following steps S502 to S508.
  • In S502, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements is determined.
  • The locally strengthened range corresponding to each element in the input sequence is determined by the center point and the window size corresponding to the element. The center point corresponding to the element depends on the request vector corresponding to the element. Therefore, the center point of the locally strengthened range corresponding to the element may be determined according to the request vector.
  • In an embodiment, the process of determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each element may be implemented by performing the following steps including: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  • The computer device may determine, according to the request vector sequence obtained in step S204, a center point of a locally strengthened range corresponding to each element. Taking the i-th element x_i in the input sequence as an example, the center point of the locally strengthened range corresponding to x_i may be obtained by performing the following steps 1) and 2).
  • In 1), the computer device maps a request vector q_i corresponding to the i-th element into a hidden state by a first feedforward neural network, and performs a linear transformation on the hidden state by U_P^T, to obtain a first scalar p_i corresponding to the i-th element in the input sequence. The first scalar p_i is a value in a real number space, and is calculated according to the following formula:

    p_i = U_P^T \tanh(W_P q_i)

    where tanh(W_P q_i) is the part computed by the first feedforward neural network, tanh is an activation function, q_i is the request vector corresponding to the i-th element in the input sequence, U_P^T and W_P are each a trainable linear transformation matrix, U_P^T is the transpose of U_P, U_P is a d-dimensional column vector, and U_P^T is a d-dimensional row vector. In this way, the high-dimensional vector outputted by the feedforward neural network is mapped to a scalar. Herein and in the following, the vector is mapped into the hidden state by the feedforward neural network; the method for mapping the vector is not limited thereto, and the feedforward neural network may be replaced with another neural network model, such as a long short-term memory (LSTM) model and variations thereof or a gated unit and variations thereof, or with a simple linear transformation.
  • In 2), the computer device converts the first scalar p_i into a scalar whose value range is (0, 1) by a nonlinear transformation function, and multiplies that scalar by the length I of the input sequence, to obtain a center point position P_i whose value range is (0, I). P_i is the center point of the locally strengthened range corresponding to the i-th element, and P_i is proportional to the length I of the input sequence. P_i may be calculated according to the following formula:

    P_i = I \cdot \mathrm{sigmoid}(p_i)

    where sigmoid is a nonlinear transformation function used to convert p_i into a scalar whose value range is (0, 1). The sigmoid function herein and in the following may be replaced with any other method for mapping a real number into the range (0, 1), which is not limited in the present disclosure.
  • The computer device determines the calculated P_i as the center point of the locally strengthened range corresponding to the i-th element x_i in the input sequence. For example, if the length I of the input sequence is equal to 10, and the calculated P_i is equal to 5, the center point of the locally strengthened range corresponding to x_i is the fifth element in the input sequence. In the process of generating the network representation corresponding to x_i, the value vector of the fifth element in the input sequence is assigned the highest attention weight.
  • The computer device may repeat the foregoing steps until center points of locally strengthened ranges respectively corresponding to all the elements are each obtained according to a corresponding request vector in the request vector sequence.
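A focused sketch of steps 1) and 2), under the same shape conventions as the earlier examples:

```python
def predict_center_points(Q, Wp, Up, I):
    """P_i = I * sigmoid(U_P^T tanh(W_P q_i)) for every request vector q_i.

    Q  : (I, d) request vector sequence
    Wp : (d, d) trainable matrix of the first feedforward neural network
    Up : (d,)   trainable vector U_P mapping each hidden state to the scalar p_i
    """
    p = np.tanh(Q @ Wp) @ Up    # first scalars p_i in real number space
    return I * sigmoid(p)       # center points P_i, each in (0, I)
```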
  • In S504, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element is determined.
  • In order to flexibly predict the window size, a corresponding window size may be predicted for each element. In this case, the computer device may determine, according to each request vector in the request vector sequence, a window size of a locally strengthened range corresponding to each element. That is, each request vector corresponds to one window size.
  • In an embodiment, the process of determining, according to the request vector sequence, a window size of a locally strengthened range corresponding to each element may be implemented by performing the following steps including: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • The computer device may determine, according to the request vector sequence obtained in step S204, a window size of a locally strengthened range corresponding to each element. Taking the i-th element x_i in the input sequence as an example, the window size of the locally strengthened range corresponding to x_i may be obtained by performing the following steps 1) and 2).
  • In 1), the computer device maps a request vector q_i corresponding to the i-th element into a hidden state by a second feedforward neural network, and performs a linear transformation on the hidden state by U_D^T, to obtain a third scalar z_i corresponding to the i-th element in the input sequence. The third scalar z_i is a value in a real number space, and is calculated according to the following formula:

    z_i = U_D^T \tanh(W_P q_i)

    where tanh(W_P q_i) is the part computed by the second feedforward neural network, tanh is an activation function, q_i is the request vector corresponding to the i-th element in the input sequence, W_P is the same parameter matrix as that used in calculating the hidden state of the center point described above, U_D^T is a trainable linear transformation matrix and is the transpose of U_D, U_D is a d-dimensional column vector, and U_D^T is a d-dimensional row vector. In this way, the high-dimensional vector outputted by the feedforward neural network is mapped to a scalar.
  • In 2), the computer device converts the third scalar z_i into a scalar whose value range is (0, 1) by a nonlinear transformation function, and multiplies that scalar by the length I of the input sequence, to obtain a window size D_i whose value range is (0, I). D_i is the window size of the locally strengthened range corresponding to the i-th element, and D_i is proportional to the length I of the input sequence. D_i may be calculated according to the following formula:

    D_i = I \cdot \mathrm{sigmoid}(z_i)

    where sigmoid is a nonlinear transformation function used to convert z_i into a scalar whose value range is (0, 1).
  • The computer device determines the calculated D_i as the window size of the locally strengthened range corresponding to the i-th element x_i in the input sequence. For example, if the length I of the input sequence is equal to 10, and the calculated D_i is equal to 7, the window size of the locally strengthened range corresponding to x_i covers seven elements centered on the center point. In the process of generating the network representation corresponding to x_i, the attention weights are concentrated among these seven elements.
  • The computer device may repeat the foregoing steps until window sizes of locally strengthened ranges respectively corresponding to all the elements are each obtained according to a corresponding request vector in the request vector sequence.
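A matching sketch for the per-element window size; note that, per the text, W_P is shared with the center-point computation and only the output vector U_D is specific to the window size:

```python
def predict_window_sizes(Q, Wp, Ud, I):
    """D_i = I * sigmoid(U_D^T tanh(W_P q_i)) for every request vector q_i."""
    z = np.tanh(Q @ Wp) @ Ud    # third scalars z_i
    return I * sigmoid(z)       # window sizes D_i, each in (0, I)
```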
  • In S506, the locally strengthened range corresponding to the element is determined according to the center point and the window size.
  • It can be seen from step S502 and step S504 that, since request vectors respectively corresponding to the elements in the input sequence are different from each other, center points and window sizes respectively corresponding to the elements are different from each other. In this case, locally strengthened ranges respectively corresponding to the elements are different from each other. The locally strengthened range is selected according to characteristics of each element itself, which is more flexible.
  • In S508, association degrees between every two of the elements are calculated based on the locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • The computer device may calculate the association degrees between every two of the elements based on the determined locally strengthened ranges, to obtain the locally strengthened matrix. An association degree between two of the elements is obtained according to the foregoing formula (2):

    G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}

    where G_ij is the value of the j-th element of the i-th column vector in the locally strengthened matrix G.
  • FIG. 6 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence according to an embodiment. Referring to FIG. 6, the request vector sequence is firstly mapped into a hidden state by a feedforward neural network. The hidden state is mapped to a scalar in a real number space by a linear transformation. The scalar is converted into a scalar whose value range is (0, 1) by a nonlinear transformation function sigmoid and is multiplied by the length I of the input sequence, to obtain a center point and a window size, so as to determine a locally strengthened range. A locally strengthened matrix is calculated based on the locally strengthened range.
  • In the foregoing embodiment, by performing the transformation on the request vector corresponding to each element in the input sequence, a corresponding locally strengthened range can be flexibly determined for the element, rather than fixing one locally strengthened range for the whole input sequence, so that the dependence between elements relatively far away from each other in the input sequence can be effectively captured.
  • In an embodiment, the process of constructing a locally strengthened matrix according to the request vector sequence may be implemented by performing the following steps including: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • In this embodiment, the process of determining, according to the request vector sequence, the center point of the locally strengthened range corresponding to each element is similar to that described above, and is not repeated herein. The difference is that global context information is considered when determining the window size: the window sizes of the locally strengthened ranges respectively corresponding to all the elements in the input sequence are set to a uniform window size, so the information of all the elements in the input sequence is required to be fused when determining the window size.
  • In an embodiment, the process of determining a uniform window size of the locally strengthened ranges according to the key vector sequence may be implemented by performing the following steps including: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  • The computer device may determine the uniform window size of the locally strengthened ranges according to the key vector sequence obtained in step S204. That is, the window sizes of the locally strengthened ranges respectively corresponding to the elements are the same. The uniform window size may be obtained by performing the following steps 1) to 3).
  • In 1), the computer device obtains the key vector sequence K corresponding to the input sequence, and calculates an average \bar{K} of all key vectors in the key vector sequence K:

    \bar{K} = \frac{\sum_{i=1}^{I} k_i}{I}
  • In 2), the computer device performs a linear transformation on the obtained average \bar{K}, to generate a fifth scalar z in a real number space:

    z = U_D^T \tanh(W_D \bar{K})

    where U_D^T is the same parameter matrix as that used in calculating the window size described above, and W_D is a trainable linear transformation matrix.
  • In 3), the computer device converts the fifth scalar z into a scalar whose value range is (0, 1) by a nonlinear transformation function, and multiplies that scalar by the length I of the input sequence, to obtain a window size D whose value range is (0, I). D is the uniform window size of the locally strengthened ranges, and D is proportional to the length I of the input sequence. D may be calculated according to the following formula:

    D = I \cdot \mathrm{sigmoid}(z)

    where sigmoid is a nonlinear transformation function used to convert z into a scalar whose value range is (0, 1).
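A sketch of steps 1) to 3), with the key vector sequence stored row-wise and Wd, Ud standing in for the trainable parameters W_D and U_D:

```python
def predict_uniform_window(K, Wd, Ud, I):
    """D = I * sigmoid(U_D^T tanh(W_D K_bar)), one window shared by all elements.

    K  : (I, d) key vector sequence
    Wd : (d, d) trainable linear transformation matrix
    Ud : (d,)   trainable vector mapping the hidden state to the fifth scalar
    """
    K_bar = K.mean(axis=0)          # step 1: average of all key vectors
    z = np.tanh(K_bar @ Wd) @ Ud    # step 2: fifth scalar z in real number space
    return I * sigmoid(z)           # step 3: uniform window size D in (0, I)
```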
  • Although the window sizes of the locally strengthened ranges respectively corresponding to the elements are the same, the locally strengthened ranges respectively corresponding to the elements are different from each other, since the center point corresponding to each element is calculated according to the corresponding request vector. The computer device may calculate the association degrees between every two of the elements based on the determined locally strengthened ranges, to obtain the locally strengthened matrix. An association degree between two of the elements is obtained according to the foregoing formula (2), in which the window size D_i of each element equals the uniform window size D:

    G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}

    where G_ij is the value of the j-th element of the i-th column vector in the locally strengthened matrix G.
  • FIG. 7 is a schematic flowchart showing a process of determining a locally strengthened range according to a request vector sequence and a key vector sequence according to an embodiment. Referring to FIG. 7, the request vector sequence is mapped into a hidden state by a feedforward neural network, and an average of the key vector sequence is calculated by average pooling. The hidden state is mapped to a scalar in a real number space by a linear transformation, and the average is mapped to a scalar in the real number space by the linear transformation. The obtained scalars each are converted into a scalar whose value range is (0, 1) by a nonlinear transformation function sigmoid, and the scalar is multiplied by the length I of the input sequence, to obtain a center point and a window size, so as to determine a locally strengthened range.
  • In the foregoing embodiment, the transformation is performed on the key vector sequence corresponding to the input sequence, and the key vector sequence includes the feature vectors (key vectors) corresponding to all the elements in the input sequence. All the context information is therefore considered when determining the uniform window size, so that the locally strengthened range determined for each element based on the uniform window size can capture abundant context information.
  • In an embodiment, the process of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence may be implemented by performing the following steps including: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence. The method further includes: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
  • A stacked multi-head neural network may be used for processing the source-side vector representation sequence corresponding to the input sequence. In this case, the source-side vector representation sequence may be divided, to obtain multiple (also called multi-head) source-side vector representation subsequences having a low dimension. For example, the source-side vector representation sequence includes five elements, and each element is a 512-dimensional column vector. The source-side vector representation sequence is divided into eight parts; that is, eight 5×64 source-side vector representation subsequences are obtained. The eight source-side vector representation subsequences are used as input vectors and transformed respectively in different subspaces, to output eight 5×64 network representation subsequences, as sketched below. The eight network representation subsequences are spliced, and a linear transformation is performed, to output a 5×512 network representation sequence.
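The division and splicing themselves can be sketched as follows, matching the example above of 8 heads splitting a 512-dimensional representation into 64-dimensional subsequences; the helper names are assumptions:

```python
def split_heads(Z, H):
    """Divide an (I, d_model) sequence into H subsequences of width d_model // H."""
    I, d_model = Z.shape
    d_head = d_model // H
    return [Z[:, h * d_head:(h + 1) * d_head] for h in range(H)]

def merge_heads(outputs, Wo):
    """Splice the per-subspace outputs and apply the final linear transformation W^O."""
    return np.concatenate(outputs, axis=-1) @ Wo
```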
  • For example, the stacked multi-head neural network includes H subspaces. First, an input sequence X = {x_1, x_2, x_3, ..., x_I} is converted into a source-side vector representation sequence Z = {z_1, z_2, z_3, ..., z_I}, and H source-side vector representation subsequences are obtained by dividing Z. Then, the source-side vector representation subsequences are transformed respectively in the subspaces. Taking the transformation in an h-th (h = 1, 2, ..., H) subspace as an example, in the h-th subspace, linear transformations are performed on Z_h = {z_h1, z_h2, z_h3, ..., z_hI} respectively by corresponding learnable parameter matrices W_h^Q, W_h^K, and W_h^V, to obtain a request vector sequence Q_h, a key vector sequence K_h, and a value vector sequence V_h. In the H subspaces, the three learnable parameter matrices used in one subspace are different from those used in another subspace, so that different feature vectors are obtained in the subspaces, and different local information can be attended to in the different subspaces.
  • Next, in the h-th subspace, a logical similarity degree matrix E_h between the request vector sequence and the key vector sequence is calculated according to:

    E_h = \frac{Q_h K_h^T}{\sqrt{d}}

    Then, a locally strengthened matrix G_h corresponding to the h-th subspace is constructed according to the request vector sequence Q_h or the key vector sequence K_h. Each element of G_h is calculated according to:

    G_{h,ij} = -\frac{2(j - P_{h,i})^2}{D_{h,i}^2}

    In the formula, the center point P_{h,i} of the locally strengthened range corresponding to the i-th element is determined according to Q_h, and the window size D_{h,i} of the locally strengthened range corresponding to the i-th element is determined according to Q_h or K_h. G_{h,ij} is the value of the j-th element of the i-th column vector in the locally strengthened matrix G_h, and represents the association degree between the j-th element and the center point P_{h,i} corresponding to the i-th element in the input sequence, as expressed in the h-th subspace.
  • Next, in the h-th subspace, a softmax nonlinear transformation is performed to convert the logical similarity degree into an attention weight distribution. The logical similarity degree is corrected by the locally strengthened matrix G_h, to obtain an attention weight distribution A_h = softmax(E_h + G_h). Further, in the h-th subspace, an output representation sequence O_h corresponding to the input sequence is calculated by fusing the value vector sequence V_h according to A_h. Finally, the output representation sequences O_h of the subspaces are spliced, and a linear transformation is performed again, to obtain a final output vector O = Concat(O_1, O_2, O_3, ..., O_h, ..., O_H) W^O.
  • In an embodiment, the method further includes: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  • The neural network may stack multiple layers of calculation. Whether the network is a single-layer neural network or a stacked multi-head neural network, the calculation may be repeated multiple times. In the calculation of each layer, the output of the previous layer is used as the input of the current layer, and the step of performing the linear transformation, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence, is performed again until the output of the current layer, i.e., the network representation sequence of the current layer, is obtained. Considering efficiency and performance, the number of repetitions may be 6, and the network parameters of the neural network at one layer are different from those at another layer. It may be understood that repeating the process 6 times is actually a process of updating the source-side vector representation sequence of the original input sequence 6 times by the network parameters at each layer.
  • For example, in the stacked multi-head neural network, the output of the first layer is O_L1. In the calculation of the second layer, O_L1 is used as the input and transformed by the network parameters of the second layer, to produce the output O_L2 of the second layer, and so on, until the number of repetitions is reached. The output obtained by the sixth repetition is used as the final output; that is, O_L6 is used as the network representation sequence corresponding to the input sequence.
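A sketch of the layer-stacking loop, reusing the single-pass function from the FIG. 3 example; params_per_layer is an assumed list of six per-layer parameter tuples, one per repetition:

```python
def stacked_representation(Z, params_per_layer):
    """Feed each layer's output into the next; with six parameter tuples this
    yields O_L6, the final network representation sequence."""
    O = Z
    for (Wq, Wk, Wv, Wp, Up, Ud) in params_per_layer:
        O = network_representation(O, Wq, Wk, Wv, Wp, Up, Ud)
    return O
```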
  • FIG. 8 is a schematic structural diagram of a stacked multi-head self-attention neural network having multiple layers according to an embodiment. Referring to FIG. 8, the layers have the same structure, and the input of each layer is the output of the previous layer. The input is divided into multiple sub-inputs, and the same transformation is performed on the sub-inputs by the respective network parameters of multiple subspaces (also called multiple heads), to obtain the outputs of the subspaces. The outputs are spliced to obtain the output of the current layer. The output of the current layer is used as the input of the next layer, and the process is repeated multiple times. The output of the last layer is used as the final output.
  • In an embodiment, the input sequence may be a to-be-translated text sequence, and the outputted network representation sequence includes feature vectors corresponding to words in the translated text. Therefore, a translated sentence may be determined according to the outputted network representation sequence. According to the embodiments of the present disclosure, significant improvements in translation quality for longer phrases and longer sentences are achieved.
  • Reference is made to FIG. 9, which is a schematic flowchart of a network representation generating method for a neural network according to an embodiment. The method includes the following steps S902 to S914, S9161 to S9167, and S918 to S930.
  • In S902, a source-side vector representation sequence corresponding to an input sequence is obtained.
  • In S904, the source-side vector representation sequence is divided into multiple source-side vector representation subsequences having a low dimension.
  • In S906, different linear transformations are performed on each of the source-side vector representation subsequences by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence.
  • In S908, a logical similarity degree between the request vector sequence and the key vector sequence is calculated.
  • In S910, for each element in the input sequence, a transformation is performed on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element.
  • In S912, a nonlinear transformation is performed on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence.
  • In S914, the second scalar is determined as a center point of a locally strengthened range corresponding to the element.
  • In S9162, for each element in the input sequence, a linear transformation is performed on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element.
  • In S9164, a nonlinear transformation is performed on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence.
  • In S9166, the fourth scalar is determined as a window size of the locally strengthened range corresponding to the element.
  • In S9161, key vectors in the key vector sequence are obtained, and an average of the key vectors is calculated.
  • In S9163, a linear transformation is performed on the average to obtain a fifth scalar.
  • In S9165, a nonlinear transformation is performed on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence.
  • In S9167, the sixth scalar is determined as a uniform window size of locally strengthened ranges respectively corresponding to the elements.
  • In S918, the locally strengthened range corresponding to the element is determined according to the center point and the window size.
  • In S920, association degrees between every two of the elements are calculated based on the locally strengthened ranges, to obtain a locally strengthened matrix.
  • In S922, the logical similarity degree is corrected according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree.
  • In S924, normalization is performed on the locally strengthened logical similarity degree, to obtain locally strengthened attention weight distributions respectively corresponding to the elements.
  • In S926, value vectors in the value vector sequence are fused according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • In S928, multiple network representation subsequences respectively corresponding to source-side vector representation subsequences are spliced, and a linear transformation is performed, to obtain a network representation sequence to be outputted.
  • In S930, with the outputted network representation sequence as a new source-side vector representation sequence, the method returns to step S904 until a final network representation sequence is obtained.
  • According to the network representation generating method for a neural network, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, thereby correcting the original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also an association between elements in the input sequence far away from each other can be retained.
  • It should be noted that, steps in the flowchart in FIG. 9 are sequentially presented as indicated by arrows, but the steps are not necessarily sequentially performed in the order indicated by the arrows. Unless explicitly specified in the present disclosure, the steps are performed without any strict sequence limitation, and may be performed in another order. In addition, at least some of the steps in FIG. 9 may include multiple substeps or multiple stages. The substeps or the stages are not necessarily performed at the same time instant, but may be performed at different time instants. The substeps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with another step or at least some of substeps or stages of the another step.
  • As shown in FIG. 10, a network representation generating apparatus 1000 for a neural network is provided according to an embodiment. The apparatus includes an obtaining module 1002, a linear transformation module 1004, a logical similarity degree calculation module 1006, a locally strengthened matrix construction module 1008, an attention weight distribution determining module 1010, and a fusion module 1012.
  • The obtaining module 1002 is configured to obtain a source-side vector representation sequence corresponding to an input sequence.
  • The linear transformation module 1004 is configured to perform linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence.
  • The logical similarity degree calculation module 1006 is configured to calculate a logical similarity degree between the request vector sequence and the key vector sequence.
  • The locally strengthened matrix construction module 1008 is configured to construct a locally strengthened matrix according to the request vector sequence.
  • The attention weight distribution determining module 1010 is configured to perform a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions corresponding to the elements.
  • The fusion module 1012 is configured to fuse value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: perform, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; perform a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determine the second scalar as the center point of the locally strengthened range corresponding to the element.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: perform, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; perform a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determine the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: obtain key vectors in the key vector sequence; calculate an average of the key vectors; perform a linear transformation on the average to obtain a fifth scalar; perform a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determine the sixth scalar as the uniform window size of the locally strengthened ranges.
  • In an embodiment, the locally strengthened matrix construction module 1008 is further configured to: determine the center point as a mean of a Gaussian distribution, and determine the window size as a variance of the Gaussian distribution; determine the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance; and sequentially arrange the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix. An association degree between two of the elements is calculated according to the following formula:

    G_{ij} = -\frac{2(j - P_i)^2}{D_i^2}

    where G_ij represents an association degree between a j-th element and a center point P_i corresponding to an i-th element in the input sequence, and G_ij is the value of the j-th element of the i-th column vector in the locally strengthened matrix G; P_i represents the center point of the locally strengthened range corresponding to the i-th element; and D_i represents the window size of the locally strengthened range corresponding to the i-th element.
  • In an embodiment, the attention weight distribution determining module 1010 is further configured to: correct the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and perform normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • In an embodiment, the linear transformation module 1004 is further configured to: divide the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and perform different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence. The apparatus further includes a splicing module, configured to: splice network representation subsequences respectively corresponding to the source-side vector representation subsequences, and perform a linear transformation, to obtain a network representation sequence to be outputted.
  • In an embodiment, the apparatus 1000 further includes: a loop module. The loop module is configured to: after the network representation sequence corresponding to the input sequence is obtained, determine the network representation sequence as a new source-side vector representation sequence, and return to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the operations again until a loop stop condition is met, and output a final network representation sequence.
  • According to the network representation generating apparatus 1000 for a neural network, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distribution, so that a network representation sequence with the strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also an association between elements in the input sequence far away from each other can be retained.
  • FIG. 11 is a diagram showing an internal structure of a computer device 120 according to an embodiment. As shown in FIG. 11, the computer device includes a processor, a memory, and a network interface that are connected to each other via a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium in the computer device stores an operating system, and may further store a computer program. The computer program, when executed by the processor, may cause the processor to implement the network representation generating method for a neural network. The internal memory may also store a computer program. The computer program, when executed by the processor, may cause the processor to perform the network representation generating method for a neural network.
  • A person skilled in the art may understand that, the structure shown in FIG. 11 is merely a block diagram of a partial structure related to the solution in the present disclosure, and does not constitute a limitation to the computer device to which the solution of the present disclosure is applied. Actually, the computer device may include more or fewer components than those shown in the figure, or some components may be combined, or a different component deployment may be used.
  • In an embodiment, the network representation generating apparatus 1000 for a neural network provided in the present disclosure may be implemented in a form of a computer program. The computer program may run on the computer device shown in FIG. 11. Program modules forming the network representation generating apparatus 1000 for a neural network, for example, the obtaining module 1002, the linear transformation module 1004, the logical similarity degree calculation module 1006, the locally strengthened matrix construction module 1008, the attention weight distribution determining module 1010, and the fusion module 1012 in FIG. 10, may be stored in the memories of the computer device. The computer program formed by the program modules causes the processor to perform the steps in the network representation generating method for a neural network according to the embodiments of the present disclosure described in this specification.
  • For example, the computer device shown in FIG. 11 may perform step S202 by the obtaining module 1002 in the network representation generating apparatus for a neural network shown in FIG. 10. The computer device may perform step S204 by the linear transformation module 1004. The computer device may perform step S206 by the logical similarity degree calculation module 1006. The computer device may perform step S208 by the locally strengthened matrix construction module 1008. The computer device may perform step S210 by the attention weight distribution determining module 1010. The computer device may perform step S212 by the fusion module 1012.
  • A computer device is provided according to an embodiment. The computer device includes a memory and a processor. The memory stores a computer program. The computer program, when executed by the processor, causes the processor to perform the following steps of: obtaining a source-side vector representation sequence corresponding to an input sequence; performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence; calculating a logical similarity degree between the request vector sequence and the key vector sequence; constructing a locally strengthened matrix according to the request vector sequence; performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
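    For illustration only, the overall flow of these steps can be sketched in NumPy as follows. The scaled-dot-product form of the logical similarity degree, the parameter names W_q, W_k and W_v, and the additive use of the locally strengthened matrix G are assumptions made for the sketch, not the claimed implementation.

        import numpy as np

        def softmax(x, axis=-1):
            # numerically stable normalization into attention weight distributions
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def locally_strengthened_attention(X, W_q, W_k, W_v, G):
            # X: source-side vector representation sequence, shape (I, d)
            # W_q, W_k, W_v: assumed parameter matrices of the linear transformations
            # G: locally strengthened matrix, shape (I, I)
            Q = X @ W_q                          # request vector sequence
            K = X @ W_k                          # key vector sequence
            V = X @ W_v                          # value vector sequence
            E = Q @ K.T / np.sqrt(K.shape[-1])   # logical similarity degrees
            A = softmax(E + G, axis=-1)          # locally strengthened attention weights
            return A @ V                         # fuse value vectors by weighted sum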
  • In an embodiment, the computer program, when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • In an embodiment, the computer program, when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the request vector sequence, the center point of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
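    As a minimal sketch of this step, assuming a tanh hidden layer and a sigmoid as the nonlinear transformation function (the parameter names W_p and U_p are illustrative, not taken from the embodiments):

        import numpy as np

        def center_points(Q, W_p, U_p):
            # Q: request vector sequence, shape (I, d)
            # W_p (d, d), U_p (d,): assumed parameters of the first feedforward network
            I = Q.shape[0]
            first_scalar = np.tanh(Q @ W_p) @ U_p      # one scalar per element
            # the sigmoid maps each scalar into (0, 1); scaling by the sequence
            # length I yields a second scalar proportional to the input length,
            # which serves as the center point P_i
            return I * (1.0 / (1.0 + np.exp(-first_scalar)))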
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the request vector sequence, the window size of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
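    The per-element window size can be sketched in the same way, with separate (assumed) parameters W_d and U_d standing in for the second feedforward neural network:

        import numpy as np

        def window_sizes(Q, W_d, U_d):
            # Q: request vector sequence, shape (I, d)
            I = Q.shape[0]
            third_scalar = np.tanh(Q @ W_d) @ U_d      # third scalar per element
            # fourth scalar in (0, I): the window size D_i of each element
            return I * (1.0 / (1.0 + np.exp(-third_scalar)))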
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the key vector sequence, the uniform window size of the locally strengthened ranges, causes the processor to perform the following steps of: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
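    A sketch of the uniform variant, assuming a parameter vector w for the linear transformation of the averaged key vector:

        import numpy as np

        def uniform_window_size(K, w):
            # K: key vector sequence, shape (I, d); w (d,): assumed parameters
            I = K.shape[0]
            k_mean = K.mean(axis=0)            # average of the key vectors
            fifth_scalar = k_mean @ w          # linear transformation of the average
            # sixth scalar proportional to the input length, shared by all elements
            return I * (1.0 / (1.0 + np.exp(-fifth_scalar)))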
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining the locally strengthened range corresponding to the element according to the center point and the window size, causes the processor to perform the following steps of: determining the center point as a mean of a Gaussian distribution, and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance. The computer program, when executed by the processor to perform the step of calculating the association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix, causes the processor to perform the following steps of: sequentially arraying the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix. An association degree between two of the elements is calculated according to the following formula:

    G_ij = -2(j - P_i)^2 / D_i^2

    where G_ij represents an association degree between the j-th element in the input sequence and the center point P_i corresponding to the i-th element, and G_ij is the value of the j-th element of the i-th column vector in the locally strengthened matrix G; P_i represents the center point of the locally strengthened range corresponding to the i-th element; and D_i represents the window size of the locally strengthened range corresponding to the i-th element.
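    Given the center points and window sizes, the locally strengthened matrix follows directly from the formula above (indexing rows by the querying element i is a sketch-level choice):

        import numpy as np

        def locally_strengthened_matrix(P, D):
            # P: center points, shape (I,); D: window sizes, shape (I,)
            I = P.shape[0]
            j = np.arange(I)
            # G[i, j] = -2 * (j - P_i)**2 / D_i**2; entries are non-positive and
            # peak at the center point, so distant positions are weakened rather
            # than excluded, preserving long-distance dependencies
            return -2.0 * (j[None, :] - P[:, None]) ** 2 / (D[:, None] ** 2)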
  • In an embodiment, the computer program, when executed by the processor to perform the step of performing the nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements, causes the processor to perform the following steps of: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • In an embodiment, the computer program, when executed by the processor to perform the step of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence, causes the processor to perform the following steps of: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence. The computer program, when executed by the processor, causes the processor to further perform the following steps of: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
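    A compact sketch of this stacked multi-head variant, assuming per-subsequence parameter matrices and, for brevity, one shared locally strengthened matrix G (in the described method each subsequence would derive its own G from its own request vectors):

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def multi_head(X, params, W_o, G):
            # X (I, d) is split along the feature axis into H low-dimensional
            # subsequences of shape (I, d // H), each with its own matrices
            subseqs = np.split(X, len(params), axis=-1)
            outs = []
            for x_h, p in zip(subseqs, params):
                Q, K, V = x_h @ p["W_q"], x_h @ p["W_k"], x_h @ p["W_v"]
                E = Q @ K.T / np.sqrt(K.shape[-1])
                outs.append(softmax(E + G, axis=-1) @ V)   # per-head representation
            # splice the network representation subsequences, then apply a final
            # linear transformation to obtain the sequence to be outputted
            return np.concatenate(outs, axis=-1) @ W_o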
  • In an embodiment, the computer program, when executed by the processor, causes the processor to further perform the following steps of: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
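    The repeated application can be sketched as a loop over layers, with a fixed layer count standing in for the loop stop condition (locally_strengthened_attention is the sketch given earlier; in the described method each layer's G would be re-derived from that layer's request vectors, so a precomputed per-layer G is a simplification):

        def stacked_representation(X, layers):
            # layers: list of per-layer parameter sets; each pass produces a
            # network representation sequence that becomes the next layer's
            # source-side vector representation sequence
            for p in layers:
                X = locally_strengthened_attention(X, p["W_q"], p["W_k"], p["W_v"], p["G"])
            return X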
  • According to the computer device, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting the original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also associations between elements in the input sequence that are far away from each other can be retained.
  • A computer-readable storage medium is provided according to an embodiment. The computer-readable storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform the following steps of: obtaining a source-side vector representation sequence corresponding to an input sequence; performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence; calculating a logical similarity degree between the request vector sequence and the key vector sequence; constructing a locally strengthened matrix according to the request vector sequence; performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  • In an embodiment, the computer program, when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  • In an embodiment, the computer program, when executed by the processor to perform the step of constructing the locally strengthened matrix according to the request vector sequence, causes the processor to perform the following steps of: determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determining the locally strengthened range corresponding to the element according to the center point and the window size; and calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the request vector sequence, the center point of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the request vector sequence, the window size of the locally strengthened range corresponding to the element, causes the processor to perform the following steps of: performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining, according to the key vector sequence, the uniform window size of the locally strengthened ranges, causes the processor to perform the following steps of: obtaining key vectors in the key vector sequence; calculating an average of the key vectors; performing a linear transformation on the average to obtain a fifth scalar; performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  • In an embodiment, the computer program, when executed by the processor to perform the step of determining the locally strengthened range corresponding to the element according to the center point and the window size, causes the processor to perform the following steps of: determining the center point as a mean of a Gaussian distribution, and determining the window size as a variance of the Gaussian distribution; and determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance. The computer program, when executed by the processor to perform the step of calculating the association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix, causes the processor to perform the following steps of: sequentially arraying the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix. An association degree between two of the elements is calculated according to the following formula:

    G_ij = -2(j - P_i)^2 / D_i^2

    where G_ij represents an association degree between the j-th element in the input sequence and the center point P_i corresponding to the i-th element, and G_ij is the value of the j-th element of the i-th column vector in the locally strengthened matrix G; P_i represents the center point of the locally strengthened range corresponding to the i-th element; and D_i represents the window size of the locally strengthened range corresponding to the i-th element.
  • In an embodiment, the computer program, when executed by the processor to perform the step of performing the nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements, causes the processor to perform the following steps of: correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  • In an embodiment, the computer program, when executed by the processor to perform the step of performing the linear transformation on the source-side vector representation sequence, to respectively obtain the request vector sequence, the key vector sequence and the value vector sequence corresponding to the source-side vector representation sequence, causes the processor to perform the following steps of: dividing the source-side vector representation sequence into multiple source-side vector representation subsequences having a low dimension; and performing different linear transformations on each of the source-side vector representation subsequences respectively by multiple different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence. The computer program, when executed by the processor, causes the processor to further perform the following steps of: splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
  • In an embodiment, the computer program, when executed by the processor, causes the processor to further perform the following steps of: after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  • According to the computer-readable storage medium, the locally strengthened matrix is constructed based on the request vector sequence corresponding to the input sequence, so that attention weights can be assigned in the locally strengthened range, to strengthen local information. By performing the linear transformation on the source-side vector representation sequence corresponding to the input sequence, the request vector sequence, the key vector sequence and the value vector sequence are obtained. The logical similarity degree is obtained according to the request vector sequence and the key vector sequence. The nonlinear transformation is performed based on the logical similarity degree and the locally strengthened matrix, to obtain the locally strengthened attention weight distributions, correcting the original attention weights. A weighted sum is performed on the value vector sequence according to the locally strengthened attention weight distributions, so that a network representation sequence with the strengthened local information can be obtained. In the obtained network representation sequence, not only the local information can be strengthened, but also associations between elements in the input sequence that are far away from each other can be retained.
  • A person of ordinary skill in the art may understand that some or all procedures in the foregoing method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a non-volatile computer-readable storage medium. When the program is executed, the procedures of the foregoing method embodiments may be performed. Any reference to a memory, a storage, a database, or another medium used in the embodiments provided in the present disclosure may include a non-volatile and/or volatile memory. The non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, or the like. The volatile memory may include a random access memory (RAM) or an external cache. By way of illustration rather than limitation, the RAM may be implemented in various forms, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), a Rambus direct RAM (RDRAM), a direct Rambus dynamic RAM (DRDRAM), and a Rambus dynamic RAM (RDRAM).
  • The technical features in the foregoing embodiments may be randomly combined. For concise description, not all possible combinations of the technical features in the embodiments are described. However, provided that combinations of the technical features do not conflict with each other, the combinations of the technical features are considered as falling within the scope described in this specification.
  • The foregoing embodiments merely show several implementations of the present disclosure, and are described in detail, but cannot be understood as a limitation to the patent scope of the present disclosure. It should be noted that, a person of ordinary skill in the art may further make variations and improvements without departing from the ideas of the present disclosure. The variations and improvements fall within the protection scope of the present disclosure. Therefore, the protection scope of the patent of the present disclosure shall be defined by the appended claims.

Claims (22)

  1. A network representation generating method for a neural network, the method being applied to a computer device and comprising:
    obtaining a source-side vector representation sequence corresponding to an input sequence;
    performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence;
    calculating a logical similarity degree between the request vector sequence and the key vector sequence;
    constructing a locally strengthened matrix according to the request vector sequence;
    performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and
    fusing value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  2. The method according to claim 1, wherein the constructing a locally strengthened matrix according to the request vector sequence comprises:
    determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements;
    determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element;
    determining the locally strengthened range corresponding to the element according to the center point and the window size; and
    calculating association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  3. The method according to claim 1, wherein the constructing a locally strengthened matrix according to the request vector sequence comprises:
    determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements;
    determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements;
    determining the locally strengthened range corresponding to the element according to the center point and the window size; and
    calculating association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  4. The method according to claim 2 or 3, wherein the determining, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements comprises:
    performing, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element;
    performing a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and
    determining the second scalar as the center point of the locally strengthened range corresponding to the element.
  5. The method according to claim 2, wherein the determining, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element comprises:
    performing, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element;
    performing a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and
    determining the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  6. The method according to claim 3, wherein the determining, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements comprises:
    obtaining key vectors in the key vector sequence;
    calculating an average of the key vectors;
    performing a linear transformation on the average to obtain a fifth scalar;
    performing a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and
    determining the sixth scalar as the uniform window size of the locally strengthened ranges.
  7. The method according to claim 2 or 3, wherein
    the determining the locally strengthened range corresponding to the element according to the center point and the window size comprises:
    determining the center point as a mean of a Gaussian distribution, and determining the window size as a variance of the Gaussian distribution; and
    determining the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance; and
    calculating the association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix comprises:
    sequentially arraying the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix, an association degree between two of the elements being calculated according to the following formula:

    G_ij = -2(j - P_i)^2 / D_i^2,

    wherein G_ij represents an association degree between a j-th element in the input sequence and a center point P_i corresponding to an i-th element, and G_ij is a value of a j-th element of an i-th column vector in a locally strengthened matrix G; P_i represents a center point of a locally strengthened range corresponding to the i-th element; and D_i represents a window size of the locally strengthened range corresponding to the i-th element.
  8. The method according to any one of claims 1 to 3, wherein the performing a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to the elements comprises:
    correcting the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and
    performing normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  9. The method according to any one of claims 1 to 3, wherein
    the performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence comprises:
    dividing the source-side vector representation sequence into a plurality of source-side vector representation subsequences having a low dimension; and
    performing different linear transformations on each of the source-side vector representation subsequences respectively by a plurality of different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence; and
    the method further comprises:
    splicing network representation subsequences respectively corresponding to the source-side vector representation subsequences, and performing a linear transformation, to obtain a network representation sequence to be outputted.
  10. The method according to any one of claims 1 to 3, further comprising:
    after the network representation sequence corresponding to the input sequence is obtained, determining the network representation sequence as a new source-side vector representation sequence, and returning to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the method again until a loop stop condition is met, and outputting a final network representation sequence.
  11. A network representation generating apparatus for a neural network, the apparatus comprising:
    an obtaining module, configured to obtain a source-side vector representation sequence corresponding to an input sequence;
    a linear transformation module, configured to perform a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence;
    a logical similarity degree calculation module, configured to calculate a logical similarity degree between the request vector sequence and the key vector sequence;
    a locally strengthened matrix construction module, configured to construct a locally strengthened matrix according to the request vector sequence;
    an attention weight distribution determining module, configured to perform a nonlinear transformation based on the logical similarity degree and the locally strengthened matrix, to obtain locally strengthened attention weight distributions respectively corresponding to elements; and
    a fusion module, configured to fuse value vectors in the value vector sequence according to the attention weight distributions, to obtain a network representation sequence corresponding to the input sequence.
  12. The apparatus according to claim 11, wherein the locally strengthened matrix construction module is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the request vector sequence, a window size of the locally strengthened range corresponding to the element; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on locally strengthened ranges respectively corresponding to the elements, to obtain the locally strengthened matrix.
  13. The apparatus according to claim 11, wherein the locally strengthened matrix construction module is further configured to: determine, according to the request vector sequence, a center point of a locally strengthened range corresponding to each of the elements; determine, according to the key vector sequence, a uniform window size of locally strengthened ranges respectively corresponding to the elements; determine the locally strengthened range corresponding to the element according to the center point and the window size; and calculate association degrees between every two of the elements based on the locally strengthened ranges, to obtain the locally strengthened matrix.
  14. The apparatus according to claim 12 or 13, wherein the locally strengthened matrix construction module is further configured to: perform, for each element in the input sequence, a transformation on a request vector corresponding to the element in the request vector sequence by a first feedforward neural network, to obtain a first scalar corresponding to the element; perform a nonlinear transformation on the first scalar by a nonlinear transformation function, to obtain a second scalar proportional to a length of the input sequence; and determine the second scalar as the center point of the locally strengthened range corresponding to the element.
  15. The apparatus according to claim 12, wherein the locally strengthened matrix construction module is further configured to: perform, for each element in the input sequence, a linear transformation on a request vector corresponding to the element in the request vector sequence by a second feedforward neural network, to obtain a third scalar corresponding to the element; perform a nonlinear transformation on the third scalar by a nonlinear transformation function, to obtain a fourth scalar proportional to a length of the input sequence; and determine the fourth scalar as the window size of the locally strengthened range corresponding to the element.
  16. The apparatus according to claim 13, wherein the locally strengthened matrix construction module is further configured to: obtain key vectors in the key vector sequence; calculate an average of the key vectors; perform a linear transformation on the average to obtain a fifth scalar; perform a nonlinear transformation on the fifth scalar by a nonlinear transformation function, to obtain a sixth scalar proportional to a length of the input sequence; and determine the sixth scalar as the uniform window size of the locally strengthened ranges.
  17. The apparatus according to claim 12 or 13, wherein the locally strengthened matrix construction module is further configured to: determine the center point as a mean of a Gaussian distribution, and determine the window size as a variance of the Gaussian distribution; determine the locally strengthened range according to the Gaussian distribution determined based on the mean and the variance; and sequentially array the association degrees between every two of the elements according to an order of the elements in the input sequence, to obtain the locally strengthened matrix, an association degree between two of the elements being calculated according to the following formula:

    G_ij = -2(j - P_i)^2 / D_i^2,

    wherein G_ij represents an association degree between a j-th element in the input sequence and a center point P_i corresponding to an i-th element, and G_ij is a value of a j-th element of an i-th column vector in a locally strengthened matrix G; P_i represents a center point of a locally strengthened range corresponding to the i-th element; and D_i represents a window size of the locally strengthened range corresponding to the i-th element.
  18. The apparatus according to any one of claims 11 to 13, wherein the attention weight distribution determining module is further configured to: correct the logical similarity degree according to the locally strengthened matrix, to obtain a locally strengthened logical similarity degree; and perform normalization on the locally strengthened logical similarity degree, to obtain the locally strengthened attention weight distributions respectively corresponding to the elements.
  19. The apparatus according to any one of claims 11 to 13, wherein the linear transformation module is further configured to: divide the source-side vector representation sequence into a plurality of source-side vector representation subsequences having a low dimension; and perform different linear transformations on each of the source-side vector representation subsequences respectively by a plurality of different parameter matrices, to obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation subsequence; and
    the apparatus further comprises:
    a splicing module, configured to splice network representation subsequences respectively corresponding to the source-side vector representation subsequences, and perform a linear transformation, to obtain a network representation sequence to be outputted.
  20. The apparatus according to any one of claims 11 to 13, further comprising:
    a loop module, configured to: after the network representation sequence corresponding to the input sequence is obtained, determine the network representation sequence as a new source-side vector representation sequence, and return to the step of performing a linear transformation on the source-side vector representation sequence, to respectively obtain a request vector sequence, a key vector sequence and a value vector sequence corresponding to the source-side vector representation sequence to perform the operations again until a loop stop condition is met, and output a final network representation sequence.
  21. A computer-readable storage medium, storing a computer program that, when executed by a processor, causes the processor to perform the operations of the method according to any one of claims 1 to 10.
  22. A computer device, comprising:
    a memory storing a computer program; and
    a processor, wherein
    the computer program, when executed by the processor, causes the processor to perform the operations of the method according to any one of claims 1 to 10.
EP19857335.4A 2018-09-04 2019-08-12 Method and apparatus for generating network representation of neural network, storage medium, and device Pending EP3848856A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811027795.XA CN109034378B (en) 2018-09-04 2018-09-04 Network representation generation method and device of neural network, storage medium and equipment
PCT/CN2019/100212 WO2020048292A1 (en) 2018-09-04 2019-08-12 Method and apparatus for generating network representation of neural network, storage medium, and device

Publications (2)

Publication Number Publication Date
EP3848856A1 true EP3848856A1 (en) 2021-07-14
EP3848856A4 EP3848856A4 (en) 2021-11-17

Family

ID=64623896

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19857335.4A Pending EP3848856A4 (en) 2018-09-04 2019-08-12 Method and apparatus for generating network representation of neural network, storage medium, and device

Country Status (5)

Country Link
US (1) US11875220B2 (en)
EP (1) EP3848856A4 (en)
JP (1) JP7098190B2 (en)
CN (1) CN109034378B (en)
WO (1) WO2020048292A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034378B (en) * 2018-09-04 2023-03-31 腾讯科技(深圳)有限公司 Network representation generation method and device of neural network, storage medium and equipment
CN109918630B (en) * 2019-01-23 2023-08-04 平安科技(深圳)有限公司 Text generation method, device, computer equipment and storage medium
CN110008482B (en) * 2019-04-17 2021-03-09 腾讯科技(深圳)有限公司 Text processing method and device, computer readable storage medium and computer equipment
CN110276082B (en) * 2019-06-06 2023-06-30 百度在线网络技术(北京)有限公司 Translation processing method and device based on dynamic window
CN110347790B (en) * 2019-06-18 2021-08-10 广州杰赛科技股份有限公司 Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN110705273B (en) * 2019-09-02 2023-06-13 腾讯科技(深圳)有限公司 Information processing method and device based on neural network, medium and electronic equipment
US11875131B2 (en) * 2020-09-16 2024-01-16 International Business Machines Corporation Zero-shot cross-lingual transfer learning
CN112434527A (en) * 2020-12-03 2021-03-02 上海明略人工智能(集团)有限公司 Keyword determination method and device, electronic equipment and storage medium
CN112785848B (en) * 2021-01-04 2022-06-17 清华大学 Traffic data prediction method and system
CN112967112B (en) * 2021-03-24 2022-04-29 武汉大学 Electronic commerce recommendation method for self-attention mechanism and graph neural network
CN113392139B (en) * 2021-06-04 2023-10-20 中国科学院计算技术研究所 Environment monitoring data completion method and system based on association fusion
CN113254592B (en) * 2021-06-17 2021-10-22 成都晓多科技有限公司 Comment aspect detection method and system of multi-level attention model based on door mechanism
CN113378791B (en) * 2021-07-09 2022-08-05 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
CN113283235B (en) * 2021-07-21 2021-11-19 明品云(北京)数据科技有限公司 User label prediction method and system
CN113887325A (en) * 2021-09-10 2022-01-04 北京三快在线科技有限公司 Model training method, expression recognition method and device
CN117180952B (en) * 2023-11-07 2024-02-02 湖南正明环保股份有限公司 Multi-directional airflow material layer circulation semi-dry flue gas desulfurization system and method thereof

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09297112A (en) * 1996-03-08 1997-11-18 Mitsubishi Heavy Ind Ltd Structure parameter analysis device and analysis method
US7496546B2 (en) * 2003-03-24 2009-02-24 Riken Interconnecting neural network system, interconnecting neural network structure construction method, self-organizing neural network structure construction method, and construction programs therefor
CN104765728B (en) 2014-01-08 2017-07-18 富士通株式会社 The method trained the method and apparatus of neutral net and determine sparse features vector
EP3141610A1 (en) * 2015-09-12 2017-03-15 Jennewein Biotechnologie GmbH Production of human milk oligosaccharides in microbial hosts with engineered import / export
CN106056526B (en) * 2016-05-26 2019-04-12 南昌大学 A kind of resume image based on parsing rarefaction representation and compressed sensing
CN106096640B (en) * 2016-05-31 2019-03-26 合肥工业大学 A kind of feature dimension reduction method of multi-mode system
CN106339564B (en) * 2016-09-06 2017-11-24 西安石油大学 A kind of perforating scheme method for optimizing based on Grey Correlation Cluster
CN106571135B (en) * 2016-10-27 2020-06-09 苏州大学 Ear voice feature extraction method and system
US11188824B2 (en) * 2017-02-17 2021-11-30 Google Llc Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
CN107025219B (en) * 2017-04-19 2019-07-26 厦门大学 A kind of word insertion representation method based on internal Semantic hierarchy
CN107180247A (en) * 2017-05-19 2017-09-19 中国人民解放军国防科学技术大学 Relation grader and its method based on selective attention convolutional neural networks
CN107345860B (en) * 2017-07-11 2019-05-31 南京康尼机电股份有限公司 Rail vehicle door sub-health state recognition methods based on Time Series Data Mining
GB2566257A (en) * 2017-08-29 2019-03-13 Sky Cp Ltd System and method for content discovery
CN107783960B (en) * 2017-10-23 2021-07-23 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting information
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device
CN108256172B (en) * 2017-12-26 2021-12-07 同济大学 Dangerous case early warning and forecasting method in process of pipe jacking and downward passing existing box culvert
CN108537822B (en) * 2017-12-29 2020-04-21 西安电子科技大学 Moving target tracking method based on weighted confidence estimation
CN108334499B (en) * 2018-02-08 2022-03-18 海南云江科技有限公司 Text label labeling device and method and computing device
CN108828533B (en) * 2018-04-26 2021-12-31 电子科技大学 Method for extracting similar structure-preserving nonlinear projection features of similar samples
CN109034378B (en) * 2018-09-04 2023-03-31 腾讯科技(深圳)有限公司 Network representation generation method and device of neural network, storage medium and equipment

Also Published As

Publication number Publication date
CN109034378A (en) 2018-12-18
WO2020048292A1 (en) 2020-03-12
JP7098190B2 (en) 2022-07-11
CN109034378B (en) 2023-03-31
US20210042603A1 (en) 2021-02-11
US11875220B2 (en) 2024-01-16
EP3848856A4 (en) 2021-11-17
JP2021517316A (en) 2021-07-15

Similar Documents

Publication Publication Date Title
EP3848856A1 (en) Method and apparatus for generating network representation of neural network, storage medium, and device
US11853709B2 (en) Text translation method and apparatus, storage medium, and computer device
CN109146064B (en) Neural network training method, device, computer equipment and storage medium
KR102180002B1 (en) Attention-based sequence transformation neural network
US11948066B2 (en) Processing sequences using convolutional neural networks
EP3745394B1 (en) End-to-end text-to-speech conversion
CN109271646A (en) Text interpretation method, device, readable storage medium storing program for executing and computer equipment
CN106910497B (en) Chinese word pronunciation prediction method and device
WO2021196954A1 (en) Serialized data processing method and device, and text processing method and device
EP3893163A1 (en) End-to-end graph convolution network
CN115797495B (en) Method for generating image by sentence-character semantic space fusion perceived text
EP3958148A1 (en) Method and device for generating hidden state in recurrent neural network for language processing
CN108959388B (en) Information generation method and device
US11210474B2 (en) Language processing using a neural network
CN111310464A (en) Word vector acquisition model generation method and device and word vector acquisition method and device
KR20200095789A (en) Method and apparatus for building a translation model
CN111597339B (en) Document-level multi-round dialogue intention classification method, device, equipment and storage medium
CN115017178A (en) Training method and device for data-to-text generation model
CN112837673B (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN111797220B (en) Dialog generation method, apparatus, computer device and storage medium
KR20190103011A (en) Distance based deep learning
CN111832699A (en) Computationally efficient expressive output layer for neural networks
CN114238549A (en) Training method and device of text generation model, storage medium and computer equipment
Wang Recurrent neural network
CN113434652B (en) Intelligent question-answering method, intelligent question-answering device, equipment and storage medium

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20210406

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

A4 Supplementary search report drawn up and despatched

Effective date: 20211015

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 40/47 20200101ALI20211011BHEP

Ipc: G06N 3/04 20060101AFI20211011BHEP

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED