CN115630651B - Text generation method and training method and device of text generation model - Google Patents

Text generation method and training method and device of text generation model

Info

Publication number
CN115630651B
CN115630651B (application CN202211306837.XA)
Authority
CN
China
Prior art keywords
text
feature
sequence
hidden state
feature sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211306837.XA
Other languages
Chinese (zh)
Other versions
CN115630651A (en)
Inventor
王凡
鲍思琪
何煌
吴华
林英展
黄世维
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211306837.XA priority Critical patent/CN115630651B/en
Publication of CN115630651A publication Critical patent/CN115630651A/en
Application granted granted Critical
Publication of CN115630651B publication Critical patent/CN115630651B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a text generation method and a training method and apparatus of a text generation model, relates to the field of artificial intelligence, and in particular to the technical fields of deep learning, natural language processing, intelligent voice and the like. A specific implementation of the text generation method includes: preprocessing a text to be processed to obtain an embedded feature sequence, wherein the embedded feature sequence includes embedded features corresponding to text units; inputting the embedded feature sequence into an attention network formed by encoding units to obtain a text feature sequence output by the attention network; and decoding the text feature sequence to generate a following text of the text to be processed, the encoding unit being configured to: encode the input feature sequence using an attention mechanism to obtain a first feature sequence; adjust the first feature sequence according to hidden state features to obtain a second feature sequence, the hidden state features characterizing the semantics of the preceding text of the text to be processed; and update the hidden state features according to the second feature sequence.

Description

Text generation method and training method and device of text generation model
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the technical fields of deep learning, natural language processing, intelligent voice and the like, and specifically relates to a text generation method, a training method of a text generation model, and corresponding apparatuses, electronic devices and storage media.
Background
With the development of computer technology and network technology, models based on the self-attention mechanism are widely used. For example, in the field of natural language processing, the self-attention mechanism may be relied upon to capture long-distance semantic features in text. However, due to its limited coding length, the self-attention mechanism is generally unable to remember long-term information.
Disclosure of Invention
The present disclosure provides a text generation method, apparatus, electronic device, and storage medium that can generate text using long-term memory, with the aim of improving the accuracy of the generated text.
According to one aspect of the present disclosure, there is provided a text generation method including: preprocessing a text to be processed to obtain an embedded feature sequence, wherein the embedded feature sequence comprises embedded features corresponding to text units in the text to be processed; inputting the embedded feature sequence into an attention network formed by the coding units to obtain a text feature sequence output by the attention network; and decoding the text feature sequence to generate a subsequent text of the text to be processed, wherein the encoding unit is configured to: encoding the input feature sequence by adopting an attention mechanism to obtain a first feature sequence; adjusting the first characteristic sequence according to the hidden state characteristics to obtain a second characteristic sequence; the hidden state features characterize the semantics of the preceding text of the text to be processed; and updating the hidden state feature according to the second feature sequence.
According to another aspect of the present disclosure, there is provided a training method of a text generation model, wherein the text generation model includes a preprocessing network, an attention network, and a decoding network; the attention network is composed of encoding units. The training method includes: preprocessing each target text in a text sequence using the preprocessing network to obtain an embedded feature sequence, wherein the embedded feature sequence includes embedded features corresponding to text units in each target text; inputting the embedded feature sequence into the attention network to obtain a text feature sequence output by the attention network; decoding the text feature sequence using the decoding network to generate a predicted following text of each target text; and training the text generation model according to the predicted following text and the adjacent following text of each target text in the text sequence; wherein the encoding unit is configured to perform the following operations: encoding the input feature sequence using an attention mechanism to obtain a first feature sequence; adjusting the first feature sequence according to hidden state features to obtain a second feature sequence, the hidden state features characterizing the semantics of the preceding text of each target text; and updating the hidden state features according to the second feature sequence.
According to another aspect of the present disclosure, there is provided a text generating apparatus including: a preprocessing module for preprocessing a text to be processed to obtain an embedded feature sequence, wherein the embedded feature sequence includes embedded features corresponding to text units in the text to be processed; a text feature obtaining module for inputting the embedded feature sequence into an attention network formed by encoding units to obtain a text feature sequence output by the attention network; and a feature decoding module for decoding the text feature sequence to generate a following text of the text to be processed, wherein the text feature obtaining module includes: an encoding sub-module for encoding, for the encoding unit, the input feature sequence using an attention mechanism to obtain a first feature sequence; an adjusting sub-module for adjusting the first feature sequence according to hidden state features to obtain a second feature sequence, the hidden state features characterizing the semantics of the preceding text of the text to be processed; and an updating sub-module for updating the hidden state features according to the second feature sequence.
According to another aspect of the present disclosure, there is provided a training apparatus of a text generation model, wherein the text generation model includes a preprocessing network, an attention network, and a decoding network; the attention network is composed of encoding units. The training apparatus includes: a preprocessing module for preprocessing each target text in a text sequence using the preprocessing network to obtain an embedded feature sequence, wherein the embedded feature sequence includes embedded features corresponding to text units in each target text; a text feature obtaining module for inputting the embedded feature sequence into the attention network to obtain a text feature sequence output by the attention network; a feature decoding module for decoding the text feature sequence using the decoding network to generate a predicted following text of each target text; and a model training module for training the text generation model according to the predicted following text and the adjacent following text of each target text in the text sequence; wherein the text feature obtaining module includes: an encoding sub-module for encoding, for the encoding unit, the input feature sequence using an attention mechanism to obtain a first feature sequence; an adjusting sub-module for adjusting the first feature sequence according to hidden state features to obtain a second feature sequence, the hidden state features characterizing the semantics of the preceding text of each target text; and an updating sub-module for updating the hidden state features according to the second feature sequence.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text generation method and/or the training method of the text generation model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text generation method and/or training method of the text generation model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program/instructions stored on at least one of a readable storage medium and an electronic device, which, when executed by a processor, implement the text generation method and/or the training method of the text generation model provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is an application scenario schematic diagram of a text generation method and training method and apparatus of a text generation model according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a text generation method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of adjusting a first feature sequence according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of updating hidden status features according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an implementation of a text generation method according to an embodiment of the present disclosure;
FIG. 6 is a flow diagram of a training method of a text generation model according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a structure of a text generating apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a training device of a text generation model according to an embodiment of the present disclosure; and
fig. 9 is a block diagram of an electronic device used to implement a text generation method and/or a training method for a text generation model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An attention network constructed based on a self-attention mechanism may encode a piece of text, i.e., process the piece of text so as to express it as a sequence of vectors. For example, the attention network may be a Transformer network. However, the attention network can only encode a small piece of text and does not have the ability to encode that piece of text in combination with the semantics of its preceding text. Although the attention network can adjust its network parameters by learning the semantics of a large amount of text through extensive pre-training, the network parameters are fixed after training. In an actual usage scenario, the attention network therefore cannot encode the current text according to the semantics of previously entered text when completing a prediction task. This makes it difficult for the attention network to grasp the large amount of information in a vertical domain.
For example, when an attention network is applied to a question-answer scenario for a certain product, it cannot handle the question-answer task well if it lacks the detailed introduction documents of that product. Instead, the attention network needs to be trained individually on a large number of introduction documents for that product, and the attention network obtained by such training has weak generalization ability.
To address the weak generalization ability, methods such as transfer learning, meta-learning, pre-training, and prompt learning have emerged. However, these methods are still limited by the coding length of the self-attention mechanism and can only encode relatively short text. Thus, when encoding the current text, the semantics of its preceding text cannot be taken into account, which limits the accuracy of the information expressed by the encoded features and affects the accuracy of the generated text.
In order to solve the problem, the present disclosure provides a text generation method and a training method, device, equipment and medium of a text generation model, and the application scenario of the method and device provided by the present disclosure will be described in detail below with reference to fig. 1.
Fig. 1 is an application scenario schematic diagram of a text generation method and training method and apparatus of a text generation model according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be various electronic devices with processing functions, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a server, and the like.
The electronic device 110 may have text processing functionality for processing the entered text 120 to predict the subsequent text 130 of the text 120. In an embodiment, the electronic device 110 may also have an intelligent voice function, for example, for converting a voice signal provided by a user into text 120, and generating a subsequent text 130 of the text, and converting the subsequent text 130 into a voice signal for playing, so as to implement intelligent interaction with the user.
Illustratively, the electronic device 110 may employ a model that combines a self-attention mechanism and a recurrence mechanism to encode the text 120. For example, a Transformer-XL model or a Block Recurrent Transformer model, both of which address long-sequence problems, may be employed, where XL stands for Extra Long. With these models, the coding length of the attention network is no longer limited to a local area; instead, the subsequent inference process can be influenced and corrected by recursively accumulating long-term information. For example, in a text generation scenario, the text generation model may include such a model combining the self-attention mechanism and the recurrence mechanism.
Illustratively, the electronic device 110 may also employ the text generation methods provided by the present disclosure to process the text 120 to generate the subsequent text 130. Thus, the length of the semantics of the model memory can be prolonged, the information before a longer time is continuously memorized, and the precision of the generated subsequent text 130 is improved. Accordingly, the electronic device 110 may employ the text generation model 140 provided by the present disclosure to implement a text generation method.
As shown in fig. 1, the application scenario 100 may further include a server 150, where the server 150 may be a background management server that supports the operation of client applications in the electronic device 110. Electronic device 110 may be communicatively coupled to server 150 via a network, which may include wired or wireless communication links. Server 150 may also be a cloud server, a server of a distributed system, or a server that incorporates a blockchain.
For example, the server 150 may train the text generation model 140 with a large amount of text and send the trained text generation model 140 to the electronic device 110 in response to an acquisition request by the electronic device 110 to generate the subsequent text 130 with the text generation model 140 by the electronic device 110.
In one embodiment, electronic device 110 may also send text 120 to server 150, and server 150 may process text 120 using the trained text generation model to obtain subsequent text 130.
It should be noted that, the text generating method provided in the present disclosure may be executed by the electronic device 110 or may be executed by the server 150. Accordingly, the text generating apparatus provided by the present disclosure may be disposed in the electronic device 110 or may be disposed in the server 150. The training method of the text generation model provided by the present disclosure may be performed by the server 150. Accordingly, the training device of the text generation model provided by the present disclosure may be provided in the server 150.
It should be understood that the number and type of electronic devices 110 and servers 150 in fig. 1 are merely illustrative. There may be any number and type of electronic devices 110 and servers 150 as desired for implementation.
The text generation method provided by the present disclosure will be described in detail below with reference to fig. 2 to 5.
Fig. 2 is a flow diagram of a text generation method according to an embodiment of the present disclosure.
As shown in fig. 2, the text generation method 200 of this embodiment may include operations S210 to S230. Wherein, in operation S220, each encoding unit in the attention network may be configured to perform operations S221 to S223.
In operation S210, a text to be processed is preprocessed to obtain an embedded feature sequence.
According to an embodiment of the present disclosure, operation S210 may segment the text to be processed into a plurality of text units, which may constitute a text unit sequence. For example, if the text is divided into L text units, the text X to be processed may be represented as X = (x_1, x_2, ..., x_L), where x_i denotes the i-th of the L text units, L is an integer greater than 1, and i takes values in [1, L]. The text units may be text units of any granularity, for example character granularity or word granularity, which is not limited in this disclosure.
After obtaining the text unit sequence, according to an embodiment of the present disclosure, the embodiment may, for example, perform an embedding process on the text unit sequence to obtain the embedded feature sequence. For example, for text unit x_i, the embedded feature E(x_i) is obtained. In this embodiment, an embedding layer may be used to embed the text units, so that a dense feature sequence is obtained through linear processing. For example, the embedding layer may be a fully connected layer, which is not limited by the present disclosure. The embedded feature sequence obtained in this embodiment includes embedded features corresponding to the text units in the text to be processed, i.e., the embedded features in the embedded feature sequence correspond one-to-one with the text units in the text unit sequence.
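As a non-limiting illustration of this preprocessing step, the following Python (PyTorch) sketch splits a text into character-granularity text units and maps each unit to an embedded feature through an embedding layer; the Preprocessor class, the vocabulary, and the dimensions are assumptions of the sketch rather than details taken from the patent.

# Hedged sketch of the preprocessing step (operation S210): split the text to be
# processed into text units and map each unit to an embedded feature E(x_i).
# Vocabulary handling and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class Preprocessor(nn.Module):
    def __init__(self, vocab, embed_dim=256):
        super().__init__()
        self.vocab = {unit: idx for idx, unit in enumerate(vocab)}
        # An embedding lookup plays the role of the fully connected embedding layer.
        self.embedding = nn.Embedding(len(self.vocab), embed_dim)

    def forward(self, text: str) -> torch.Tensor:
        # Character-granularity text units x_1, ..., x_L (word granularity is equally possible).
        units = list(text)
        ids = torch.tensor([self.vocab.get(u, 0) for u in units])
        return self.embedding(ids)          # shape (L, embed_dim): one embedded feature per unit


if __name__ == "__main__":
    pre = Preprocessor(vocab=["<unk>"] + list("abcdefghij 你好"), embed_dim=8)
    embedded_sequence = pre("abba 你好")
    print(embedded_sequence.shape)          # torch.Size([7, 8])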
In operation S220, the embedded feature sequence is input into an attention network composed of encoding units, resulting in a text feature sequence output by the attention network.
According to an embodiment of the present disclosure, the attention network may be constituted, for example, by at least two encoding units connected in sequence, and may be understood as an encoding network that processes the embedded feature sequence based on a self-attention mechanism to extract the contextual semantic features of the text X. The embodiment may input the embedded feature sequence into the first of the at least two sequentially connected encoding units and, after processing by the at least two encoding units, take the feature sequence output by the last encoding unit as the text feature sequence.
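As a non-limiting illustration, the sketch below shows how an attention network built from sequentially connected encoding units could pass the embedded feature sequence through and output the last unit's result as the text feature sequence; the EncodingUnit body here is a trivial stand-in, and a fuller sketch of the unit's internals appears later in this description.

# Hedged sketch of the attention network as a chain of encoding units (operation S220).
# The EncodingUnit below is a placeholder with a trivial body.
import torch
import torch.nn as nn


class EncodingUnit(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)       # placeholder for attention + hidden-state logic

    def forward(self, features):              # (L, dim) -> (L, dim)
        return torch.relu(self.proj(features))


class AttentionNetwork(nn.Module):
    def __init__(self, dim=8, num_units=3):
        super().__init__()
        # At least two sequentially connected encoding units.
        self.units = nn.ModuleList([EncodingUnit(dim) for _ in range(num_units)])

    def forward(self, embedded_sequence):
        features = embedded_sequence
        for unit in self.units:                # output of unit j-1 feeds unit j
            features = unit(features)
        return features                        # text feature sequence from the last unit


if __name__ == "__main__":
    net = AttentionNetwork(dim=8, num_units=3)
    print(net(torch.randn(5, 8)).shape)        # torch.Size([5, 8])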
According to an embodiment of the present disclosure, each encoding unit in the attention network may be configured to perform operations S221 to S223, for example.
In operation S221, the input feature sequence is encoded using an attention mechanism, resulting in a first feature sequence.
According to an embodiment of the present disclosure, for the encoding unit that is first ranked, the input feature sequence may include the embedded feature sequence obtained in operation S210. For other coding units except the first coding unit, the input characteristic sequence is the characteristic sequence output by the previous coding unit connected with the other coding unit.
In this embodiment, the first feature sequence may be obtained by performing attention computation on features in the input feature sequence, thereby implementing encoding of the input feature sequence. For example, a self-attention mechanism may be employed to encode the input sequence of features. Namely, the self-attention principle is adopted to perform attention operation on every two characteristics.
In an embodiment, the encoding unit may include a self-attention layer as in a Transformer encoder. The self-attention layer may be constructed using a multi-head attention mechanism; then, for the i-th feature in the input feature sequence, encoding can be achieved using the following equation (1) to obtain the corresponding feature in the first feature sequence. Here j denotes the position of the encoding unit among the at least two sequentially connected encoding units, j takes values in [1, N], and N is the total number of encoding units in the attention network; the i-th feature input to the j-th encoding unit is the i-th feature of the feature sequence output by the (j-1)-th encoding unit.

[Equation (1) appears here as a formula image: multi-head self-attention encoding of the i-th input feature of the j-th encoding unit.]

For example, for the first of the at least two encoding units, the input feature sequence may be the embedded feature sequence constructed from the embedded features E(x_i). Alternatively, the attention network may regularize the embedded feature sequence and take the regularized embedded feature sequence as the input feature sequence of the first encoding unit. For example, the i-th embedded feature in the regularized embedded feature sequence may be calculated using the following equation (2).

[Equation (2) appears here as a formula image: regularization of the i-th embedded feature E(x_i).]
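As a non-limiting illustration of operation S221, the sketch below produces a first feature sequence with multi-head self-attention, with a LayerNorm over the embedded sequence standing in for the regularization of equation (2); the head count, dimensions, and use of PyTorch's MultiheadAttention are assumptions of the sketch.

# Hedged sketch of operation S221: encode the input feature sequence with
# multi-head self-attention to obtain the first feature sequence. LayerNorm of
# the embedded sequence stands in for the regularization of equation (2).
import torch
import torch.nn as nn

dim, num_heads, L = 16, 4, 6
pre_norm = nn.LayerNorm(dim)                       # regularization of the embedded features
self_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)

embedded_sequence = torch.randn(1, L, dim)         # batch of 1, L text units
inputs = pre_norm(embedded_sequence)               # input feature sequence of the first unit

# Self-attention: every feature attends to every other feature of the same text.
first_feature_sequence, _ = self_attention(inputs, inputs, inputs)
print(first_feature_sequence.shape)                # torch.Size([1, 6, 16])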
In operation S222, the first feature sequence is adjusted according to the hidden status feature to obtain the second feature sequence.
In operation S223, the hidden state features are updated according to the second feature sequence.
According to embodiments of the present disclosure, hidden state features are similar to hidden variables in recurrent neural networks for storing and propagating historical information as one dynamic variable. For example, hidden state features may characterize the semantics of the preceding text of the text to be processed. The embodiment can realize the adjustment of the first characteristic sequence by weighting and fusing the hidden state characteristic and the first characteristic sequence. Alternatively, the first feature sequence may be connected to the hidden state feature by concat (), and then the connected feature is processed linearly to implement adjustment of the first feature sequence.
After obtaining the second feature sequence, the embodiment may obtain an update amount of the hidden state feature according to the predetermined learning rate and the second feature sequence, and then update the hidden state feature according to the update amount. The predetermined learning rate may be set according to actual requirements, which is not limited in the present disclosure.
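As a non-limiting illustration of the concat()-based variant of operations S222 and S223, the following sketch concatenates the hidden state feature with each first-sequence feature, projects the result linearly to obtain the second feature sequence, and then nudges the hidden state toward an update amount derived from the second feature sequence at a predetermined learning rate; the shapes and the mean-based update amount are assumptions of the sketch.

# Hedged sketch of operations S222/S223 with the concat()-based variant.
import torch
import torch.nn as nn

dim, L = 16, 6
adjust = nn.Linear(2 * dim, dim)

first_feature_sequence = torch.randn(L, dim)
hidden_state = torch.zeros(dim)                    # semantics of the preceding text

# S222: adjust the first feature sequence according to the hidden state feature.
expanded = hidden_state.expand(L, dim)
second_feature_sequence = adjust(torch.cat([first_feature_sequence, expanded], dim=-1))

# S223: update the hidden state feature according to the second feature sequence.
learning_rate = 0.1                                # predetermined learning rate
update_amount = second_feature_sequence.mean(dim=0)
hidden_state = hidden_state + learning_rate * update_amount
print(second_feature_sequence.shape, hidden_state.shape)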
In operation S230, the text feature sequence is decoded, generating a following text of the text to be processed.
According to embodiments of the present disclosure, a decoder based on a recurrent neural network or a Transformer architecture may be employed to decode the text feature sequence, and the decoder generates the following text of the text to be processed. For example, if the text to be processed is the text of an inquiry sentence, the following text may be the text of a reply sentence.
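As a non-limiting illustration of operation S230, the following sketch decodes the text feature sequence with a small GRU-based greedy decoder; the patent only requires a decoder based on a recurrent neural network or Transformer architecture, so the vocabulary size, start token, and length limit here are assumptions.

# Hedged sketch of operation S230: decode the text feature sequence into a
# following text with a single-layer GRU and greedy token selection.
import torch
import torch.nn as nn

dim, vocab_size, max_len = 16, 100, 8
gru = nn.GRUCell(dim, dim)
to_vocab = nn.Linear(dim, vocab_size)
token_embed = nn.Embedding(vocab_size, dim)

text_feature_sequence = torch.randn(6, dim)            # output of the attention network
state = text_feature_sequence.mean(dim=0, keepdim=True)  # (1, dim): summary of the text features
token = torch.zeros(1, dtype=torch.long)                # assumed start-of-sequence id

generated = []
for _ in range(max_len):
    state = gru(token_embed(token), state)              # condition on the previous token
    token = to_vocab(state).argmax(dim=-1)              # greedy choice of the next text unit
    generated.append(int(token))
print(generated)                                        # ids of the predicted following text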
According to the embodiment of the disclosure, the first feature sequence is adjusted according to the hidden state features representing the semantics of the preceding text to obtain the second feature sequence, so that the second feature sequence can express not only the semantics of the text to be processed but also the semantics of the preceding text, improving its expressive power. By updating the hidden state features according to the second feature sequence, the hidden state features can then absorb the semantics of the sentences processed in real time; that is, the hidden state features have the ability to learn semantics in real time during text generation. In this way, during text generation, the existing knowledge can be continuously corrected according to the semantics learned in real time without gradient feedback or fine-tuning of network parameters, and no fine-tuning of the network parameters in the attention network is needed to adapt to different application scenarios. This helps improve the accuracy of the generated following text and the robustness of the text generation method.
In an embodiment, the encoding unit may further include a nonlinear processing layer as in a Transformer encoder, for performing nonlinear processing on the obtained second feature sequence. The feature sequence output by the encoding unit is then the feature sequence obtained after the nonlinear processing. Correspondingly, the encoding unit is further configured to perform nonlinear processing on the second feature sequence to obtain the output feature sequence. For example, a ReLU activation function or a GeLU activation function may be used to perform the nonlinear processing, which helps improve the robustness of the entire text generation method.
In an embodiment, the encoding unit may further comprise a regularization layer for regularizing the obtained second feature sequence. The feature sequence output by the coding unit is the feature sequence obtained after regularization processing. Correspondingly, the encoding unit is further configured to regularize the second feature sequence to obtain an output feature sequence. By regularizing the second feature sequence, the complexity of the decoding process can be reduced, and the text generation efficiency can be improved.
In an embodiment, the coding unit may include both a regularization layer and a nonlinear processing layer. The encoding unit may be configured to: and regularizing the second characteristic sequence, and then performing nonlinear processing on the characteristic sequence obtained by regularizing. The characteristic sequence output by the coding unit is the characteristic sequence after nonlinear processing.
In an embodiment, the second feature sequence may be added to the regularized feature sequence to avoid degradation problems, and the added features are then subjected to nonlinear processing. For example, for the i-th feature of the second feature sequence, its regularized version can be added to it using the following equation (3) to obtain the added feature, where LayerNorm() denotes the regularization operation.

[Equation (3) appears here as a formula image: the i-th feature of the second feature sequence plus its regularized (layer-normalized) version.]

After the added feature is obtained, the following equation (4) may be used to perform nonlinear processing on the added feature to obtain the output feature, where σ₁() is a nonlinear activation function and the remaining symbols are network parameters of the nonlinear layer.

[Equation (4) appears here as a formula image: nonlinear transformation of the added feature using σ₁ and the network parameters of the nonlinear layer.]
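Because equations (3) and (4) are given only as formula images, the following sketch shows one plausible reading of this post-processing, a residual addition of the layer-normalized second features followed by a GeLU nonlinearity; the patent's exact formulas may differ.

# Hedged sketch of the post-processing around equations (3) and (4): add the
# layer-normalized features back to the second features, then apply a nonlinear
# layer (GeLU here). This is a plausible reading, not the exact patented math.
import torch
import torch.nn as nn

dim, L = 16, 6
layer_norm = nn.LayerNorm(dim)
nonlinear = nn.Sequential(nn.Linear(dim, dim), nn.GELU())   # sigma_1 with its parameters

second_feature_sequence = torch.randn(L, dim)

added = second_feature_sequence + layer_norm(second_feature_sequence)   # equation (3) analog
output_feature_sequence = nonlinear(added)                              # equation (4) analog
print(output_feature_sequence.shape)                                    # torch.Size([6, 16])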
The implementation principle of operation S222 of adjusting the first feature sequence described above will be further extended and defined in connection with fig. 3.
Fig. 3 is a schematic diagram of adjusting a first feature sequence according to an embodiment of the present disclosure.
As shown in fig. 3, in an embodiment 300, the encoding network may determine an adjustment 303 corresponding to the first feature sequence, for example, from the hidden state feature 301 and the input feature sequence 302. The first feature sequence 304 is then adjusted according to the adjustment 303 to obtain a second feature sequence 305. For example, the encoding network may fuse the hidden state features 301 with the input feature sequence 302 such that the fused features may characterize the long-term semantics (which may be understood as global semantics) of the text to be processed, i.e. the semantics of the text to be processed determined from the semantics of the preceding text. The embodiment may take the long term semantics as an adjustment amount for the first feature sequence. Since the first feature sequence is obtained by processing only the input feature sequence, the first feature sequence may characterize short-term semantics (which may be understood as local semantics) of the text to be processed. Therefore, the adjusted second feature sequence not only can represent the short-term semantics of the text to be processed, but also can represent the long-term semantics of the text to be processed, and the expression capacity of the second feature sequence is improved.
For example, the hidden state feature 301 and the input feature sequence 302 may be subjected to a concat () connection, and then the characteristics resulting from the connection may be subjected to linear processing, thereby resulting in characteristics that characterize the long-term semantics of the text to be processed.
For example, for the encoding unit arranged at the j-th position, given the current hidden state feature and the i-th feature in the input feature sequence, this embodiment may also achieve fusion of the two by computing the inner product of the hidden state feature 301 and the i-th feature of the input feature sequence; that is, the computed inner product serves as the feature characterizing the long-term semantics of the text to be processed. On this basis, the feature characterizing the long-term semantics of the text to be processed can be added to the first feature sequence to obtain the second feature sequence. For example, for the feature arranged at the i-th position in the first feature sequence, the corresponding feature in the second feature sequence may be calculated using the following equation (5), where t denotes the order of the text to be processed in the text sequence processed by the text generation method. That is, the text to be processed is the t-th text processed by the text generation method, and the current hidden state feature is the updated hidden state feature obtained by the encoding unit arranged at the j-th position in the process of generating the following text for the (t-1)-th text to be processed.

[Equation (5) appears here as a formula image: the i-th feature of the second feature sequence, obtained by adding the fusion of the hidden state feature and the i-th input feature to the i-th feature of the first feature sequence.]

For example, the initial hidden state feature W_1^j may be set to a preset initial value, for example any initial value such as a zero matrix, which is not limited in the present disclosure.
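As a non-limiting illustration of the adjustment around equation (5), the sketch below stores the hidden state as a matrix initialized to zeros and reads the "inner product" with each input feature as a matrix-vector product whose result is added to the first feature sequence; this reading is an assumption, since equation (5) itself appears only as a formula image.

# Hedged sketch of the adjustment of equation (5): fuse the hidden state with
# each input feature and add the result to the first feature sequence.
import torch

dim, L = 16, 6
hidden_state_W = torch.zeros(dim, dim)              # W_1^j: zero-matrix initial value
input_feature_sequence = torch.randn(L, dim)        # features entering the j-th unit
first_feature_sequence = torch.randn(L, dim)        # self-attention output

# Long-term semantic feature for every position, then added as the adjustment amount.
long_term = input_feature_sequence @ hidden_state_W.T
second_feature_sequence = first_feature_sequence + long_term
print(second_feature_sequence.shape)                # torch.Size([6, 16])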
The implementation principle of the above-described operation S223 of updating the hidden state feature will be further extended and defined in conjunction with fig. 4.
Fig. 4 is a schematic diagram of updating hidden state features according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, updating of hidden state features may be performed according to bio-plasticity principles. For example, the input signature sequence of the encoding network may be analogized to a presynaptic neuronal state and the second signature sequence may be analogized to a postsynaptic neuronal state. An update amount of the hidden state is determined from the pre-synaptic neuron state and the post-synaptic neuron state, the update amount being indicative of a connection between the preceding text and the text to be processed. By the method, the hidden state can be continuously updated, and automatic learning without manual intervention can be realized without relying on guidance of a loss function. And the updating principle of the hidden state features can be more attached to a biological mechanism, so that the updating accuracy is improved.
As shown in fig. 4, in this embodiment 400, the input feature sequence 401 is set to include first text features corresponding to the text units. If the text to be processed is segmented into L text units, the input feature sequence includes L first text features. Accordingly, the first feature sequence obtained by encoding and the second feature sequence 402 obtained by adjustment according to the hidden state features each include L features corresponding to the L text units. The embodiment may take the L features included in the second feature sequence 402 as L second text features.
In this embodiment 400, the amount of update for each text unit may be determined based on the first text feature and the second text feature for that text unit. Thus, L update amounts corresponding to L text units can be obtained in total.
For example, the embodiment may cross-multiply the first text feature with the second text feature and determine the update amount based on the resulting feature. The size of the cross-multiplied feature may, for example, be the same as the size of the hidden state feature to be updated. For example, for the i-th text unit among the L text units, let the first text feature corresponding to the i-th text unit in the input feature sequence of the j-th encoding unit be denoted H_i^j, and the second text feature corresponding to the i-th text unit in the second feature sequence be denoted Y'_i^j. The adjustment amount ΔW_i^j for the i-th text unit may then be calculated, for example, by the following equation (6), in which the remaining symbols are network parameters of the j-th encoding unit.

[Equation (6) appears here as a formula image: update amount ΔW_i^j computed from the cross product of the first text feature H_i^j and the second text feature Y'_i^j together with network parameters of the j-th encoding unit.]
In an embodiment, Hebb's rule may also be used to process the second text feature and the first text feature of each text unit to obtain the update amount for that text unit. For example, the encoding unit arranged at the j-th position may calculate the adjustment amount ΔW_i^j for the i-th text unit using the following equation (7), in which the remaining symbols are all network parameters of the encoding unit arranged at the j-th position.

[Equation (7) appears here as a formula image: Hebbian update amount ΔW_i^j computed from the first text feature H_i^j, the second text feature Y'_i^j, and network parameters of the j-th encoding unit.]

If the position of the encoding unit among the N encoding units is not considered, the superscript j in equation (7) can be dropped, so that H_i denotes the first text feature of the i-th text unit input to each encoding unit and Y'_i denotes the second text feature of the i-th text unit obtained by each encoding unit.

It will be appreciated that the adjustment amount ΔW_i^j for the i-th text unit may also be calculated using a formula containing only at least one of the first right-hand term and the second to fourth right-hand terms of equation (7); the present disclosure is not limited in this regard.
By using Hebb's rule to determine the update amount, the update of the hidden state features better conforms to biological principles and is more realistic, which helps improve the accuracy of the generated following text.
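As a non-limiting illustration of the update amounts of equations (6) and (7), the sketch below reads the cross-multiplication of the first text feature H_i with the second text feature Y'_i as an outer product, which has the same size as the hidden state matrix; the additional Hebbian terms and network parameters of equation (7) are omitted, so only the core idea is shown.

# Hedged sketch of the per-unit update amount: outer product of the second text
# feature with the first text feature, one update amount per text unit.
import torch

dim, L = 16, 6
H = torch.randn(L, dim)                 # first text features (input feature sequence)
Y2 = torch.randn(L, dim)                # second text features (second feature sequence)

# Each update amount has shape (dim, dim), matching the hidden state matrix.
update_amounts = torch.einsum("ld,le->lde", Y2, H)
print(update_amounts.shape)             # torch.Size([6, 16, 16])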
Subsequently, after obtaining L update amounts, the embodiment 400 may update the hidden state feature according to the L update amounts. For example, the weighted sum of the L update amounts may be taken as the total update amount 403, and then the total update amount 403 and the hidden state feature 404 may be added, thereby completing the update of the hidden state feature, and obtaining the updated hidden state feature 405. The weights used in calculating the weighted sum can be obtained by pre-training, for example, as network parameters of the coding unit. Alternatively, the weight used in calculating the weighted sum may be a value set according to actual requirements, such as 1/L, which is not limited in the present disclosure.
In one embodiment, the principle of dopamine neurons may be used to determine the weights to be used in weighting, and the weights may be used as learning rates. By this principle, the signals of different areas in the second text feature can be integrated, thereby increasing the plasticity of the hidden state feature. For example, the embodiment may perform a nonlinear process on the second text feature corresponding to each text unit, resulting in a learning rate for each text unit. Finally, the hidden state feature is updated according to the learning rate and the update amount.
For example, for the i-th text unit, the encoding unit arranged at the j-th position may obtain the learning rate using the following equation (8), in which Y'_i^j is the second text feature of the i-th text unit in the second feature sequence, the remaining symbols are network parameters of the encoding unit arranged at the j-th position, and σ₂() is a nonlinear activation function similar to σ₁().

[Equation (8) appears here as a formula image: learning rate for the i-th text unit obtained by applying σ₂ and network parameters to the second text feature Y'_i^j.]

According to the embodiment of the present disclosure, after the learning rate is obtained, the embodiment 400 can update the hidden state feature according to the learning rate and the update amount.

For example, the weighted update amount for each text unit may be determined based on the learning rate and the update amount for that text unit, and the hidden state feature is then updated according to the L weighted update amounts for the L text units. For example, the learning rate for each text unit is used as the weight of that text unit's update amount, and the learning rate is multiplied by the update amount to obtain the weighted update amount. The sum of the weighted update amounts is taken as the total update amount 403, and the updated hidden state feature 405 is obtained by adding the total update amount 403 to the hidden state feature 404 to be updated. For example, for the encoding unit arranged at the j-th position, the updated hidden state feature can be calculated by the following equation (9).

[Equation (9) appears here as a formula image: updated hidden state feature obtained by adding the sum of the learning-rate-weighted update amounts to the hidden state feature to be updated.]
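As a non-limiting illustration of equations (8) and (9), the sketch below derives a scalar learning rate per text unit from its second text feature through a nonlinearity, weights each update amount by its learning rate, and adds the summed result to the hidden state; the concrete projection and the choice of sigmoid for σ₂ are assumptions, since the formulas appear only as images.

# Hedged sketch of the learning-rate-weighted hidden state update.
import torch
import torch.nn as nn

dim, L = 16, 6
to_rate = nn.Linear(dim, 1)                         # parameters feeding sigma_2

Y2 = torch.randn(L, dim)                            # second text features
update_amounts = torch.randn(L, dim, dim)           # from equation (6)/(7)
hidden_state_W = torch.zeros(dim, dim)

learning_rates = torch.sigmoid(to_rate(Y2))         # shape (L, 1): one rate per text unit
total_update = (learning_rates.unsqueeze(-1) * update_amounts).sum(dim=0)
hidden_state_W = hidden_state_W + total_update      # equation (9) analog
print(hidden_state_W.shape)                         # torch.Size([16, 16])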
In an embodiment, a boundary may be set for the elements of the hidden state feature to avoid excessively large element values, which would bring computational risk to the text generation method and make the update of the hidden state feature unreasonable. In this way, the accuracy with which the hidden state feature expresses semantics can be improved.

For example, when updating the hidden state feature according to the plurality of weighted update amounts, the embodiment may first adjust the hidden state feature according to the sum of the plurality of weighted update amounts, i.e., adjust the hidden state feature using equation (9) above and take the resulting feature as the adjusted state feature. A boundary function is then applied to the adjusted state feature to obtain the updated hidden state feature. For example, with the boundary value set to H_W, equation (9) for obtaining the updated hidden state feature may be rewritten as the following equation (10), where BoundedDecay() is the boundary function.

[Equation (10) appears here as a formula image: updated hidden state feature obtained by applying the boundary function BoundedDecay() with boundary value H_W to the adjusted state feature.]

For example, the embodiment may compare each element of the adjusted state feature with the boundary value; if the value of an element exceeds the boundary value, the boundary value is assigned to that element, and otherwise the element's value is retained.

In an embodiment, when the adjusted state feature is updated with the boundary function, for example when the adjusted state feature contains a target element exceeding a predetermined boundary, the target element may also be updated according to a predetermined forgetting rate. For example, the difference between 1 and the predetermined forgetting rate may be used as an adjustment coefficient, the adjustment coefficient may be multiplied by the value of the target element, and the product assigned to the target element.

For example, the boundary function can also be expressed by the following equation (11).

[Equation (11) appears here as a formula image: boundary function applied to an input a, scaling elements exceeding the boundary by (1 - p), where p is the predetermined forgetting rate.]

This embodiment may substitute the adjusted state feature obtained in equation (10) for a in equation (11) to complete the update of the adjusted state feature and obtain the updated hidden state feature. The predetermined forgetting rate p may be set according to actual requirements, for example to 0.05, which is not limited in the present disclosure.
The principles of the text generation method provided by the present disclosure will be further expanded and defined below in connection with fig. 5.
Fig. 5 is an implementation schematic diagram of a text generation method according to an embodiment of the present disclosure.
As shown in fig. 5, this embodiment 500 may employ a text generation model including a preprocessing network 510, an attention network 520, and a decoding network 530 to perform a text generation method.
For the text to be processed 501, the text to be processed 501 may be input into the preprocessing network 510 first, and the embedded feature sequence 502 may be output by the preprocessing network 510. The preprocessing network includes the embedded network described above.
The attention network 520 is composed of N encoding units connected in sequence. Each encoding unit includes a multi-head self-attention layer 521, a feature adjustment layer 522, a hidden state update layer 523, and a superposition & regularization layer 524. In order to reduce the complexity of the model calculation, the attention network 520 may also be provided with a regularization layer 525, for example, before the N encoding units, for regularizing the embedded feature sequence 502 using equation (2) described above.
Accordingly, after the embedded feature sequence 502 is obtained, the embedded feature sequence 502 may first be input to the regularization layer 525. Subsequently, the feature sequence output by the regularization layer 525 is input to the first of the N sequentially connected encoding units. The multi-head self-attention layer 521 in the j-th encoding unit is configured to encode the input feature sequence using equation (1) described above to obtain the first feature sequence described above. For example, the multi-head self-attention layer 521 may be used to perform operation S221 described above.
The hidden state update layer 523 is used to store and update the hidden state features. The first feature sequence may be input to the hidden state update layer 523 and the feature adjustment layer 522. Meanwhile, the hidden state features stored by the hidden state update layer 523 may be input to the feature adjustment layer 522 in synchronization with the first feature sequence. Note that the hidden state feature input to the feature adjustment layer 522 is the hidden state feature stored before being updated according to the first feature sequence. The feature adjustment layer 522 is configured to adjust the first feature sequence according to the input hidden state feature to obtain the second feature sequence. For example, the feature adjustment layer 522 may be configured to perform operation S222 described above and may adjust the first feature sequence using equation (5) described above. Meanwhile, the hidden state update layer 523 is configured to update the hidden state feature according to the first feature sequence. For example, the hidden state update layer 523 may be used to perform operation S223 described above and may update the hidden state using equation (9) described above.
The superposition & regularization layer 524 may be configured to regularize the second feature sequence using equation (3) described above and add the feature sequence obtained by the regularization to the first feature sequence. The superposition & regularization layer 524 may also be used to perform nonlinear processing on the summed features using equation (4) described above to obtain the output feature sequence. The feature sequence output by the last of the N encoding units may be used as the text feature sequence 503.
The decoding network 530 is used to decode the input text feature sequence 503 to output the following text 504.
It is understood that the attention network may be constructed based on a Transformer encoder. The encoding units in the attention network differ from the encoding units in the Transformer encoder in that the encoding units in the attention network are provided with a feature adjustment layer 522 and a hidden state update layer 523. Therefore, the attention network can continuously correct and complement the semantics it has learned during encoding, and gains the ability to capture long-term memory without adjusting its network parameters.
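As a non-limiting illustration, the sketch below assembles one encoding unit in the spirit of Fig. 5: a multi-head self-attention layer (521), a feature adjustment from a stored hidden state (522), a Hebbian-style hidden state update (523), and superposition & regularization followed by a nonlinearity (524). Every concrete formula is an assumption standing in for equations (1) to (9), which appear only as images.

# Hedged sketch of one encoding unit wired as in Fig. 5. All formulas here are
# plausible stand-ins, not the patented equations.
import torch
import torch.nn as nn


class EncodingUnit(nn.Module):
    def __init__(self, dim=16, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.to_rate = nn.Linear(dim, 1)
        self.register_buffer("hidden_state", torch.zeros(dim, dim))   # W_1^j = 0

    def forward(self, inputs):                                        # inputs: (1, L, dim)
        first, _ = self.attn(inputs, inputs, inputs)                  # layer 521
        adjustment = inputs @ self.hidden_state.T                     # layer 522
        second = first + adjustment
        with torch.no_grad():                                         # layer 523
            rates = torch.sigmoid(self.to_rate(second))
            update = torch.einsum("blq,ble,blk->ek", rates, second, inputs)
            self.hidden_state += update
        return self.ffn(second + self.norm(second))                   # layer 524


if __name__ == "__main__":
    unit = EncodingUnit()
    out = unit(torch.randn(1, 6, 16))
    print(out.shape, unit.hidden_state.abs().sum() > 0)

Because the hidden state is kept as a buffer on the unit, successive calls on successive texts keep refining it without any change to the network parameters, which mirrors the long-term-memory behaviour described above.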
Based on the implementation principle of this embodiment, the calculation performed by the encoding unit arranged at the j-th position for the i-th text unit may be represented, for example, by the following equation (12).

[Equation (12) appears here as a formula image: per-text-unit computation of the j-th encoding unit.]

It can be understood that the calculation principle of all N encoding units is equation (12); they differ only in the value of the superscript j. If it is not considered which text unit is being processed, the computation of the encoding unit arranged at the j-th position for the t-th text X to be processed can be expressed by the following equation (13).

[Equation (13) appears here as a formula image: per-text computation of the j-th encoding unit for the t-th text to be processed.]

For the whole attention network, the calculation principle can be expressed by the following equation (14), in which the output is the text feature sequence produced for the t-th text to be processed.

[Equation (14) appears here as a formula image: overall computation of the attention network for the t-th text to be processed, producing the text feature sequence.]
when the text generation method of the embodiment of the disclosure is applied to the intelligent voice interaction system, the intelligent voice interaction system can memorize the preference of the user through the intelligent voice interaction with the user, and therefore the method is beneficial to providing more real and reliable voice response for the user in the follow-up voice interaction. For example, if the user is in the history of the intelligent voice interaction system voice interaction, the first voice is provided "the highest mountain in the world is himalaya mountain, altitude is 8848 meters". After a period of time following the provision of the first voice, a second voice is provided, "himalayan mountain is changed due to the geological movement height, and 8850 meters. Then after providing the second voice, if the user provides an inquiry voice "how high the highest mountain in the world is," the intelligent voice interactive system can generate the text "8850 meters" from Himalayan mountain, and convert the text into voice for playing.
In order to facilitate implementation of the text generation method provided by the present disclosure, the present disclosure also provides a training method of a text generation model. This training method will be described in detail below in connection with fig. 6.
Fig. 6 is a flow diagram of a training method of a text generation model according to an embodiment of the present disclosure.
As shown in fig. 6, the training method 600 of the text generation model of this embodiment may include operations S610 to S640. Wherein, in operation S620, each encoding unit in the attention network may be configured to perform operations S621 to S623. The text generation model may include a preprocessing network, an attention network, and a decoding network. Wherein the attention network may be constituted by the coding units.
In one embodiment, the text generation model may employ the model structure shown in FIG. 5 as described above.
In operation S610, each target text in the text sequence is preprocessed by the preprocessing network to obtain an embedded feature sequence, where the embedded feature sequence includes embedded features corresponding to text units in each target text. The embodiment can take the text sequence as training text, take any text except the last text in the text sequence as target text, and take the next text of the any text as the following text true value of the any text. The text sequence may be obtained, for example, by a segmentation of a text segment.
The implementation principle of this operation S610 is similar to that of the operation S210 described above, and will not be described here again.
In operation S620, the embedded feature sequence is input into the attention network, resulting in a text feature sequence output by the attention network.
In operation S621, an input feature sequence is encoded using an attention mechanism, resulting in a first feature sequence.
In operation S622, the first feature sequence is adjusted according to the hidden status feature to obtain a second feature sequence. Wherein the hidden state features characterize the semantics of the preceding text of each target text.
In operation S623, the hidden state features are updated according to the second feature sequence.
It is to be understood that the implementation principle of the operation S620 may be similar to the implementation principle of the operation S220 described above, and the implementation principle of the operations S621 to S623 may be similar to the implementation principle of the operations S221 to S223 described above, respectively, and will not be repeated herein.
In operation S630, the text feature sequence is decoded using a decoding network to generate predicted following text for each target text. The implementation principle of this operation S630 may be similar to that of the operation S230 described above, and will not be described here again.
In operation S640, the text generation model is trained based on the predicted following text and the adjacent following text of each target text in the text sequence.
According to an embodiment of the present disclosure, the adjacent following text is the next text of each target text in the text sequence. The embodiment may determine a loss value of the text generation model according to the difference between the predicted following text and the adjacent following text, and train the text generation model with the goal of minimizing the loss value. For example, the difference between the predicted following text and the adjacent following text may be determined using the semantic similarity between the texts, the proportion of identical characters, and the like. For example, the generation accuracy may be employed to determine the loss value of the text generation model. The present disclosure does not limit the method of determining the difference and the loss value. It will be appreciated that a loss function may be designed that requires the predicted following text to approximate the adjacent following text, and the network parameters in the text generation model may be optimized by a gradient back-propagation algorithm.
In one embodiment, after the loss value is determined, a gradient back-propagation algorithm may be employed to determine the gradients of the network parameters in the text generation model.
In an embodiment, a gradient back-propagation algorithm and a predetermined cut-off position of the gradient back-propagation may be employed to determine the gradient of the loss value with respect to the network parameters in the text generation model. As can be seen from equation (14) above, the gradient would otherwise have to be propagated all the way back to M_1, so back-propagating through the entire history may take too long and require excessive GPU memory. To avoid this, the embodiment may preset the cut-off position of the gradient back-propagation. For example, if the gradient is set to propagate back only to M_{t-k}, then when determining the gradient of the loss value with respect to the network parameters in the text generation model, only the gradient involving M_{t-k} is calculated, and the gradients involving M_{t-k-1} through M_1 are not calculated. The embodiment then adjusts the network parameters of the text generation model according to the calculated gradients, with the goal of minimizing the calculated loss value, thereby training the text generation model.
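One common way to realize such a cut-off is truncated back-propagation: the hidden state is detached from the computation graph at the start of every window of k steps, so gradients never reach M_{t-k-1} through M_1. The sketch below illustrates this under stated assumptions (a toy recurrent stand-in and synthetic data); it is not the network of the disclosure.

# Minimal sketch of a predetermined gradient cut-off (truncated back-propagation).
import torch
import torch.nn as nn

class ToyRecurrentLM(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.cell = nn.GRUCell(dim, dim)          # stand-in for the hidden-state update
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids, hidden):
        hidden = self.cell(self.embed(token_ids), hidden)
        return self.head(hidden), hidden

model = ToyRecurrentLM()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch, k = 8, 4                                   # k marks the cut-off position
hidden = torch.zeros(batch, 64)
steps = [(torch.randint(0, 1000, (batch,)), torch.randint(0, 1000, (batch,)))
         for _ in range(12)]                      # synthetic (input, target) pairs

for start in range(0, len(steps), k):
    hidden = hidden.detach()                      # gradients stop here instead of at M_1
    window_loss = torch.zeros(())
    for tokens, targets in steps[start:start + k]:
        logits, hidden = model(tokens, hidden)
        window_loss = window_loss + loss_fn(logits, targets)
    optim.zero_grad()
    window_loss.backward()                        # back-propagates through at most k steps
    optim.step()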
Based on the text generation method provided by the disclosure, the disclosure also provides a text generation device. The device will be described in detail below in connection with fig. 7.
Fig. 7 is a block diagram of a text generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the text generating apparatus 700 of this embodiment includes a preprocessing module 710, a text feature obtaining module 720, and a feature decoding module 730. The text feature obtaining module 720 may include a coding sub-module 721, an adjusting sub-module 722, and an updating sub-module 723, among others.
The preprocessing module 710 is configured to preprocess a text to be processed to obtain an embedded feature sequence, where the embedded feature sequence includes embedded features corresponding to text units in the text to be processed. In an embodiment, the preprocessing module 710 may be used to perform the operation S210 described above, which is not described herein.
The feature decoding module 730 is configured to decode the text feature sequence to generate a text following the text to be processed. In an embodiment, the feature decoding module 730 may be configured to perform the operation S230 described above, which is not described herein.
The encoding submodule 721 is used for encoding the input feature sequence by adopting an attention mechanism aiming at the encoding unit to obtain a first feature sequence. In an embodiment, the encoding submodule 721 may be used to perform the operation S221 described above, which is not described herein.
The adjustment sub-module 722 is configured to adjust the first feature sequence according to the hidden state feature to obtain a second feature sequence, where the hidden state feature characterizes the semantics of the preceding text of the text to be processed. In an embodiment, the adjustment sub-module 722 may be used to perform the operation S222 described above, which is not described herein.
The update sub-module 723 is configured to update the hidden state feature according to the second feature sequence. In an embodiment, the update sub-module 723 may be used to perform operation S223 described above, and is not described herein.
The text feature obtaining module 720 is configured to input the embedded feature sequence into an attention network composed of coding units, and obtain a text feature sequence output by the attention network. In an embodiment, the text feature obtaining module 720 may be configured to perform the operation S220 described above, which is not described herein.
According to an embodiment of the present disclosure, the adjusting submodule includes: the adjustment amount determining unit is used for determining the adjustment amount corresponding to the first characteristic sequence according to the hidden state characteristics and the input characteristic sequence; and the adjusting unit is used for adjusting the first characteristic sequence according to the adjustment quantity to obtain a second characteristic sequence.
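As a minimal illustration of this two-step adjustment, the sketch below reads an adjustment amount out of the hidden state using the input feature sequence and adds it to the first feature sequence; the specific read-out (a plain matrix product) and all shapes are assumptions.

# Hypothetical adjust sub-module: an adjustment amount is computed from the
# hidden state and the input features, then added to the first feature sequence.
import torch

seq_len, dim = 6, 64
first_seq = torch.randn(seq_len, dim)     # first feature sequence (after attention)
input_seq = torch.randn(seq_len, dim)     # input feature sequence of the coding unit
hidden_state = torch.randn(dim, dim)      # hidden state feature (preceding-text memory)

adjustment = input_seq @ hidden_state     # adjustment amount, shape (seq_len, dim)
second_seq = first_seq + adjustment       # adjusted (second) feature sequence
print(second_seq.shape)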
According to an embodiment of the present disclosure, an input feature sequence includes a first text feature corresponding to a text unit; the second feature sequence includes a second text feature corresponding to the text unit; the update sub-module includes: an update amount determination unit configured to determine an update amount for the text unit based on the second text feature and the first text feature; and an updating unit for updating the hidden state feature according to the update amount.
According to an embodiment of the present disclosure, the update amount determination unit is configured to: process the second text feature and the first text feature by adopting Hebb's rule (Hebbian learning) to obtain the update amount for the text unit.
According to an embodiment of the present disclosure, the update amount determination unit is configured to obtain the update amount using the following formula:
[Formula image: the update amount ΔW_i is computed from y'_i and H_i via the network parameters of the coding unit]

wherein ΔW_i is the update amount for the i-th text unit in the text to be processed; y'_i is the second text feature corresponding to the i-th text unit; H_i is the first text feature corresponding to the i-th text unit; and W_A, W_B, W_C, W_D are network parameters of the coding unit.
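Because the formula itself is only available as an image, the sketch below shows one plausible Hebbian-style realization in which the update amount is the outer product of a projected second text feature and a projected first text feature; the particular combination of projections is an assumption, not the patented formula.

# Hypothetical Hebbian-style update amount: the update is driven by the
# co-occurrence (outer product) of the projected second and first text features,
# giving a matrix with the same shape as the hidden state it will later modify.
import torch

dim = 64
W_A = torch.randn(dim, dim) * 0.02        # assumed projection parameters
W_B = torch.randn(dim, dim) * 0.02

y_i = torch.randn(dim)                    # second text feature y'_i ("post-synaptic" signal)
h_i = torch.randn(dim)                    # first text feature H_i ("pre-synaptic" signal)

delta_W_i = torch.outer(W_A @ y_i, W_B @ h_i)
print(delta_W_i.shape)                    # (dim, dim) update amount for the i-th text unit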
According to an embodiment of the present disclosure, the update sub-module further includes: a learning rate determining unit configured to perform nonlinear processing on the second text feature to obtain a learning rate for the text unit, where the updating unit is configured to update the hidden state feature according to the learning rate and the update amount.
According to an embodiment of the present disclosure, a plurality of text units are included in the text to be processed. The updating unit includes: a weighted amount determining subunit configured to determine a weighted update amount for each text unit according to the learning rate and the update amount for the text unit; and an update subunit for updating the hidden state feature according to a plurality of weighted update amounts for the plurality of text units.
According to an embodiment of the present disclosure, the update subunit is configured to: adjusting the hidden state characteristics according to the sum of the weighted updating amounts to obtain adjusted state characteristics; and updating the adjusted state features by adopting the boundary function to obtain updated hidden state features.
According to an embodiment of the present disclosure, the update subunit is configured to: determining that the adjusted state feature is an updated hidden state feature in response to each element in the adjusted state feature being within a predetermined boundary; and in response to the adjusted state feature including a target element beyond a predetermined boundary, updating the target element according to a predetermined forgetting rate to obtain an updated hidden state feature.
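Putting the learning rate, the weighted sum of updates, the boundary function and the forgetting rate together, a hedged sketch might look as follows; the sigmoid nonlinearity, the boundary value and the forgetting rate are all assumed values for illustration.

# Hypothetical hidden-state update: per-unit learning rates weight the update
# amounts, their sum adjusts the hidden state, and a boundary check either keeps
# the adjusted state or shrinks out-of-range elements by a forgetting rate.
import torch

dim, n_units = 64, 6
hidden_state = torch.randn(dim, dim)
updates = torch.randn(n_units, dim, dim)          # update amount for each text unit
second_feats = torch.randn(n_units, dim)          # second text feature of each text unit
W_lr = torch.randn(dim, 1) * 0.02                 # assumed parameters of the nonlinearity

boundary = 5.0                                    # predetermined boundary (assumed)
forget_rate = 0.9                                 # predetermined forgetting rate (assumed)

lr = torch.sigmoid(second_feats @ W_lr)           # learning rate per text unit, shape (n_units, 1)
weighted_sum = (lr.unsqueeze(-1) * updates).sum(dim=0)
adjusted = hidden_state + weighted_sum            # adjusted state feature

if torch.all(adjusted.abs() <= boundary):
    new_hidden = adjusted                         # every element within the boundary
else:
    out_of_range = adjusted.abs() > boundary
    new_hidden = torch.where(out_of_range, adjusted * forget_rate, adjusted)
print(new_hidden.shape)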
According to an embodiment of the present disclosure, the attention network is constituted by a plurality of coding units connected in sequence, and the input feature sequence of the coding unit arranged at the head among the plurality of coding units includes the embedded feature sequence. The text feature obtaining module further includes a nonlinear processing sub-module for performing nonlinear processing on the second feature sequence to obtain an output feature sequence. The text feature sequence includes the feature sequence output by the coding unit arranged last among the plurality of coding units, and the input feature sequence of each coding unit other than the coding unit arranged at the head includes the feature sequence output by the preceding coding unit connected to it.
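For orientation, the sketch below chains a few stand-in coding units into an attention network in the manner just described: the first unit consumes the embedded feature sequence, each later unit consumes the previous unit's output, and the last unit's nonlinearly processed output serves as the text feature sequence. The classes, the attention module and the simplified Hebbian-style state update inside them are illustrative assumptions, not the network of the disclosure.

# Hypothetical chain of coding units forming the attention network.
import torch
import torch.nn as nn

class CodingUnit(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.register_buffer("hidden_state", torch.zeros(dim, dim))

    def forward(self, x):
        first_seq, _ = self.attn(x, x, x)                  # attention encoding
        second_seq = first_seq + x @ self.hidden_state     # adjust with the hidden state
        with torch.no_grad():                              # simplified Hebbian-style state update
            self.hidden_state += 0.01 * torch.einsum("bsd,bse->de", second_seq, x)
        return self.ffn(second_seq)                        # nonlinear output feature sequence

class AttentionNetwork(nn.Module):
    def __init__(self, n_units=3, dim=64):
        super().__init__()
        self.units = nn.ModuleList([CodingUnit(dim) for _ in range(n_units)])

    def forward(self, embedded_seq):
        feats = embedded_seq                               # first unit sees the embedded features
        for unit in self.units:
            feats = unit(feats)                            # later units see the previous output
        return feats                                       # text feature sequence

net = AttentionNetwork()
print(net(torch.randn(2, 6, 64)).shape)                    # torch.Size([2, 6, 64])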
Based on the training method of the text generation model provided by the disclosure, the disclosure also provides a training device of the text generation model. The device will be described in detail below in connection with fig. 8.
Fig. 8 is a block diagram of a training device of a text generation model according to an embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 800 for a text generation model of this embodiment includes a preprocessing module 810, a text feature obtaining module 820, a feature decoding module 830, and a model training module 840. The text feature obtaining module 820 may include a coding sub-module 821, an adjusting sub-module 822, and an updating sub-module 823, among others. The text generation model comprises a preprocessing network, an attention network and a decoding network; the attention network is constituted by coding units.
The preprocessing module 810 is configured to preprocess each target text in the text sequence by using a preprocessing network, so as to obtain an embedded feature sequence, where the embedded feature sequence includes an embedded feature corresponding to a text unit in each target text. In an embodiment, the preprocessing module 810 may be used to perform the operation S610 described above, which is not described herein.
The text feature obtaining module 820 is configured to input the embedded feature sequence into the attention network, and obtain the text feature sequence output by the attention network. In an embodiment, the text feature obtaining module 820 may be used to perform the operation S620 described above, which is not described herein.
The feature decoding module 830 is configured to decode the text feature sequence using a decoding network to generate predicted following text for each target text. In an embodiment, the feature decoding module 830 may be configured to perform the operation S630 described above, which is not described herein.
The model training module 840 is used to train the text generation model according to the predicted following text and the adjacent following text of each target text in the text sequence. In an embodiment, the model training module 840 may be used to perform the operation S640 described above, which is not described herein.
The coding sub-module 821 is configured to code an input feature sequence by using an attention mechanism with respect to the coding unit, so as to obtain a first feature sequence. In an embodiment, the encoding sub-module 821 may be used to perform the operation S621 described above, which is not described herein.
The adjustment sub-module 822 is configured to adjust the first feature sequence according to the hidden state feature to obtain a second feature sequence, where the hidden state feature characterizes the semantics of the preceding text of each target text. In an embodiment, the adjustment sub-module 822 may be used to perform the operation S622 described above, which is not described herein.
The update sub-module 823 is configured to update the hidden state feature according to the second feature sequence. In an embodiment, the updating sub-module 823 may be used to perform the operation S623 described above, which is not described herein.
According to an embodiment of the present disclosure, the model training module includes: a loss value determination sub-module for determining a loss value of the text generation model for each target text according to the difference between the predicted following text and the adjacent following text of each target text in the text sequence; a gradient determination sub-module for determining the gradient of the loss value with respect to the network parameters in the text generation model by using a gradient back-propagation algorithm and a predetermined cut-off position of the gradient back-propagation; and a training sub-module for training the text generation model according to the gradients of the network parameters with the goal of minimizing the loss value.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the personal information of users involved all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. In the technical solutions of the present disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or collected.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the text generation method and/or training method of the text generation model of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a text generation method and/or a training method of a text generation model. For example, in some embodiments, the text generation method and/or the training method of the text generation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text generation method and/or the training method of the text generation model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text generation method and/or the training method of the text generation model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (26)

1. A text generation method, comprising:
preprocessing a text to be processed to obtain an embedded feature sequence, wherein the embedded feature sequence comprises embedded features corresponding to text units in the text to be processed;
inputting the embedded feature sequence into an attention network formed by a coding unit to obtain a text feature sequence output by the attention network; and
decoding the text feature sequence, generating a subsequent text of the text to be processed,
Wherein the encoding unit is configured to:
encoding the input feature sequence by adopting an attention mechanism to obtain a first feature sequence;
adjusting the first characteristic sequence according to the hidden state characteristics to obtain a second characteristic sequence; the hidden state features characterize the semantics of the preceding text of the text to be processed; and
updating the hidden state feature according to the second feature sequence,
wherein the updated hidden state feature is used for adjusting a subsequent first feature sequence; the subsequent first feature sequence is obtained by encoding, with the attention mechanism, a feature sequence subsequently input to the encoding unit; and the subsequently input feature sequence is obtained by preprocessing a subsequent text of the text to be processed.
2. The method of claim 1, wherein said adjusting the first feature sequence according to the hidden state feature to obtain a second feature sequence comprises:
determining an adjustment amount corresponding to the first feature sequence according to the hidden state feature and the input feature sequence; and
and adjusting the first characteristic sequence according to the adjustment quantity to obtain the second characteristic sequence.
3. The method of claim 1, wherein the entered feature sequence includes a first text feature corresponding to the text unit; the second feature sequence includes a second text feature corresponding to the text unit; the updating the hidden state feature according to the second feature sequence includes:
determining an update amount for the text unit according to the second text feature and the first text feature; and
and updating the hidden state characteristics according to the updating quantity.
4. The method of claim 3, wherein the determining an update amount for the text unit from the second text feature and the first text feature comprises:
and processing the second text feature and the first text feature by adopting Hebb's rule to obtain the update amount for the text unit.
5. The method of claim 4, wherein said processing said second text feature and said first text feature using Hebb's rule to obtain said update amount for said text unit comprises: obtaining the update amount using the following formula:
[Formula image: the update amount ΔW_i is computed from y'_i and H_i via the network parameters W_A, W_B, W_C, W_D, using a cross-multiplication operation and a dot-product operation]

wherein ΔW_i is the update amount for the i-th text unit in the text to be processed; y'_i is the second text feature corresponding to the i-th text unit; H_i is the first text feature corresponding to the i-th text unit; W_A, W_B, W_C, W_D are network parameters of the coding unit; and the cross-multiplication and dot-product operator symbols are defined in the original formula image.
6. A method according to claim 3, wherein said updating said hidden state features according to said second sequence of features further comprises:
nonlinear processing is carried out on the second text characteristics, and the learning rate aiming at the text units is obtained; and
and updating the hidden state features according to the learning rate and the updating quantity.
7. The method of claim 6, wherein the text to be processed includes a plurality of text units therein; the updating the hidden state feature according to the learning rate and the update amount includes:
determining a weighted update amount for the text unit based on the learning rate and the update amount for the text unit; and
updating the hidden state feature according to a plurality of weighted update amounts for a plurality of the text units.
8. The method of claim 7, wherein said updating said hidden state feature according to a plurality of weighted update amounts for a plurality of said text units comprises:
According to the sum of the weighted updating amounts, adjusting the hidden state characteristics to obtain adjusted state characteristics; and
and updating the adjusted state characteristics by adopting a boundary function to obtain updated hidden state characteristics.
9. The method of claim 8, wherein the updating the adjusted state features with a boundary function to obtain updated hidden state features comprises:
determining that the adjusted state feature is the updated hidden state feature in response to each element in the adjusted state feature being within a predetermined boundary; and
and in response to the adjusted state feature comprising the target element exceeding the preset boundary, updating the target element according to a preset forgetting rate to obtain an updated hidden state feature.
10. The method of claim 1, wherein the attention network is composed of a plurality of coding units connected in sequence; the input feature sequence of the coding unit arranged at the head among the plurality of coding units comprises the embedded feature sequence; and the encoding unit is further configured to:
performing nonlinear processing on the second characteristic sequence to obtain an output characteristic sequence,
wherein the text feature sequence comprises the feature sequence output by the coding unit arranged last among the plurality of coding units; and the input feature sequence of each coding unit other than the coding unit arranged at the head comprises the feature sequence output by the preceding coding unit connected to that coding unit.
11. A training method of a text generation model, wherein the text generation model comprises a preprocessing network, an attention network and a decoding network; the attention network is composed of coding units; the method comprises the following steps:
preprocessing each target text in the text sequence by adopting the preprocessing network to obtain an embedded feature sequence, wherein the embedded feature sequence comprises embedded features corresponding to text units in each target text;
inputting the embedded feature sequence into the attention network to obtain a text feature sequence output by the attention network;
decoding the text feature sequence by adopting the decoding network to generate a predicted following text of each target text; and
training the text generation model according to the predicted following text and the adjacent following text of each target text in the text sequence;
wherein the encoding unit is configured to perform the following operations:
encoding the input feature sequence by adopting an attention mechanism to obtain a first feature sequence;
adjusting the first characteristic sequence according to the hidden state characteristics to obtain a second characteristic sequence; the hidden state features characterize the semantics of the preceding text of each of the target texts; and
updating the hidden state feature according to the second feature sequence,
wherein the updated hidden state feature is used for adjusting a subsequent first feature sequence; the subsequent first feature sequence is obtained by encoding, with the attention mechanism, a feature sequence subsequently input to the encoding unit; and the subsequently input feature sequence is obtained by preprocessing the following text of said each target text in said text sequence.
12. The method of claim 11, wherein the training the text generation model according to the predicted following text and the adjacent following text of each of the target texts in the text sequence comprises:
determining a loss value of the text generation model for each target text according to the difference between the predicted following text and the adjacent following text of each target text in the text sequence;
determining the gradient of the loss value with respect to the network parameters in the text generation model by adopting a gradient back-propagation algorithm and a predetermined cut-off position of the gradient back-propagation; and
and training the text generation model according to the gradient of the network parameter aiming at minimizing the loss value.
13. A text generation apparatus comprising:
the preprocessing module is used for preprocessing the text to be processed to obtain an embedded feature sequence, wherein the embedded feature sequence comprises embedded features corresponding to text units in the text to be processed;
the text feature obtaining module is used for inputting the embedded feature sequence into an attention network formed by the coding units to obtain a text feature sequence output by the attention network;
a feature decoding module for decoding the text feature sequence to generate a subsequent text of the text to be processed,
wherein, the characteristic sequence obtaining module comprises:
the coding sub-module is used for coding the input characteristic sequence by adopting an attention mechanism aiming at the coding unit to obtain a first characteristic sequence;
the adjusting sub-module is used for adjusting the first characteristic sequence according to the hidden state characteristics to obtain a second characteristic sequence; the hidden state features characterize the semantics of the preceding text of the text to be processed; and
An updating sub-module for updating the hidden state feature according to the second feature sequence,
wherein the updated hidden state feature is used for adjusting a subsequent first feature sequence; the subsequent first feature sequence is obtained by encoding, with the attention mechanism, a feature sequence subsequently input to the encoding unit; and the subsequently input feature sequence is obtained by preprocessing a subsequent text of the text to be processed.
14. The apparatus of claim 13, wherein the adjustment submodule comprises:
an adjustment amount determining unit configured to determine an adjustment amount corresponding to the first feature sequence according to the hidden state feature and the input feature sequence; and
and the adjusting unit is used for adjusting the first characteristic sequence according to the adjusting quantity to obtain the second characteristic sequence.
15. The apparatus of claim 13, wherein the entered feature sequence comprises a first text feature corresponding to the text unit; the second feature sequence includes a second text feature corresponding to the text unit; the update sub-module includes:
an update amount determining unit configured to determine an update amount for the text unit according to the second text feature and the first text feature; and
And the updating unit is used for updating the hidden state characteristics according to the updating quantity.
16. The apparatus of claim 15, wherein the update amount determination unit is configured to:
and processing the second text feature and the first text feature by adopting Hebb's rule to obtain the update amount for the text unit.
17. The apparatus of claim 16, wherein the update amount determination unit is configured to derive the update amount using the following formula:
[Formula image: the update amount ΔW_i is computed from y'_i and H_i via the network parameters W_A, W_B, W_C, W_D, using a cross-multiplication operation and a dot-product operation]

wherein ΔW_i is the update amount for the i-th text unit in the text to be processed; y'_i is the second text feature corresponding to the i-th text unit; H_i is the first text feature corresponding to the i-th text unit; W_A, W_B, W_C, W_D are network parameters of the coding unit; and the cross-multiplication and dot-product operator symbols are defined in the original formula image.
18. The apparatus of claim 15, wherein the update sub-module further comprises:
a learning rate determining unit, configured to perform nonlinear processing on the second text feature to obtain a learning rate for the text unit,
wherein the updating unit is used for: and updating the hidden state features according to the learning rate and the updating quantity.
19. The apparatus of claim 18, wherein the text to be processed includes a plurality of text units therein; the updating unit includes:
a weighted amount determination subunit configured to determine a weighted update amount for the text unit according to the learning rate and the update amount for the text unit; and
and the updating subunit is used for updating the hidden state characteristics according to a plurality of weighted updating amounts for a plurality of text units.
20. The apparatus of claim 19, wherein the update subunit is to:
according to the sum of the weighted updating amounts, adjusting the hidden state characteristics to obtain adjusted state characteristics; and
and updating the adjusted state characteristics by adopting a boundary function to obtain updated hidden state characteristics.
21. The apparatus of claim 20, wherein the update subunit is to:
determining that the adjusted state feature is the updated hidden state feature in response to each element in the adjusted state feature being within a predetermined boundary; and
and in response to the adjusted state feature comprising the target element exceeding the preset boundary, updating the target element according to a preset forgetting rate to obtain an updated hidden state feature.
22. The apparatus of claim 13, wherein the attention network is composed of a plurality of coding units connected in sequence; the input feature sequence of the coding unit arranged at the head among the plurality of coding units comprises the embedded feature sequence; and the text feature obtaining module further includes:
a nonlinear processing sub-module for performing nonlinear processing on the second characteristic sequence to obtain an output characteristic sequence,
wherein the text feature sequence comprises the feature sequence output by the coding unit arranged last among the plurality of coding units; and the input feature sequence of each coding unit other than the coding unit arranged at the head comprises the feature sequence output by the preceding coding unit connected to that coding unit.
23. A training device of a text generation model, wherein the text generation model comprises a preprocessing network, an attention network and a decoding network; the attention network is composed of coding units; the device comprises:
the preprocessing module is used for preprocessing each target text in the text sequence by adopting the preprocessing network to obtain an embedded feature sequence, wherein the embedded feature sequence comprises embedded features corresponding to text units in each target text;
The text feature obtaining module is used for inputting the embedded feature sequence into the attention network to obtain a text feature sequence output by the attention network;
the feature decoding module is used for decoding the text feature sequence by adopting the decoding network to generate a predicted subsequent text of each target text; and
the model training module is used for training the text generation model according to the predicted following text and the following text of each target text in the text sequence;
wherein, the text feature obtaining module includes:
the coding submodule is used for coding the input characteristic sequence by adopting an attention mechanism aiming at the coding unit to obtain a first characteristic sequence;
the adjusting sub-module is used for adjusting the first characteristic sequence according to the hidden state characteristics to obtain a second characteristic sequence; the hidden state features characterize the semantics of the preceding text of each of the target texts; and
the updating sub-module is used for updating the hidden state feature according to the second feature sequence, wherein the updated hidden state feature is used for adjusting a subsequent first feature sequence; the subsequent first feature sequence is obtained by encoding, with the attention mechanism, a feature sequence subsequently input to the coding unit; and the subsequently input feature sequence is obtained by preprocessing the following text of said each target text in said text sequence.
24. The apparatus of claim 23, wherein the model training module comprises:
a loss value determination sub-module for determining a loss value of the text generation model for each of the target texts according to a difference between the predicted following text and the following text of each of the target texts in the text sequence;
the gradient determining sub-module is used for determining the gradient of the loss value with respect to the network parameters in the text generation model by adopting a gradient back-propagation algorithm and a predetermined cut-off position of the gradient back-propagation; and
and the training sub-module is used for training the text generation model according to the gradient of the network parameter by taking the minimization of the loss value as a target.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
CN202211306837.XA 2022-10-24 2022-10-24 Text generation method and training method and device of text generation model Active CN115630651B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211306837.XA CN115630651B (en) 2022-10-24 2022-10-24 Text generation method and training method and device of text generation model

Publications (2)

Publication Number Publication Date
CN115630651A CN115630651A (en) 2023-01-20
CN115630651B true CN115630651B (en) 2023-06-02

Family

ID=84906404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211306837.XA Active CN115630651B (en) 2022-10-24 2022-10-24 Text generation method and training method and device of text generation model

Country Status (1)

Country Link
CN (1) CN115630651B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050465B (en) * 2023-02-09 2024-03-19 北京百度网讯科技有限公司 Training method of text understanding model, text understanding method and device
CN117807963B (en) * 2024-03-01 2024-04-30 之江实验室 Text generation method and device in appointed field

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10380236B1 (en) * 2017-09-22 2019-08-13 Amazon Technologies, Inc. Machine learning system for annotating unstructured text
CN111858914A (en) * 2020-07-27 2020-10-30 湖南大学 Text abstract generation method and system based on sentence-level evaluation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7082357B2 (en) * 2018-01-11 2022-06-08 ネオサピエンス株式会社 Text-to-speech synthesis methods using machine learning, devices and computer-readable storage media
CN113627135B (en) * 2020-05-08 2023-09-29 百度在线网络技术(北京)有限公司 Recruitment post description text generation method, device, equipment and medium
CN113836928B (en) * 2021-09-28 2024-02-27 平安科技(深圳)有限公司 Text entity generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115630651A (en) 2023-01-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant