CN112613307A - Text processing device, method, apparatus, and computer-readable storage medium - Google Patents

Text processing device, method, apparatus, and computer-readable storage medium

Info

Publication number
CN112613307A
CN112613307A
Authority
CN
China
Prior art keywords
output
external information
text
word
time step
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910894207.0A
Other languages
Chinese (zh)
Inventor
郭垿宏
刘天赏
郭心语
李安新
陈岚
池田大志
藤本拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Docomo Inc
NTT Korea Co Ltd
Original Assignee
NTT Korea Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NTT Korea Co Ltd filed Critical NTT Korea Co Ltd
Priority to CN201910894207.0A priority Critical patent/CN112613307A/en
Priority to JP2019209173A priority patent/JP2021051709A/en
Publication of CN112613307A publication Critical patent/CN112613307A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text processing device, a text processing method, a text processing apparatus and a computer-readable storage medium. The text processing apparatus includes: an encoding unit configured to encode a source text to obtain a source text encoding hidden state; a decoding unit configured to determine a decoding hidden state; and an output unit configured to determine an output word probability distribution according to external information, the source text encoding hidden state and the decoding hidden state, so as to determine output words.

Description

Text processing device, method, apparatus, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of text processing, and in particular, to a text processing apparatus, method, device, and computer-readable storage medium.
Background
In the existing text processing, such as text conversion, text generation, etc., the final text processing result can be obtained by processing the input source text.
In some cases, the user may specify some external information for the text processing process for more desirable results, such external information may be important information in the text specified by the user, or may be information of other text associated with the source text. In order to make such external information more likely to appear in the text processing result, a text processing method that sufficiently considers the external information in the text processing is required.
Disclosure of Invention
In order to fully consider external information in a text processing process, the present disclosure provides a text processing method, apparatus, device, and computer-readable storage medium.
According to an aspect of the present disclosure, there is provided a text processing apparatus including: an encoding unit configured to encode a source text to obtain a source text encoding hidden state; a decoding unit configured to determine a decoding hidden state; and an output unit configured to determine an output word probability distribution according to external information, the source text encoding hidden state and the decoding hidden state, so as to determine output words.
In some embodiments, the output unit is configured to: and determining the words with the probability greater than or equal to an output probability threshold value and belonging to the external information in the candidate output words as the candidate output words of the current time step according to the external information.
In some embodiments, the output unit is further configured to: determining the candidate probability of the candidate words based on the joint probability of the candidate output words of the current time step and the candidate sequence determined by the previous time step and the similarity between the candidate sequence determined by the previous time step and the external information, and determining the candidate words with the highest candidate probability in the preset number as the output words.
In some embodiments, the encoding unit is further configured to encode the external information to obtain an external information encoding hidden state; and the output unit is configured to determine the similarity between the external information encoding hidden state and the decoding hidden state, and when the similarity is greater than or equal to a similarity threshold of the current time step, the output unit outputs the external information as an output word.
In some embodiments, the output unit is further configured to: when the similarity is smaller than the current similarity threshold, the output unit determines a word with the highest probability in the output word probability distribution as an output word of a current time step, adjusts the similarity threshold of the current time step to determine an adjusted similarity threshold, wherein the adjusted similarity threshold is smaller than the similarity threshold of the current time step, and the adjusted similarity threshold is used as the similarity threshold of a next time step.
In some embodiments, the text processing apparatus further comprises: an attention generating unit configured to determine an attention distribution of a current time step based on external information, the source text encoding hidden state, and the decoding hidden state; the output unit is configured to determine an output word probability distribution according to the attention distribution, the source text encoding hidden state, and the decoding hidden state to determine an output word.
In some embodiments, the encoding unit and the decoding unit are trained by: encoding a training source text to obtain a training source text encoding hidden state; determining a training decoding hidden state; determining output words of the current time step according to external information, the training source text encoding hidden state and the training decoding hidden state; and adjusting parameters in the encoding unit and the decoding unit so as to minimize the difference between a training output word and a word included in the external information.
According to another aspect of the present disclosure, there is also provided a text processing method, including: encoding the source text to obtain a source text encoding hidden state; determining a decoding hidden state; and determining output word probability distribution according to the external information, the source text encoding hidden state and the decoding hidden state so as to determine output words.
In some embodiments, the method further comprises: and determining the words with the probability greater than or equal to an output probability threshold value and belonging to the external information in the candidate output words as the candidate output words of the current time step according to the external information.
In some embodiments, the method further comprises: determining the candidate probability of the candidate words based on the joint probability of the candidate output words of the current time step and the candidate sequence determined by the previous time step and the similarity between the candidate sequence determined by the previous time step and the external information, and determining the candidate words with the highest candidate probability in the preset number as the output words.
In some embodiments, the method further comprises: coding the external information to obtain an external information coding hidden state; and determining the similarity between the external information coding hidden state and the decoding hidden state, wherein when the similarity is more than or equal to the similarity threshold of the current time step, the output unit outputs the external information as an output word.
In some embodiments, the method further comprises: when the similarity is smaller than the current similarity threshold, the output unit determines a word with the highest probability in the output word probability distribution as an output word of a current time step, adjusts the similarity threshold of the current time step to determine an adjusted similarity threshold, wherein the adjusted similarity threshold is smaller than the similarity threshold of the current time step, and the adjusted similarity threshold is used as the similarity threshold of a next time step.
In some embodiments, the method further comprises: determining the attention distribution of the current time step according to the external information, the source text encoding hidden state and the decoding hidden state; wherein determining the output word probability distribution according to the external information, the source text encoding hidden state and the decoding hidden state to determine the output words comprises: determining the output word probability distribution according to the attention distribution, the source text encoding hidden state and the decoding hidden state so as to determine the output words.
According to still another aspect of the present disclosure, there is provided a text processing apparatus including: a processor; and a memory having computer-readable program instructions stored therein, wherein the text processing method as described above is performed when the computer-readable program instructions are executed by the processor.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a computer, cause the computer to perform the text processing method as described above.
By using the text processing method, the text processing device, the text processing equipment and the computer readable storage medium provided by the disclosure, in the text generation process, the attention distribution of the current time step is determined by using the external information and/or the output words of the current time step are determined according to the external information, so that the content of the external information can be effectively considered in the text processing process, the probability of generating the external information is improved in the text generation process, and the effect of generating the text under the condition of considering the external information is improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic block diagram of a text processing apparatus according to the present disclosure;
FIGS. 2A and 2B illustrate an exemplary embodiment of determining candidate output words from an output probability distribution according to an embodiment of the present application;
fig. 3A shows a schematic block diagram of an attention generating unit according to an embodiment of the present disclosure;
FIG. 3B shows an exemplary process for an attention generation unit to determine an attention distribution for a current time step according to an embodiment of the application;
FIG. 4 shows another schematic block diagram of an attention generating unit according to an embodiment of the present application;
FIG. 5 shows another illustrative embodiment of a text processing apparatus according to an embodiment of the present application;
FIG. 6 shows a schematic flow diagram of a text processing method according to the application;
FIG. 7 shows a schematic flow diagram of determining an attention distribution for a current time step from extrinsic information according to an embodiment of the application;
FIG. 8 shows another schematic flow diagram for determining an attention distribution for a current time step from extrinsic information according to an embodiment of the application;
FIG. 9 shows a schematic flow diagram of another text processing method according to an embodiment of the application;
FIG. 10 shows a schematic flow chart of yet another method of text processing according to an embodiment of the present application; and
FIG. 11 is a schematic diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely exemplary of some, and not all, of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without any inventive step, are intended to be within the scope of the present disclosure.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs. As used in this application, the terms "first," "second," and the like do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The principles of the present disclosure will be described below by way of example in the context of generating a text summary. However, it will be appreciated by those skilled in the art that the methods provided by the present disclosure may also be used in other text processing processes, such as text conversion processes, machine translation processes, and the like, without departing from the principles of the present disclosure.
Fig. 1 shows a schematic block diagram of a text processing apparatus according to the present disclosure. As shown in fig. 1, the text processing apparatus 100 may include an encoding unit 110, a decoding unit 120, an attention generating unit 130, and an output unit 140. The text processing apparatus 100 may be configured to perform text processing on the source text I to generate a corresponding text processing result. For example, a summary for the source text I may be generated with the text processing apparatus 100. Wherein the source text I may comprise at least one sentence, wherein each sentence comprises at least one word.
The text processing apparatus 100 provided by the present disclosure may be configured to receive external information and perform a text processing procedure for a source text according to the external information. In some embodiments, the external information refers to predefined textual information that is desired to appear in the processing result of the source text. In some examples, the external information may be at least one word or sentence in the source text. In other examples, the external information may be words or sentences at predetermined locations in the source text, such as the first sentence, the last sentence, or textual information at any other specified location of the source text. In still other examples, the external information may be additional text associated with the source text, such as the title of the source text. In one implementation, the external information may be additional text determined from user input. The present application does not limit the manner in which the external information is determined. In fact, any possible way of determining the external information to be used in the text processing may be used.
When a source text is processed by the text processing apparatus 100, the probability of appearance of external information in a text processing result can be increased by considering the external information at various stages of the text processing. For example, when an article title in a source text is determined to be external information, words and/or sentences in the article title will likely or certainly appear in a digest of the source text output by the text processing apparatus 100 provided by the present application.
When a computer is used to execute a text processing method, the computer often cannot directly process text data, so the source text and/or external information need to be converted into numerical data before processing.
In some embodiments, the source text I is implemented in the form of a natural language. In this case, the text processing apparatus 100 may further include a preprocessing unit (not shown). The pre-processing unit may be adapted to convert the source text into numerical data before the source text is input into the encoding unit. For example, each sentence in the source text I may be divided into a plurality of words by performing a word segmentation process on each sentence. Then, a plurality of words obtained by the word segmentation processing may be converted into word vectors of a specific dimension, respectively, by means of word embedding (word embedding), for example.
Similarly, the external information may also be converted to obtain at least one word vector corresponding to the external information for subsequent text processing.
In some embodiments, the source text I referred to in the present disclosure may also be implemented in the form of numerical data, for example, the source text I may be represented by using at least one word vector. In this case, the source text I may be processed directly with the encoding unit 110. The natural-language text may be preprocessed by a preprocessing means provided independently of the text processing apparatus 100.
Hereinafter, no distinction is made as to whether the external information and the source text are in the form of natural language or in the form of numerical data; when natural-language external information and/or source text needs to be processed using a computer, those skilled in the art may convert it into numerical data as needed.
The encoding unit 110 may be configured to encode the source text I to be processed to obtain the source text encoding hidden state h.
In some embodiments, the encoding unit 110 may be implemented as an encoding network. Exemplary encoding networks include Long Short Term Memory (LSTM) networks, which may be used by systems based on LSTM networks for tasks such as machine translation, text summarization, and the like. It will be appreciated that the encoding network may also be implemented as any machine learning model capable of encoding word vectors.
For example, the encoding unit may take at least one word vector corresponding to the source text I as input and output the source text encoding hidden states h_1, h_2, h_3, ... corresponding to the word vectors x_1, x_2, x_3, .... The number of source text encoding hidden states and the number of word vectors of the source text may be the same or different. For example, when k word vectors are generated from the source text I, the encoding unit may process the k word vectors to generate k corresponding source text encoding hidden states, where k is an integer greater than one.
The decoding unit 120 may be configured to determine a decoding hidden state s. In some embodiments, the decoding unit 120 may be configured to receive the decoding hidden state s_{t-1} of the previous time step t-1 and the output word x_t obtained by the text processing device at the previous time step, and to process s_{t-1} and x_t to obtain the decoding hidden state s_t of the current time step. In the processing of the first time step, s_0 and x_1 may be determined as default initial values. The decoding hidden state s may also comprise a plurality of decoding hidden states s_1, s_2, s_3, ... corresponding to the source text I.
In some embodiments, the decoding unit 120 may be implemented as a decoding network. Exemplary decoding networks include long and short term memory networks. It will be appreciated that the decoding network may also be implemented as any machine learning model capable of decoding the output of the encoding network.
In some embodiments, the encoding network and decoding network may be represented as a Sequence-to-Sequence (Seq 2Seq) model, which is used to enable the conversion of one input Sequence, such as "WXYZ" (e.g., as input text), into another output Sequence, such as "AXY" (e.g., as a text digest).
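By way of illustration only (this is not the patent's own implementation), an LSTM-based encoder-decoder of the kind described above can be sketched in PyTorch as follows; all layer sizes, class names and method names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal Seq2Seq skeleton: the encoder produces h_1..h_k, the decoder produces s_t."""
    def __init__(self, emb_dim=128, hid_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder_cell = nn.LSTMCell(emb_dim, hid_dim)

    def encode(self, src_embeddings):
        # src_embeddings: (batch, k, emb_dim) word vectors x_1..x_k of the source text
        # returns the source text encoding hidden states h_1..h_k: (batch, k, hid_dim)
        enc_states, _ = self.encoder(src_embeddings)
        return enc_states

    def decode_step(self, prev_word_embedding, prev_state):
        # prev_word_embedding: embedding of the output word of the previous time step
        # prev_state: (s_{t-1}, c_{t-1}); returns the decoding hidden state of the current time step
        s_t, c_t = self.decoder_cell(prev_word_embedding, prev_state)
        return s_t, c_t
```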
The attention generating unit 130 may be configured to determine an attention distribution a from the source text encoding hidden state h and the decoding hidden state s, and may output the attention distribution a for a subsequent text processing procedure of the current time step.
In some embodiments, the attention profile A for the current time step ttMay be the encoded attention distribution of the source text.
In some implementations, at each time step t, the coded attention distribution a^t of the source text for the current time step can be determined from the source text encoding hidden state h and the decoding hidden state s_t of the current time step. For example, the coded attention distribution a^t of the source text can be determined using equations (1) and (2):

a^t = softmax(e^t) (1)

where t denotes the current time step, softmax refers to the normalized exponential function, and e^t can be determined using equation (2) as:

e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn) (2)

where i is the index of the word vector, h_i is the source text encoding hidden state corresponding to the i-th word vector, v^T, W_h, W_s and b_attn are learning parameters to be trained, h is the source text encoding hidden state at the current time step, and s_t is the decoding hidden state at the current time step.
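A minimal sketch of equations (1) and (2), assuming the dimensions of the encoder-decoder sketch above; the module and parameter names mirror v^T, W_h, W_s and b_attn but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class SourceAttention(nn.Module):
    """Computes e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn) and a^t = softmax(e^t)."""
    def __init__(self, hid_dim=256):
        super().__init__()
        self.W_h = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_s = nn.Linear(hid_dim, hid_dim, bias=True)   # the bias term plays the role of b_attn
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, enc_states, s_t):
        # enc_states: (batch, k, hid_dim) source text encoding hidden states h_i
        # s_t: (batch, hid_dim) decoding hidden state of the current time step
        scores = self.v(torch.tanh(self.W_h(enc_states) + self.W_s(s_t).unsqueeze(1)))  # eq. (2)
        return torch.softmax(scores.squeeze(-1), dim=-1)                                # eq. (1)
```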
In other embodiments, the attention generating unit 130 may determine the attention distribution a_t of the source text according to formula (1), determine, according to the external information, an attention distribution A_t of the current time step containing the external information, and output the attention distribution A_t containing the external information for the processing of subsequent units.

In some implementations, the attention distribution A_t of the current time step containing the external information may be determined by adjusting the attention distribution a_t of the source text using the external information.

In other implementations, the attention distribution A_t of the current time step containing the external information may also include both the attention distribution a_t of the source text and the attention distribution a'_t of the external information.

The process of determining the attention distribution A_t containing the external information will be described below with reference to FIGS. 3A, 3B and 4, and is not repeated here.
The output unit 140 may be configured to determine an output word probability distribution according to the attention distribution a, the source text encoding hidden state h, and the decoding hidden state s to determine an output word O at the current time step.
The output word probability distribution may include a generation probability distribution P_vocab. The generation probability distribution P_vocab can be determined using formulas (3) and (4):

P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b') (3)

where V', V, b and b' are the learning parameters to be trained in the output unit, and h_t^* is a context vector determined according to the attention distribution a_t. For example, h_t^* can be determined using equation (4):

h_t^* = Σ_i a_i^t h_i (4)

where a_i^t is the i-th element of the attention distribution A_t output by the attention generating unit, and h_i is the source text encoding hidden state for the i-th word vector.
In some embodiments, the output word probability distribution may further include the attention distribution A_t output by the attention generating unit 130.

For example, the output word probability distribution may be determined as a weighted sum of the generation probability distribution and the attention distribution A_t.

In some implementations, the weight P_gen for combining the generation probability distribution and the attention distribution can be determined from the source text encoding hidden state of the current time step, the decoding hidden state, the attention distribution, and the input x_t of the decoding unit at the current time step.

For example, the weight P_gen used for the weighted summation of the generation probability distribution and the attention distribution can be expressed as formula (5):

P_gen = σ(w_h^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr) (5)

where σ denotes an activation function, e.g. the sigmoid function, w_h^T, w_s^T, w_x^T and b_ptr are training parameters, h_t^* is the quantity determined by equation (4) at time step t, s_t is the decoding hidden state at time step t, and x_t is the input of the decoding unit at time step t, i.e. the output of the output unit at the previous time step t-1. The weight P_gen determined in equation (5) may be implemented in the form of a scalar. By using the weight P_gen to compute a weighted average of the generation probability distribution P_vocab and the attention distribution A_t, the output word probability distribution is obtained.
In the case where the attention distribution A_t includes both the attention distribution a_t of the source text and the attention distribution a'_t of the external information, the weighting parameters for the attention distribution a_t of the source text and the attention distribution a'_t of the external information may be the same or different. The method of determining the weights for the generation probability distribution P_vocab, the attention distribution a_t of the source text and the attention distribution a'_t of the external information will be described with reference to FIG. 4 and is not repeated here.
In some embodiments, the output unit 140 may determine a word with the highest probability in the output word probability distribution as the output word at the current time step.
In other embodiments, the output unit 140 may further determine the output word of the current time step according to the external information and the output word probability distribution.

In some implementations, the output unit 140 may be configured to determine, according to the external information, words having a probability greater than or equal to an output probability threshold and belonging to the external information among the candidate output words as the candidate output words of the current time step. In some examples, the output unit 140 may determine the candidate output words using the principles of beam search.
For example, the output unit 140 may determine at least two words at each time step as candidate output words for the current time step, and then may use the candidate output words for a text processing procedure for the next time step. Similarly, at the next time step, the output unit 140 may also determine at least two candidate output words.
Specifically, taking the number of candidate output words as an example of 2, two candidate output words a, b may be output at time step t. The candidate output words a, b are then used in the text processing at the next time step, and candidate output words c, d at time step t +1 may be determined.
Fig. 2A and 2B illustrate an exemplary embodiment of determining candidate output words from an output probability distribution according to an embodiment of the present application.
In some embodiments, in determining the candidate output word at each time step, a predetermined number M of words (M is equal to 2 in the above example) having the highest output probability in the output probability distribution may be determined as the candidate output words. Wherein M is an integer of 2 or more.
Two words having the highest output probability in the output word probability distribution shown in fig. 2A are w3 and w11, and thus w3 and w11 can be determined as candidate output words.
In other embodiments, in determining the candidate output words at each time step, it may be determined according to a predefined manner that N words with the highest output probability are selected in the output probability distribution, and M words are determined from the N words as candidate output words. Wherein N is an integer greater than M. In some implementations, the value of N may be pre-specified.
In other implementations, an output probability threshold may be predetermined, and M words may be determined as candidate output words from among the N words having output probabilities greater than the output probability threshold.
If there is no word belonging to the external information among the N words having the highest output probability, the M words having the highest output probability among the N words may be determined as candidate output words.
If words belonging to the external information exist among the N words having the highest output probability, and the number n of such words is equal to or greater than M, the M words having the highest output probability and belonging to the external information among the N words may be determined as candidate output words. If the number n of words belonging to the external information among the N words is less than the predetermined number M, those n words together with the M-n words having the highest output probability among the remaining N-n words may be determined as candidate output words.
As shown in fig. 2B, the two words highest in the output word probability distribution are w3 and w11, and the words having an output probability greater than a preset output probability threshold include w3, w7, and w11, and w2 and w7 belong to external information.
In this case, since w7 belongs to the external information and the output probability of w7 is greater than the output probability threshold, w7 and w11 may be selected as candidate output words without selecting w3 whose output probability is higher.
In this way, the probability that a word in the external information is determined as an output word can be increased.
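One reading of the selection rule described above, as a plain-Python sketch; `external_ids`, the threshold value and the choices of M and N are illustrative assumptions.

```python
def select_candidates(probs, external_ids, M=2, N=5, prob_threshold=0.05):
    """Pick M candidate output words from the top-N, preferring words that belong to the
    external information and whose output probability is at least prob_threshold."""
    top_n = sorted(range(len(probs)), key=lambda w: probs[w], reverse=True)[:N]
    preferred = [w for w in top_n if w in external_ids and probs[w] >= prob_threshold]
    rest = [w for w in top_n if w not in preferred]
    # external-information words first, then the highest-probability remaining words
    return (preferred + rest)[:M]
```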
At least four candidate output sequences ac, ad, bc and bd can be determined by using the candidate output words a and b output by the time step t and the candidate output words c and d of the time step t +1, the output probability of each candidate output sequence can be determined in a joint probability mode, and two of the four candidate output sequences ac, ad, bc and bd with the highest output probability are used as candidate texts after the time step t + 1.
For example, the output probabilities of candidate output words a, b, c, d may be denoted P_a, P_b, P_c and P_d. The joint probabilities of the candidate output sequences ac, ad, bc, bd may then be expressed as P_ac = P_a * P_c, P_ad = P_a * P_d, P_bc = P_b * P_c and P_bd = P_b * P_d, respectively. If P_ac > P_ad > P_bc > P_bd, then at time step t+1 the sequences ac and ad are output for subsequent text processing.
In some embodiments, the candidate output sequences may also be determined based on external information. For example, a penalty value for the candidate output sequence may be determined using equation (6). The joint output probability of the candidate output sequences may be adjusted using the penalty value determined by equation (6).
s(x, y) = log P(y_t | x) + sim(y_<t, h) (6)

where P(y_t | x) represents the probability of the word output at time step t, h denotes the external information, and sim(y_<t, h) represents the similarity between the candidate text sequence generated before time step t and the external information.
In one implementation, any possible text similarity algorithm may be utilized to determine the similarity between the candidate text sequence generated before time step t and the external information. For example, the similarity between the candidate text sequence generated before the time step t and the external information may be determined using a cosine similarity method.
With equation (6) above, if the similarity between the candidate text sequence generated before time step t and the external information is higher, the penalty value will be used to increase the output probability of the candidate output sequence. In some implementations, the penalty value s(x, y) can be multiplied with or added to the output probability of the candidate output sequence, thereby achieving the effect of determining the candidate output sequence according to the similarity between the candidate text sequence generated before time step t and the external information.
That is, by determining the penalty value for the candidate output sequence as described above from the external information, the probability of the external information appearing in the candidate text sequence can be increased. It is thus possible to increase the probability that the external information appears in the finally output text processing result.
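A sketch of how equation (6) could be used to re-score beam-search candidates; cosine similarity over bag-of-words counts is only one illustrative choice for sim(·), and the function names are assumptions.

```python
import math

def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two bag-of-words dictionaries {word: count}."""
    dot = sum(bow_a[w] * bow_b[w] for w in set(bow_a) & set(bow_b))
    norm = math.sqrt(sum(v * v for v in bow_a.values())) * math.sqrt(sum(v * v for v in bow_b.values()))
    return dot / norm if norm else 0.0

def beam_score(log_p_word, prefix_words, external_words):
    """s(x, y) = log P(y_t | x) + sim(y_<t, h), cf. equation (6)."""
    bow = lambda words: {w: words.count(w) for w in set(words)}
    return log_p_word + cosine_similarity(bow(prefix_words), bow(external_words))
```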
In another implementation, the output unit may be configured to determine a similarity between the external information and the source text encoding hidden state, and determine a word to be output at the current time step according to the similarity between the external information and the source text encoding hidden state.
For example, the extrinsic information may be encoded by the encoding unit 110 to obtain an extrinsic information encoding hidden state.
The output unit 140 may be configured to determine a similarity of the extrinsic information encoding hidden state and the decoding hidden state. When the similarity between the external information coding hidden state and the decoding hidden state is larger than or equal to a predefined similarity threshold, the output unit outputs the external information as the output of the current time step.
In a case where the external information is a word, the external information may be output as a word of a current time step. In case the external information is a sentence, the external information may be inserted directly after the text sequence that has been generated before the current time step t.
It is to be understood that the text sequence that has been generated before the current time step t may be generated based on the word with the highest probability in the aforementioned output probability distribution, or may be generated according to several candidate words with the highest probability in the output probability distribution. The candidate words may be determined using the process described in the foregoing implementation, and will not be described herein.
When the similarity between the extrinsic information encoding hidden state and the decoding hidden state is smaller than the predefined similarity threshold, the output unit may determine an output word probability distribution at the current time step according to the results output by the decoding unit and the attention generating unit, and determine an output word at the current time step based on the output word probability distribution at the current time step.
With the above method, when the similarity between the result output by the decoding unit and the external information is high, the result output by the decoding unit can be directly replaced with the external information. That is, in this case, the text sequence determined after the output of the current time step is the text sequence determined at the previous time step with the external information inserted after it.
Then, when the next time step is processed, the decoding unit can be used to process the external information to obtain the decoding hidden state of the next time step, so that the subsequent decoding process can make use of the external information and thereby ensure semantic consistency between the results obtained by the subsequent decoding and the inserted external information.
In the case where the external information is a word, the decoded hidden state at the previous time step and the external information may be used as input of a decoding unit for processing, and the decoded hidden state at the current time step may be obtained.
In the case where the external information includes a plurality of words, the loop processing may be performed a plurality of times by the decoding unit. The input of the decoding unit in the first cycle is the decoding hidden state of the previous time step and the first word of the external information, and the input of the decoding unit in each subsequent cycle is the decoding hidden state obtained in the previous cycle and the next word of the external information. Each word in the external information can thus be processed through multiple cycles to obtain a decoding hidden state containing all of the external information as the decoding hidden state of the current time step.
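A sketch of the loop that feeds inserted multi-word external information through the decoding unit so that the decoding hidden state of the current time step reflects it; `decoder_cell` and `embed` are assumed to match the earlier encoder-decoder sketch.

```python
def absorb_external_information(decoder_cell, embed, state, external_words):
    """Run the decoder cell over every word of the inserted external information and return
    the resulting state as the decoding hidden state of the current time step."""
    for word in external_words:                  # one cycle per word of the external information
        s_t, c_t = decoder_cell(embed(word), state)
        state = (s_t, c_t)
    return state
```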
In some implementations, the comparison of the similarity of the extrinsic information encoding hidden state and the decoding hidden state described above is not performed any more after the extrinsic information has been inserted into the text processing result in place of the result output by the decoding unit.
In some examples, the similarity threshold may be implemented as a preset function with respect to time step t.
As described above, when the similarity between the extrinsic information encoding hidden state and the decoding hidden state is smaller than the predefined similarity threshold, the above-described operation of replacing the decoding unit with extrinsic information as an output is not performed, but an output result is determined according to the output word probability distribution. In this case, in order to increase the probability that external information appears in the final text processing result, the similarity threshold of the current time step may be adjusted to determine an adjusted similarity threshold, where the adjusted similarity threshold is smaller than the similarity threshold of the current time step, and the adjusted similarity threshold is used as the similarity threshold of the next time step.
For example, the similarity threshold may be adjusted using equation (7):
ε_{SIM,t+1} = ε_{SIM,t} * f(t) (7)

where ε_{SIM,t+1} is the similarity threshold for time step t+1, ε_{SIM,t} is the similarity threshold for time step t, and f(t) is a monotonically decreasing function of time t. For example, f(t) may be implemented as equation (8):

f(t) = e^(-t/k) (8)

where t is the current time step, k is the length of the source text, and e is the base of the natural logarithm. In some alternative examples, k may also be expressed as a function of the source text length. For example, k may be expressed as the product of β and the length of the source text, where β is a predefined parameter greater than zero and less than 1.
With the above method, by performing a monotonically decreasing adjustment of the similarity threshold value at each time step, even if the similarity between the extrinsic information and the output result of the decoding unit is low during text processing, the similarity threshold value can be lowered to a very low degree, so that the probability that the similarity between the extrinsic information and the output result of the decoding unit is greater than the similarity threshold value at the current time step increases. I.e. the probability that the external information appears in the final text processing result can be increased.
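Equations (7) and (8) in code form, as one possible sketch; the β parameter follows the alternative mentioned above.

```python
import math

def next_similarity_threshold(eps_t, t, source_len, beta=1.0):
    """eps_{t+1} = eps_t * f(t), with f(t) = exp(-t / k) and k = beta * source_len (eqs. (7), (8))."""
    k = beta * source_len
    return eps_t * math.exp(-t / k)
```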
With the text processing apparatus 100 provided by the present application, in the process of generating the text abstract, by determining the attention distribution of the current time step by using the external information and/or determining the output words of the current time step according to the external information, the content of the external information can be effectively considered in the process of text processing, and the probability of generating the external information is increased in the process of text generation, thereby improving the effect of generating the text in consideration of the external information.
In implementing the text processing device 100 provided in the present application, those skilled in the art can arbitrarily combine the above technical solutions. For example, in the process of text processing of a source text by the text processing apparatus 100, an attention profile containing external information may be generated from the external information for subsequent text processing using only the attention generating unit, and the external information is not considered in the subsequent text processing. For another example, the word to be output at the current time step may be determined from the external information only by the output unit, and the external information may not be considered in the previous encoding, decoding, and attention generation processes. For another example, the external information may be considered in both the process of the attention generation unit generating the attention distribution of the current time step and the process of the output unit determining the word to be output at the current time step, so as to further improve the possibility of including the external information in the text processing result.
Fig. 3A shows a schematic block diagram of an attention generating unit according to an embodiment of the present disclosure. With the attention generating unit 300 shown in fig. 3A, the attention distribution a_t of the source text can be adjusted using the external information, and an attention distribution A' containing the external information can be determined.
As shown in fig. 3A, the attention generating unit 300 may include a source text attention determining unit 310, a content selecting unit 320.
The source text attention determining unit 310 may be configured to determine the coded attention distribution a_t of the source text from the source text encoding hidden state and the decoding hidden state. In some embodiments, the coded attention distribution a_t of the source text may be determined using the foregoing equation (1).
The content selection unit 320 may be used to determine a selection probability for each word in the source text. In some embodiments, the content selection unit 320 may determine a selection probability distribution for the source text based on external information, the selection probability distribution including a selection probability for each word in the source text.
In some embodiments, the content selection unit 320 may process the source text I using a content selection network (e.g., an LSTM network) to determine a first selection probability for each word in the source text.

The content selection network used here can be trained with a reference text processing result ref (i.e., the text processing result of predetermined training data). In the training process of the content selection network, tag sequences generated according to the source text I and the reference text processing result ref may be input to the content selection network for processing. The tag sequence has the same length as the word sequence of the source text I, and the value of the i-th element of the tag sequence may be used to indicate whether the i-th word of the source text I belongs to the content of the reference text processing result ref. By training the content selection network using the above method, the content selection network can process the source text I and output, for each word in the source text, a first selection probability representing the probability that the word of the source text I is selected by the content selection network to appear in the final text processing result.
In some embodiments, for at least one word belonging to the external information in the source text, a selection probability of the at least one word may be determined at least as a predefined probability value λ. For example, the second selection probability of each word belonging to the external information in the source text may be determined as a predefined probability value λ, and the second selection probability of other words not belonging to the external information may be determined as 0.
The selection probability for each word in the source text may be determined based on the first selection probability and the second selection probability described above. For example, a selection probability for each word in the source text may be determined as the sum of the first selection probability and the second selection probability. It follows that for words belonging to external information, their selection probability will be greater than or equal to a predefined probability value λ.
Based on the selection probability distribution, the content selection unit 320 may be configured to determine, for each word in the source text, the attention of the word according to its selection probability, so as to obtain the attention distribution A. In one embodiment, the content selection unit 320 may be configured to set the attention for a word in the attention distribution of the current time step to zero when the selection probability of the word is lower than a preset selection probability threshold ε. Furthermore, the content selection unit 320 may be further configured to, when the selection probability of the word is greater than or equal to the preset selection probability threshold ε, take the attention for the word in the attention distribution of the current time step to be its attention in the coded attention distribution a_t of the source text.
With the above-described attention generation unit, a selection probability can be generated for each word in the source text, that is, at least both the magnitude of attention calculated using formula (1) and the selection probability of the word need to be considered in determining the attention of each word. When the selection probability of the word is lower than the preset selection threshold, the probability that the word appears in the current time step is considered to be low, and therefore, the attention of the word can not be considered in the subsequent text processing process.
The attention distribution determined with the content selection unit can be represented by equation (9):

a_j^select(x | y_{1:j-1}) = a_j(x | y_{1:j-1}), if the selection probability (q + λ·hint_tag) of x is greater than or equal to ε; 0, otherwise (9)

where a_j^select is the attention distribution of the current time step determined by content selection, x is the word currently to be output, j is the index of the current time step, y_{1:j-1} is the text sequence that has already been output, and a_j(x | y_{1:j-1}) is the coded attention distribution of the source text, which may be calculated by equations (1) and (2) above. q is the first selection probability and λ·hint_tag is the second selection probability, where for the i-th word, belonging to the external information, the value of the i-th term of q + λ·hint_tag is q_i + λ, and for the k-th word, not belonging to the external information, the value of the k-th term of q + λ·hint_tag is q_k.
By setting the selection probability of the words contained in the external information to be at least the predefined probability value lambda, if the predefined probability value is greater than the preset selection probability threshold epsilon, the words in the external information can be ensured not to be filtered in the step of content selection, and the words in the external information can enter the subsequent text processing process, so that the probability of the words in the external information appearing in the text processing result is improved. It will be appreciated that in some implementations, the predefined probability value λ may also be set to be less than or equal to a preset selection probability threshold ε. In this case, by determining the selection probability of each word as the sum of the first selection probability and the second selection probability described above, it is also possible to achieve the effect of increasing the selection probability of words in the external information and improving the probability of the words in the external information appearing in the text processing result.
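A sketch of the content-selection step described above: the first selection probability q from the content selection network is raised by λ for words belonging to the external information, and attention is zeroed wherever the resulting selection probability falls below ε. Variable names and default values are illustrative.

```python
import torch

def content_selected_attention(a_t, q, hint_tag, lam=0.5, eps=0.3):
    """a_t:      (k,) coded attention distribution of the source text (eqs. (1), (2))
    q:        (k,) first selection probability from the content selection network
    hint_tag: (k,) 1.0 for source words belonging to the external information, else 0.0"""
    selection_prob = q + lam * hint_tag        # second selection probability is lam for external words
    keep = (selection_prob >= eps).float()     # words below the threshold get zero attention
    return a_t * keep
```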
Fig. 3B shows an exemplary procedure for determining an attention profile for a current time step according to the attention generating unit shown in fig. 3A.
As shown in fig. 3B, taking four words in the source text as an example, the attention of the 1st and 3rd words can be selected by the content selection network for the subsequent text processing process.
Fig. 4 shows another schematic block diagram of an attention generating unit according to an embodiment of the present application. As shown in fig. 4, the attention generating unit 400 may include a source text attention determining unit 410 and an external information attention determining unit 420. The attention distribution A determined by the attention generating unit shown in fig. 4 includes both the attention distribution a_t of the source text and the attention distribution a'_t of the external information.
In some embodiments, the source text attention determination unit 410 may determine the coded attention parameter e_i^t for each word in the source text based on the source text encoding hidden state of the current time step and the decoding hidden state of the current time step using equation (2).
The extrinsic information attention determining unit 420 may be configured to determine an extrinsic attention parameter for each word in the source text, wherein an extrinsic attention parameter for a word belonging to extrinsic information is determined as a preset first extrinsic attention parameter and an extrinsic attention parameter for a word not belonging to extrinsic information is determined as a preset second extrinsic attention parameter. In some implementations, the first external attention parameter can be set to λ 'and the second external attention parameter can be set to 0, where λ' can be a value greater than 0.
For example, the attention parameter for each word may be determined by summing the coded attention parameter and the external attention parameter of that word. Then, the attention distribution of the source text for the current time step can be determined based on these attention parameters. For example, by applying the softmax function to the attention parameters, the attention of the current time step for each word of the source text can be obtained.
By the method, the attention parameter of the word belonging to the external information in the source text can be adjusted through the predefined external attention parameter, and the attention adjustment of the word belonging to the external information is further realized. It is understood that in the case where the first external attention parameter is set to the hyperparameter λ' greater than 0 and the second external attention parameter is set to 0, the attention distribution of each word of the source text may be adjusted based on the external information so that the attention of the word belonging to the external information is more important.
Although the principles of the present application have been described in the above examples with the first external attention parameter being λ' and the second external attention parameter being 0, it will be appreciated that the scope of the present application is not so limited. Those skilled in the art can set the external attention parameters for each word in the source text according to the actual situation, as long as the effect of making the attention of the words belonging to the external information more important can be achieved. For example, the first external attention parameter may be set to λ_1' and the second external attention parameter to λ_2', where λ_1' and λ_2' may be any real numbers as long as λ_1' > λ_2'.
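A sketch of this first variant of FIG. 4: the coded attention parameter of each source word is shifted by the external attention parameter (λ_1' for words in the external information, λ_2' otherwise) before the softmax; names and default values are assumptions.

```python
import torch

def adjusted_attention(e_t, is_external, lam1=1.0, lam2=0.0):
    """e_t:         (k,) coded attention parameters e_i^t from equation (2)
    is_external: (k,) 1.0 for words belonging to the external information, else 0.0"""
    external_param = lam1 * is_external + lam2 * (1.0 - is_external)  # first / second external attention parameter
    return torch.softmax(e_t + external_param, dim=-1)                # attention of the current time step
```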
In some embodiments, the source text attention determination unit 410 may be configured to determine the coded attention distribution a_t of the source text using the aforementioned formulas (1) and (2). The external information attention determination unit 420 may be configured to determine a coded attention distribution a'_t of the external information.
In this case, the extrinsic information may be encoded by the encoding unit 110 shown in fig. 1 to obtain an extrinsic information encoding hidden state h'. The extrinsic information attention unit 420 may determine the encoded attention distribution of the extrinsic information from the extrinsic information encoding hidden state h' and the decoding hidden state s.
For example, the coded attention distribution a 'of the external information may be determined by the above equations (1) and (2)'tWherein the source text encoding hidden state h in the formulas (1), (2) should be replaced with the extrinsic information encoding hidden state h'.
In some implementations, the encoded attention profile a 'of the external information is separately computed'tAnd the coded attention distribution a of the source texttIn this case, the calculation may be performed by using equations (1) and (2) of the shared parameter, that is, the coded attention distribution a 'of the external information'tAnd the coded attention distribution a of the source texttParameter v used at the timeT、Wh、WS、battnMay be identical. In other implementations, the encoded attention profile a 'used to compute the external information may also be trained separately'tAnd for calculating the coded attention distribution a of the source texttI.e. calculating the coded attention distribution a 'of the extrinsic information'tAnd the coded attention distribution a of the source texttParameter v used at the timeT、Wh、WS、battnMay be different.
The encoded attention distribution a'_t of the external information and the encoded attention distribution a_t of the source text generated by the attention generating unit 400 shown in fig. 4 may be further processed by the output unit shown in fig. 1 to determine the output word probability distribution for the current time step.
When the attention distribution A_t includes both the attention distribution a_t of the source text and the attention distribution a'_t of the external information, the output word probability distribution may be expressed as a weighted average of the generation probability distribution, the attention distribution a_t of the source text, and the attention distribution a'_t of the external information.
In some embodiments, the output word probability distribution may be determined by equation (10).
P(w) = P_generator · P_vocab(w) + P_pointer · Σ_{i: w_i = w} a_i^t + P_T · Σ_{i: w'_i = w} a'_i^t    (10)

where the generation probability distribution P_vocab may be determined by equation (3) based on the source text encoding hidden state, the decoding hidden state, and the encoded attention distribution of the source text, a_i^t denotes the attention of the i-th word in the encoded attention distribution of the source text, and a'_i^t denotes the attention of the i-th word in the encoded attention distribution of the external information. P_generator, P_pointer, and P_T are respectively the weights of the generation probability distribution P_vocab, the encoded attention distribution a_t of the source text, and the encoded attention distribution a'_t of the external information.
In some implementations, P_generator, P_pointer, and P_T may be determined according to the source text encoding hidden state of the current time step t, the decoding hidden state, the encoded attention distribution of the external information, and the output of the output unit at the last time step t-1.
For example, P_generator, P_pointer, and P_T may be determined according to equation (11), in which an activation function σ, such as a sigmoid function, is applied to linear combinations, with parameters to be trained, of h_t^*, s_t, x_t, and a'_t. Here h_t^* is a quantity determined by formulas (3) and (4) at time step t from the source text encoding hidden state h and the decoding hidden state, s_t is the decoding hidden state at time step t, x_t is the input of the decoding unit at time step t, i.e. the output of the output unit at the last time step t-1, and a'_t is the encoded attention distribution of the external information output by the attention generating unit 400.
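The following sketch illustrates the kind of mixing described above for equation (10). The way the three weights are obtained here (fixed placeholder values) merely stands in for equation (11); the mapping of source and external-information positions to vocabulary ids and all sizes are likewise assumptions for illustration.

import numpy as np

def mix_output_distribution(p_vocab, a_t, a_ext_t, src_ids, ext_ids, weights):
    # weights = (P_generator, P_pointer, P_T); assumed non-negative and summing to 1
    p_gen, p_ptr, p_ext = weights
    p = p_gen * p_vocab
    np.add.at(p, src_ids, p_ptr * a_t)          # copy probability mass from the source text attention
    np.add.at(p, ext_ids, p_ext * a_ext_t)      # copy probability mass from the external information attention
    return p

vocab_size = 20
rng = np.random.default_rng(1)
p_vocab = rng.dirichlet(np.ones(vocab_size))    # generation probability distribution (placeholder)
a_t = rng.dirichlet(np.ones(5))                 # source text attention at the current time step
a_ext_t = rng.dirichlet(np.ones(3))             # external information attention at the current time step
src_ids = rng.integers(0, vocab_size, size=5)   # vocabulary id of each source word
ext_ids = rng.integers(0, vocab_size, size=3)   # vocabulary id of each external information word

p_w = mix_output_distribution(p_vocab, a_t, a_ext_t, src_ids, ext_ids, (0.6, 0.3, 0.1))
print(round(p_w.sum(), 6))                      # 1.0: the result is still a probability distribution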
In some embodiments, the output unit may also consider probability distribution results determined in other ways when determining the output word probability distribution. The importance of each word in the source text I in the source text may be determined, for example, by considering the relevance between a plurality of sentence vectors formed by word vectors in the source text I. Outputting the word probability distribution may also include forming a word probability distribution using the importance described above. Those skilled in the art will recognize that the manner in which the output word probability distribution is generated is not limited thereto, and that the output word probability distribution may also include various forms of word probability distributions without departing from the principles of the present disclosure.
In the above manner, the attention generating unit 400 can determine the attention distribution a'_t of the external information at the current time step and use the attention distribution a'_t of the external information at the current time step to determine the output probability distribution of the current time step. In the embodiments provided by the present disclosure, determining the output probability distribution of the current time step using the attention distribution of the external information, rather than the features of the external information themselves, can prevent invalid information in the features of the external information from affecting the output probability distribution of the current time step.
FIG. 5 shows another illustrative embodiment of a text processing apparatus according to embodiments of the present application.
As shown in fig. 5, the text processing apparatus 500 may include an encoding unit 510, a decoding unit 520, an attention generating unit 530, an output unit 540, and a post-processing unit 550. The encoding unit 510, the decoding unit 520, the attention generating unit 530, and the output unit 540 may be implemented as the encoding unit 110, the decoding unit 120, the attention generating unit 130, and the output unit 140 described with reference to fig. 1 to 3, and are not described herein again.
The post-processing unit 550 may be configured to post-process the candidate text according to the external information to determine an output text containing the external information.
As described above, by implementing the encoding unit 510, the decoding unit 520, the attention generating unit 530, and the output unit 540 as the encoding unit 110, the decoding unit 120, the attention generating unit 130, and the output unit 140 described in conjunction with figs. 1 to 3, the output unit 540 may output a text processing result that already contains the external information.
In the case where the external information is already contained in the result output by the output unit 540, the result output by the output unit 540 may be directly used as the result of the text processing.
In the case where the result output by the output unit 540 does not include external information yet, the result output by the output unit 540 may be regarded as a candidate text, and the candidate text may be post-processed by the post-processing unit 550 according to the external information to determine an output text including the external information.
In some embodiments, the external information may be information specified in advance. For example, the external information may be a pre-specified sentence, or a sentence in the source text containing a pre-specified word.
In the case where the predetermined external information is a sentence, the post-processing unit 550 may be configured to determine a similarity between the sentence in the candidate text and the external information. When the similarity is greater than a preset candidate similarity threshold, the sentence in the candidate text may be replaced with the external information.
If the predetermined external information is a word, the post-processing unit 550 may be configured to determine a similarity between a sentence containing the external information and a sentence in the candidate text, and may replace the sentence in the candidate text with the external information when the similarity is greater than a preset candidate similarity threshold.
In some implementations, when the similarity is greater than the preset candidate similarity threshold, the post-processing unit 550 may be configured to delete the sentence in the candidate text and replace the deleted sentence with the sentence serving as the external information or with a sentence containing the word serving as the external information.
In some examples, the external information may be inserted into the candidate text based on the relevance, in the source text, between the external information and the remaining information in the candidate text. For example, the external information may be inserted into the remaining information of the candidate text according to the order in which the external information and the remaining information of the candidate text appear in the source text.
In other implementations, when the similarity is smaller than a preset candidate similarity threshold, the post-processing unit 550 may be configured to insert external information into the candidate text according to the relevance of the external information and sentences in the candidate text in the source text.
The similarity between the external information and each sentence in the candidate text may be compared. If the similarity of each sentence in the external information and the candidate text is smaller than a preset candidate similarity threshold, the generated text processing result does not include information similar to the external information. In this case, the final text processing result may be determined by directly concatenating the external information and the candidate text.
For example, the external information may be inserted into the candidate texts according to the order in which sentences among the external information and the candidate texts appear in the source text to determine a final text processing result.
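A minimal sketch of this post-processing, assuming a sentence-embedding helper sent_embed, a cosine similarity, and an illustrative threshold; none of these names or values are prescribed by the present application.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def post_process(candidate_sents, ext_sent, sent_embed, source_order, threshold=0.6):
    # replace the candidate sentence most similar to the external information, or insert
    # the external information according to its order of appearance in the source text
    e = sent_embed(ext_sent)
    sims = [cosine(e, sent_embed(s)) for s in candidate_sents]
    best = int(np.argmax(sims))
    if sims[best] > threshold:
        out = list(candidate_sents)
        out[best] = ext_sent                     # replace the similar sentence
        return out
    # no sufficiently similar sentence: insert by order of appearance in the source text
    return sorted(candidate_sents + [ext_sent],
                  key=lambda s: source_order.get(s, len(source_order)))

# toy example with two-dimensional "embeddings"
emb = {"s1": np.array([1.0, 0.0]), "s2": np.array([0.0, 1.0]), "ext": np.array([0.9, 0.1])}
print(post_process(["s1", "s2"], "ext", emb.get, {"s1": 0, "s2": 2, "ext": 1}))  # ['ext', 's2']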
By using the text processing device provided by the application, the content of the external information can be effectively added to the text processing result, thereby ensuring that the content of the external information expected to appear is included in the text processing result.
As described above, the encoding unit 110, the decoding unit 120, and the attention generating unit 130 in the text processing apparatus all include parameters to be trained, as described in conjunction with fig. 1 to 3. Therefore, at least one of the encoding unit 110, the decoding unit 120, and the attention generating unit 130 needs to be trained by machine learning.
In some embodiments, the encoding unit, the attention generating unit, and the decoding unit may be trained using a preset training set of source texts. Wherein the source text training set comprises a plurality of training source texts.
The training source text may be processed by the text processing apparatus shown in fig. 1 to obtain a training text processing result for the training source text. For example, the training source text may be encoded with an encoding unit to obtain a hidden state of the training source text encoding. The training decoding hidden state may then be determined with the decoding unit. Then, a training attention distribution for a current time step may be determined based on the extrinsic information, the training source text encoding hidden state, and the training decoding hidden state using an attention generation unit. And determining training output word probability distribution according to the training attention distribution, the training source text encoding hidden state and the training decoding hidden state by using an output unit so as to determine training output words.
Parameters in the encoding unit, the attention generating unit, and the decoding unit may be adjusted to minimize a loss function used in a training process to implement training for the encoding unit, the attention generating unit, and the decoding unit.
In some examples, the loss function loss used in the training process may be implemented as equation (12).
In equation (12), the first term depends on the probability value of the positive-solution word at time step t in the training output word probability distribution at time step t, and the second term is the difference between the probability value of the positive solution in the training output word probability distribution and the probability value of the external information in the training output word probability distribution. When a word belonging to the external information appears in the training output words, the value of this difference term is smaller; when a word belonging to the external information does not appear in the training output words, its value is larger.
In other examples, the loss function loss used in the training process may be implemented using equations (13), (14).
In equations (13) and (14), T is the total number of time steps in the text processing process and t denotes the current time step. The per-time-step loss combines: a negative log-likelihood/cross-entropy term based on the probability value of the positive-solution word at time step t in the training output word probability distribution at time step t; a convergence mechanism loss term of the source text, computed from the source text attention distribution a^t of the current time step t and c^t, the sum of the source text attention distributions of all previous time steps; a term that is the difference between the probability value of the positive solution in the training output word probability distribution and the probability value of the external information in the training output word probability distribution; and a convergence mechanism loss term of the external information, computed analogously from the external information attention distribution a'^t of the current time step t and the sum of the external information attention distributions of all previous time steps. γ and β are preset hyper-parameters.
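For illustration only, the following sketch computes a per-time-step loss in the spirit of equation (14); the exact placement of the hyper-parameters γ and β and the use of the maximum external-information probability are assumptions, since the original expressions are not reproduced above.

import numpy as np

def step_loss(p_t, gold_id, ext_ids, a_t, cov_t, a_ext_t, cov_ext_t, gamma, beta):
    # p_t: training output word probability distribution at time step t
    nll = -np.log(p_t[gold_id] + 1e-12)                   # negative log-likelihood / cross-entropy term
    cov_src = np.minimum(a_t, cov_t).sum()                # convergence mechanism loss of the source text
    diff = p_t[gold_id] - p_t[ext_ids].max()              # gap to the external-information probability (assumed form)
    cov_ext = np.minimum(a_ext_t, cov_ext_t).sum()        # convergence mechanism loss of the external information
    return nll + gamma * cov_src + diff + beta * cov_ext  # gamma/beta placement is an assumption

p_t = np.array([0.1, 0.6, 0.1, 0.2])
print(step_loss(p_t, gold_id=1, ext_ids=np.array([3]),
                a_t=np.array([0.5, 0.5]), cov_t=np.array([0.2, 0.8]),
                a_ext_t=np.array([1.0]), cov_ext_t=np.array([0.3]),
                gamma=1.0, beta=1.0))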
Fig. 6 shows a schematic flow diagram of a text processing method according to the present application. As shown in fig. 6, in step S602, the source text may be encoded to obtain a source text encoding hidden state. In some embodiments, the source text may be encoded using an encoding network. Exemplary encoding networks include Long Short-Term Memory (LSTM) networks; systems based on LSTM networks may be used for tasks such as machine translation and text summarization. It will be appreciated that the encoding network may also be implemented as any machine learning model capable of encoding word vectors.
For example, using at least one word vector corresponding to the source text I as input, the encoding network may output source text encoding hidden states h_1, h_2, h_3, … corresponding to the word vectors x_1, x_2, x_3, …. The number of source text encoding hidden states and the number of word vectors of the source text may be the same or different. For example, when k word vectors are generated from the source text I, the encoding network processes the k word vectors to generate k corresponding source text encoding hidden states, where k is an integer greater than one.
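By way of a non-limiting example, the following PyTorch sketch encodes k word vectors of the source text into k encoding hidden states with an LSTM; the embedding and hidden sizes and the random inputs are assumptions for illustration.

import torch
import torch.nn as nn

emb_dim, hid_dim, k = 128, 256, 7               # assumed sizes; k word vectors in the source text I
encoder = nn.LSTM(input_size=emb_dim, hidden_size=hid_dim, batch_first=True)

x = torch.randn(1, k, emb_dim)                  # word vectors x_1 ... x_k (placeholder values)
h, _ = encoder(x)                               # h: (1, k, hid_dim), one encoding hidden state per word vector
assert h.shape[1] == k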
In step S604, a decoding hidden state may be determined. In some embodiments, the decoding unit 120 may be configured to receive the decoding hidden state s_{t-1} at the last time step t-1 and the output word x_t obtained by the text processing device at the last time step, and to process s_{t-1} and x_t to obtain the decoding hidden state s_t at the current time step. In the processing of the first time step, s_0 and x_1 may be determined as default initial values. The decoding hidden state s may also comprise a plurality of decoding hidden states s_1, s_2, s_3, … corresponding to the source text. Exemplary decoding networks include long short-term memory networks. It will be appreciated that the decoding network may also be implemented as any machine learning model capable of decoding the output of the encoding network.
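Similarly, a single decoding step may be sketched as follows, where the previous decoding hidden state and the embedding of the previous output word are processed by an LSTM cell; the sizes and the zero initial values are assumptions for illustration.

import torch
import torch.nn as nn

emb_dim, hid_dim = 128, 256                     # assumed sizes
decoder_cell = nn.LSTMCell(input_size=emb_dim, hidden_size=hid_dim)

s_prev = torch.zeros(1, hid_dim)                # decoding hidden state of the last time step (default at the first step)
c_prev = torch.zeros(1, hid_dim)                # cell state of the last time step
x_t = torch.randn(1, emb_dim)                   # embedding of the word output at the last time step (placeholder)

s_t, c_t = decoder_cell(x_t, (s_prev, c_prev))  # decoding hidden state at the current time step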
In step S606, the attention distribution of the current time step may be determined according to the external information, the source text encoding hidden state and the decoding hidden state.
In some embodiments, the attention distribution A_t for the current time step t may be the encoded attention distribution of the source text. For example, the encoded attention distribution a_t of the source text can be determined using equations (1) and (2).
In other embodiments, the attention distribution A_t of the current time step containing the external information may be determined from the external information and the attention distribution a_t of the source text determined by equation (1), and the attention distribution A_t containing the external information may be output for the subsequent text processing procedure.
Fig. 7 shows a schematic flow diagram of determining an attention distribution for a current time step from extrinsic information according to an embodiment of the application.
In step S702, an encoded attention distribution of the source text may be determined according to the source text encoding hidden state and the decoding hidden state. In some embodiments, the encoded attention distribution a_t of the source text may be determined using the foregoing equation (1).
In step S704, a selection probability distribution for the source text may be determined from external information, the selection probability distribution including a selection probability for each word in the source text.
In some embodiments, the source text I may be processed using a content selection network (e.g., an LSTM network) to determine a first selection probability for each word in the source text.
The content selection network is capable of processing the source text I and outputting a result of a first selection probability for each word in the source text, wherein the first selection probability represents a probability that the word in the source text I is selected to appear in a final text processing result according to the content selection network.
In some embodiments, for at least one word belonging to the external information in the source text, a selection probability of the at least one word may be determined at least as a predefined probability value λ. For example, the second selection probability of each word belonging to the external information in the source text may be determined as a predefined probability value λ, and the second selection probability of other words not belonging to the external information may be determined as 0.
The selection probability for each word in the source text may be determined based on the first selection probability and the second selection probability described above. For example, a selection probability for each word in the source text may be determined as the sum of the first selection probability and the second selection probability. It follows that for words belonging to external information, their selection probability will be greater than or equal to a predefined probability value λ.
In step S706, for each word in the source text, the attention of the word may be determined according to the selection probability of the word to obtain the attention distribution.
Based on the selection probability distribution, step S706 may include, for each word in the source text, determining the attention of the word according to the selection probability of the word to obtain an attention distribution A. In one embodiment, step S706 may include determining the attention for the word in the attention distribution of the current time step as zero when the selection probability of the word is lower than a preset selection probability threshold ε. Furthermore, step S706 may further include determining the attention for the word in the attention distribution of the current time step as the attention for the word in the encoded attention distribution a_t of the source text when the selection probability of the word is greater than or equal to the preset selection probability threshold ε.
With the above-described method of attention generation, a selection probability can be generated for each word in the source text; that is, both the attention calculated using formula (1) and the selection probability of the word are considered when determining the attention of each word. When the selection probability of a word is lower than the preset selection probability threshold, the probability that the word appears at the current time step is considered low, and therefore the attention of the word need not be considered in the subsequent text processing process.
By setting the selection probability of the words contained in the external information to be at least the predefined probability value lambda, if the predefined probability value is greater than the preset selection probability threshold epsilon, the words in the external information can be ensured not to be filtered in the step of content selection, and the words in the external information can enter the subsequent text processing process, so that the probability of the words in the external information appearing in the text processing result is improved. It will be appreciated that in some implementations, the predefined probability value λ may also be set to be less than or equal to a preset selection probability threshold ε. In this case, by determining the selection probability of each word as the sum of the first selection probability and the second selection probability described above, it is also possible to achieve the effect of increasing the selection probability of words in the external information and improving the probability of the words in the external information appearing in the text processing result.
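A minimal sketch of steps S704 to S706, assuming placeholder values for the first selection probabilities output by the content selection network, the predefined probability value λ, and the selection probability threshold ε.

import numpy as np

def select_and_mask(first_sel, ext_mask, a_t, lam=0.9, eps=0.3):
    # second selection probability: lam for words belonging to the external information, 0 otherwise
    sel = first_sel + lam * ext_mask            # selection probability of each word in the source text
    attn = np.where(sel >= eps, a_t, 0.0)       # attention kept only where the selection probability is high enough
    return sel, attn

first_sel = np.array([0.1, 0.5, 0.2, 0.8])      # first selection probabilities from a content selection network (placeholders)
ext_mask = np.array([1.0, 0.0, 0.0, 1.0])       # words 0 and 3 belong to the external information
a_t = np.array([0.25, 0.25, 0.25, 0.25])        # encoded attention distribution (placeholder)
print(select_and_mask(first_sel, ext_mask, a_t))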
FIG. 8 shows another schematic flow diagram for determining an attention distribution for a current time step from extrinsic information according to an embodiment of the application.
In step S802, the encoding attention of the source text at the current time step may be determined.
In some embodiments, the encoded attention parameter for each word in the source text may be determined using equation (2), based on the source text encoding hidden state for the current time step and the decoding hidden state for the current time step. The encoded attention distribution a_t of the source text can then be determined using the aforementioned formulas (1) and (2).
In step S804, the extrinsic information coding attention of the current time step may be determined.
In some embodiments, an extrinsic attention parameter may be determined for each word in the source text, wherein an extrinsic attention parameter for words belonging to the extrinsic information is determined as a first extrinsic attention parameter, and an extrinsic attention parameter for words not belonging to the extrinsic information is determined as a second extrinsic attention parameter. In some implementations, the first external attention parameter can be set to λ 'and the second external attention parameter can be set to 0, where λ' can be a value greater than 0.
For example, the attention parameter for each word in the source text may be determined by summing the encoded attention parameter and the external attention parameter for that word. An attention distribution for the current time step of the source text may then be determined based on these attention parameters; for example, by applying the softmax function to the attention parameters, the attention of the current time step for each word of the source text can be obtained.
In this way, the attention parameters of words belonging to the external information in the source text can be adjusted through the predefined external attention parameters, thereby realizing attention adjustment for the words belonging to the external information. It is understood that, when the first external attention parameter is set to a hyperparameter λ' greater than 0 and the second external attention parameter is set to 0, the attention distribution over the words of the source text is adjusted based on the external information so that words belonging to the external information receive greater attention.
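A minimal sketch of this adjustment, assuming the first external attention parameter λ' = 2.0 and the second external attention parameter 0; the encoded attention parameters here are placeholder values standing in for the outputs of formula (2).

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adjusted_attention(enc_scores, ext_mask, lam1=2.0, lam2=0.0):
    # add the external attention parameter (lam1 for external-information words, lam2 otherwise)
    # to the encoded attention parameters before applying the softmax function
    return softmax(enc_scores + np.where(ext_mask > 0, lam1, lam2))

enc_scores = np.array([0.2, -0.5, 1.0, 0.1])    # encoded attention parameters from formula (2) (placeholder values)
ext_mask = np.array([0, 1, 0, 1])               # words 1 and 3 belong to the external information
print(adjusted_attention(enc_scores, ext_mask))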
In some embodiments, the external information may be encoded to obtain an external information encoding hidden state h'. The encoded attention distribution a'_t of the external information may then be determined by the above equations (1) and (2), where the source text encoding hidden state h in equations (1) and (2) should be replaced with the external information encoding hidden state h'.
In some implementations, when the encoded attention distribution a'_t of the external information and the encoded attention distribution a_t of the source text are computed separately, equations (1) and (2) with shared parameters may be used; that is, the parameters v^T, W_h, W_s, b_attn used when computing the encoded attention distribution a'_t of the external information and the encoded attention distribution a_t of the source text may be identical. In other implementations, separate parameters may be trained for computing the encoded attention distribution a'_t of the external information and for computing the encoded attention distribution a_t of the source text; that is, the parameters v^T, W_h, W_s, b_attn used when computing a'_t and a_t may be different.
Referring back to fig. 6, in step S608, an output word probability distribution may be determined according to the attention distribution, the source text encoding hidden state, and the decoding hidden state.
The output word probability distribution may include a generation probability distribution P_vocab. The generation probability distribution P_vocab can be determined using formula (3) and formula (4).
In some embodiments, the output word probability distribution may also include the attention distribution A_t for the current time step.
For example, the output word probability distribution may be determined as a weighted sum of the generation probability distribution and the attention distribution A_t.
In some implementations, the weight P_gen for the generation probability distribution and the attention distribution may be determined according to the source text encoding hidden state, the decoding hidden state, the attention distribution of the current time step, and the output of the decoding network at the last time step.
For example, the weight P_gen used for the weighted summation of the generation probability distribution and the attention distribution can be expressed as equation (5).
When the attention distribution A_t includes both the attention distribution a_t of the source text and the attention distribution a'_t of the external information, the weighting parameters of the attention distribution a_t of the source text and of the attention distribution a'_t of the external information may be the same or different.
When the attention distribution A_t includes both the attention distribution a_t of the source text and the attention distribution a'_t of the external information, the output word probability distribution may be expressed as a weighted average of the generation probability distribution, the attention distribution a_t of the source text, and the attention distribution a'_t of the external information.
In some embodiments, the output word probability distribution may be determined by equation (10).
In some implementations, P_generator, P_pointer, and P_T may be determined based on the encoding hidden state of the source text, the decoding hidden state, the encoded attention distribution of the external information, and the output of the decoding network at the previous time step t-1. For example, P_generator, P_pointer, and P_T can be determined using equation (11).
In some embodiments, step S608 may include determining a word with the highest probability in the output word probability distribution as the output word at the current time step.
FIG. 9 shows a schematic flow diagram of a text processing method according to an embodiment of the application.
As shown in fig. 9, in step S902, the source text may be encoded to obtain a source text encoding hidden state.
In step S904, a decoding hidden state may be determined. In some embodiments, the decoding hidden state may be determined by using step S604 shown in fig. 6, which is not described herein again.
In step S906, an output word probability distribution may be determined according to the external information, the source text encoding hidden state, and the decoding hidden state to determine an output word.
In still other embodiments, step S906 may further include determining an output word for the current time step based on the external information and the output word probability distribution.
In one implementation, step S906 may include determining, according to the external information, a word having a probability greater than or equal to an output probability threshold and belonging to the external information among the candidate output words as a candidate output word for a current time step.
For example, at each time step, at least two words may be determined as candidate output words for the current time step, and then the candidate output words may be used for text processing at the next time step. Similarly, at the next time step, at least two candidate output words may also be determined.
Specifically, taking the number of candidate output words as an example of 2, two candidate output words a, b may be output at time step t. The candidate output words a, b are then used in the text processing at the next time step, and candidate output words c, d at time step t +1 may be determined.
In some embodiments, in determining the candidate output word at each time step, a predetermined number M of words (M is equal to 2 in the above example) having the highest output probability in the output probability distribution may be determined as the candidate output words. Wherein M is an integer of 2 or more.
In other embodiments, in determining the candidate output words at each time step, it may be determined according to a predefined manner that N words with the highest output probability are selected in the output probability distribution, and M words are determined from the N words as candidate output words. Wherein N is an integer greater than M. In some implementations, the value of N may be pre-specified.
In other implementations, an output probability threshold may be predetermined, and M words may be determined as candidate output words from among the N words having output probabilities greater than the output probability threshold.
If there is no word belonging to the external information among the N words having the highest output probability, the M words having the highest output probability among the N words may be determined as candidate output words.
If a word belonging to the external information exists among N words having the highest output probability, if the number N of words belonging to the external information existing among the N words is equal to or greater than M, the M words having the highest output probability and belonging to the external information among the N words may be determined as candidate output words. If the number N of words belonging to the external information among the N words is less than the predetermined number M, the M-N words having the highest output probability among the words belonging to the external information among the N words and the remaining N-N words may be determined as candidate output words.
At least four candidate output sequences ac, ad, bc and bd can be determined by using the candidate output words a and b output by the time step t and the candidate output words c and d of the time step t +1, the output probability of each candidate output sequence can be determined in a joint probability mode, and two of the four candidate output sequences ac, ad, bc and bd with the highest output probability are used as candidate texts after the time step t + 1.
For example, the output probabilities of the candidate output words a, b, c, d may be represented as P_a, P_b, P_c, and P_d. The output probabilities of the candidate output sequences ac, ad, bc, bd may then be denoted as P_ac = P_a * P_c, P_ad = P_a * P_d, P_bc = P_b * P_c, and P_bd = P_b * P_d, respectively. If P_ac > P_ad > P_bc > P_bd, the sequences ac and ad are output at time step t+1 for subsequent text processing.
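The candidate-word selection and the joint-probability ranking described above may be sketched as follows; M, N, the probabilities, and the word ids are assumptions for illustration.

import numpy as np

def pick_candidates(p_t, ext_ids, M=2, N=5):
    # among the N highest-probability words, prefer those belonging to the external information
    top_n = list(np.argsort(p_t)[::-1][:N])
    ext_first = [w for w in top_n if w in ext_ids]
    rest = [w for w in top_n if w not in ext_ids]
    return (ext_first + rest)[:M]

def extend_beams(beams, p_t, candidates):
    # beams: list of (sequence, joint probability); keep the best len(beams) extensions
    ext = [(seq + [w], prob * p_t[w]) for seq, prob in beams for w in candidates]
    ext.sort(key=lambda pair: pair[1], reverse=True)
    return ext[:len(beams)]

p_t = np.array([0.05, 0.4, 0.1, 0.3, 0.1, 0.05])  # output word probabilities at time step t+1 (placeholders)
ext_ids = {3}                                     # word 3 belongs to the external information
beams = [([1], 0.6), ([2], 0.4)]                  # candidate sequences after time step t
cands = pick_candidates(p_t, ext_ids)
print(extend_beams(beams, p_t, cands))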
In some embodiments, the candidate output sequences may also be determined based on external information. For example, a penalty value for the candidate output sequence may be determined using equation (6).
In one implementation, any possible text similarity algorithm may be utilized to determine the similarity between the candidate text sequence generated before time step t and the external information. For example, the similarity between the candidate text sequence generated before the time step t and the external information may be determined using a cosine similarity method.
With equation (6) above, if the similarity between the candidate text sequence generated before time step t and the external information is higher, the penalty value will be used to increase the output probability of the candidate output sequence. In some implementations, the penalty value s(x, y) can be multiplied with or added to the output probability of the candidate output sequence, thereby achieving the effect of determining the candidate output sequence according to the similarity between the candidate text sequence generated before time step t and the external information.
That is, by determining the penalty value for the candidate output sequence as described above from the external information, the probability of the external information appearing in the candidate text sequence can be increased. It is thus possible to increase the probability that the external information appears in the finally output text processing result.
In another implementation, step S906 may include determining a similarity between the external information and the source text encoding hidden state, and determining a word to be output at the current time step according to the similarity between the external information and the source text encoding hidden state.
For example, the extrinsic information may be encoded using an encoding network to obtain an extrinsic information encoding hidden state.
Step S906 may include determining a similarity of the extrinsic information encoding hidden state and the decoding hidden state. When the similarity between the extrinsic information coding hidden state and the decoding hidden state is greater than or equal to a predefined similarity threshold, the extrinsic information may be output as an output of the current time step.
In a case where the external information is a word, the external information may be output as a word of a current time step. In case the external information is a sentence, the external information may be inserted directly after the text sequence that has been generated before the current time step t.
It is to be understood that the text sequence that has been generated before the current time step t may be generated based on the word with the highest probability in the aforementioned output probability distribution, or may be generated according to several candidate words with the highest probability in the output probability distribution. The candidate words may be determined using the process described in the foregoing implementation, and will not be described herein.
When the similarity between the external information coding hidden state and the source text decoding hidden state is smaller than a predefined similarity threshold, determining the output word probability distribution of the current time step according to the result output by the decoding network, and determining the output word of the current time step based on the output word probability distribution of the current time step.
With the above method, when the similarity between the result output by the decoding network and the external information is high, the result output by the decoding network can be directly replaced by the external information. That is, in this case, the result of the text sequence determined after the output of the current time step is the result of inserting the external information after the text sequence determined after the output of the previous time step.
Then, when the next time step is processed, the external information can be encoded by using the decoding network to obtain the decoding hidden state of the next time step, so that the subsequent decoding process can utilize the result of the external information to ensure the semantic consistency between the result obtained by the subsequent decoding and the inserted external information.
In the case where the external information is a word, the decoding hidden state at the previous time step and the external information may be used as input of a decoding network for processing, and the decoding hidden state at the current time step may be obtained.
In the case where the external information includes a plurality of words, a plurality of loop processes may be performed using the decoding network. The input of the decoding network in the first cycle is the first word of the decoding hidden state and the external information of the previous time step, and the input of the decoding network in the later cycle is the next word of the decoding hidden state and the external information obtained in the previous cycle. Each word in the extrinsic information may be processed through multiple cycles to obtain a decoded hidden state containing all extrinsic information as a decoded hidden state for the current time step.
In some implementations, the comparison of the similarity between the extrinsic information encoded hidden state and the decoded hidden state described above is not performed any more after the extrinsic information has been inserted into the text processing result in place of the result output by the decoding network.
In some examples, the similarity threshold may be implemented as a preset function with respect to time step t.
As described above, when the similarity between the extrinsic information encoding hidden state and the decoding hidden state is smaller than the predefined similarity threshold, the above-described operation of replacing the output of the decoding network with extrinsic information as an output is not performed, but the output result is determined according to the output word probability distribution. In this case, in order to increase the probability that external information appears in the final text processing result, the similarity threshold of the current time step may be adjusted to determine an adjusted similarity threshold, where the adjusted similarity threshold is smaller than the similarity threshold of the current time step, and the adjusted similarity threshold is used as the similarity threshold of the next time step.
For example, the similarity threshold may be adjusted using equation (7).
By monotonically decreasing the similarity threshold at each time step, even if the similarity between the external information and the output result of the decoding network is low, the similarity threshold can be lowered to a very low level during text processing, so that the probability that this similarity exceeds the similarity threshold at the current time step increases. That is, the probability that the external information appears in the final text processing result can be increased.
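A minimal sketch of this similarity gate with a decreasing threshold, assuming cosine similarity between the external information encoding hidden state and the decoding hidden state and a simple multiplicative decay standing in for equation (7), which is not reproduced here.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def maybe_insert_external(h_ext, s_t, threshold, decay=0.95):
    # if the external information is similar enough to the current decoding state, output it now;
    # otherwise lower the threshold used at the next time step
    sim = cosine(h_ext, s_t)
    if sim >= threshold:
        return True, threshold
    return False, threshold * decay

rng = np.random.default_rng(2)
h_ext, s_t = rng.normal(size=8), rng.normal(size=8)   # placeholder encoding/decoding hidden states
print(maybe_insert_external(h_ext, s_t, threshold=0.8))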
FIG. 10 shows a schematic flow diagram of a text processing method according to an embodiment of the application.
In step S1002, the source text may be encoded to obtain a source text encoding hidden state.
In step S1004, a decoding hidden state may be determined.
In step S1006, an output word at each time step may be determined according to the source text encoding hidden state and the decoding hidden state to determine candidate texts.
In step S1008, the candidate text may be post-processed according to the external information to determine an output text containing the external information.
In the case where the external information is not included in the result output in step S1006, the result output in step S1006 may be regarded as a candidate text, and the candidate text may be post-processed according to the external information to determine an output text including the external information.
In some embodiments, the external information may be information specified in advance. For example, the external information may be a pre-specified sentence, or a sentence in the source text containing a pre-specified word.
In a case where the predetermined external information is a sentence, a similarity between the sentence in the candidate text and the external information may be determined. When the similarity is greater than a preset candidate similarity threshold, the sentence in the candidate text may be replaced with the external information.
If the predetermined external information is a word, a similarity between a sentence containing the external information and a sentence in the candidate text may be determined, and when the similarity is greater than a preset candidate similarity threshold, the sentence in the candidate text may be replaced with the external information.
In some implementations, when the similarity is greater than the preset candidate similarity threshold, the sentence in the candidate text may be deleted, and the deleted sentence may be replaced with the sentence serving as the external information or with a sentence containing the word serving as the external information.
In some examples, the external information may be inserted into the candidate text based on the relevance, in the source text, between the external information and the remaining information in the candidate text. For example, the external information may be inserted into the remaining information of the candidate text according to the order in which the external information and the remaining information of the candidate text appear in the source text.
In other implementations, when the similarity is smaller than a preset candidate similarity threshold, the external information may be inserted into the candidate text according to the relevance of the external information and the sentences in the candidate text in the source text.
The similarity between the external information and each sentence in the candidate text may be compared. If the similarity of each sentence in the external information and the candidate text is smaller than a preset candidate similarity threshold, the generated text processing result does not include information similar to the external information. In this case, the final text processing result may be determined by directly concatenating the external information and the candidate text.
For example, the external information may be inserted into the candidate texts according to the order in which sentences among the external information and the candidate texts appear in the source text to determine a final text processing result.
By using the text processing method provided by the application, the content of the external information can be effectively added to the text processing result, thereby ensuring that the text processing result includes the content of the external information expected to appear.
By using the text processing method provided by the application, in the text generation process, the attention distribution of the current time step is determined by using the external information and/or the output words of the current time step are determined according to the external information, so that the content of the external information can be effectively considered in the text processing process, the probability of generating the external information is improved in the text generation process, and the effect of generating the text under the condition of considering the external information is improved.
Furthermore, the method or apparatus according to the embodiments of the present application may also be implemented by means of the architecture of a computing device shown in fig. 11. Fig. 11 illustrates the architecture of such a computing device. As shown in fig. 11, the computing device 1100 may include a bus 1110, one or more CPUs 1120, a Read Only Memory (ROM) 1130, a Random Access Memory (RAM) 1140, a communication port 1150 for connecting to a network, an input/output component 1160, a hard disk 1170, and the like. A storage device in the computing device 1100, such as the ROM 1130 or the hard disk 1170, may store various data or files used by the processing and/or communication of the text processing methods provided herein, as well as program instructions executed by the CPU. The computing device 1100 may also include a user interface 1180. Of course, the architecture shown in fig. 11 is merely exemplary, and one or more components of the computing device shown in fig. 11 may be omitted as needed when implementing different devices.
Embodiments of the present application may also be implemented as a computer-readable storage medium. Computer-readable storage media according to embodiments of the present application have computer-readable instructions stored thereon. The computer readable instructions, when executed by a processor, may perform a method according to embodiments of the application described with reference to the above figures. The computer-readable storage medium includes, but is not limited to, volatile memory and/or non-volatile memory, for example. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc.
Those skilled in the art will appreciate that various modifications and improvements may be made to the disclosure herein. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Furthermore, as used in this application and in the claims, the terms "a," "an," and "the" do not denote a limitation of quantity unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
In addition, although various references are made herein to certain elements of a system according to embodiments of the present application, any number of different elements may be used and run on a client and/or server. The units are illustrative only, and different aspects of the systems and methods may use different units.
Furthermore, flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the operations are not necessarily performed exactly in the order shown. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to these processes, or one or several steps of operations may be removed from them.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (10)

1. A text processing apparatus comprising:
the encoding unit is configured to encode the source text to obtain a source text encoding hidden state;
a decoding unit configured to determine a decoding concealment state; and
and the output unit is configured to determine the probability distribution of output words according to the external information, the source text encoding hidden state and the decoding hidden state so as to determine the output words.
2. The text processing apparatus of claim 1, wherein the output unit is further configured to:
and determining the words with the probability greater than or equal to an output probability threshold value and belonging to the external information in the candidate output words as the candidate output words of the current time step according to the external information.
3. The text processing apparatus of claim 2, wherein the output unit is further configured to:
determining a candidate probability of the candidate word based on a joint probability of the candidate output word at the current time step and a candidate sequence determined at a previous time step, and a similarity of the candidate sequence determined at the previous time step and the external information, and
determining a preset number of candidate words with the highest candidate probabilities as output words.
4. The text processing apparatus according to claim 1, wherein
The encoding unit is further configured to encode the external information to obtain an external information encoding hidden state;
the output unit is configured to determine a similarity of the extrinsic information encoding hidden state and the decoding hidden state, and when the similarity is greater than or equal to a similarity threshold of a current time step, the output unit outputs the extrinsic information as an output word.
5. The text processing apparatus of claim 4, wherein the output unit is further configured to:
when the similarity is smaller than the current similarity threshold, the output unit determines the word with the highest probability in the output word probability distribution as the output word at the current time step,
adjusting a similarity threshold for the current time step to determine an adjusted similarity threshold, wherein the adjusted similarity threshold is less than the similarity threshold for the current time step and the adjusted similarity threshold is used as the similarity threshold for the next time step.
6. The text processing apparatus of claim 1, further comprising:
an attention generating unit configured to determine an attention distribution of a current time step based on external information, the source text encoding hidden state, and the decoding hidden state;
the output unit is configured to determine an output word probability distribution according to the attention distribution, the source text encoding hidden state, and the decoding hidden state to determine an output word.
7. The text processing apparatus of any of claims 1-6, wherein the encoding unit and the decoding unit are trained by:
coding the training source text to obtain a hidden state of the training source text code;
determining a training decoding hidden state;
determining output words of the current time step according to external information, the hidden state of the training source text coding and the hidden state of the training decoding; and
parameters in the encoding unit, the decoding unit are adjusted to minimize a difference between a training output word and a word included in extrinsic information.
8. A text processing method, comprising:
encoding the source text to obtain a source text encoding hidden state;
determining a decoding hidden state; and
and determining the probability distribution of output words according to the external information, the source text encoding hidden state and the decoding hidden state so as to determine the output words.
9. A text processing apparatus comprising:
a processor; and
a memory having computer-readable program instructions stored therein,
wherein the text processing method of claim 8 is performed when the computer readable program instructions are executed by the processor.
10. A computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a computer, cause the computer to perform the text processing method of claim 8.
CN201910894207.0A 2019-09-20 2019-09-20 Text processing device, method, apparatus, and computer-readable storage medium Pending CN112613307A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910894207.0A CN112613307A (en) 2019-09-20 2019-09-20 Text processing device, method, apparatus, and computer-readable storage medium
JP2019209173A JP2021051709A (en) 2019-09-20 2019-11-19 Text processing apparatus, method, device, and computer-readable recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910894207.0A CN112613307A (en) 2019-09-20 2019-09-20 Text processing device, method, apparatus, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN112613307A true CN112613307A (en) 2021-04-06

Family

ID=75158275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910894207.0A Pending CN112613307A (en) 2019-09-20 2019-09-20 Text processing device, method, apparatus, and computer-readable storage medium

Country Status (2)

Country Link
JP (1) JP2021051709A (en)
CN (1) CN112613307A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177406B (en) * 2021-04-23 2023-07-07 珠海格力电器股份有限公司 Text processing method, text processing device, electronic equipment and computer readable medium
CN113673241B (en) * 2021-08-03 2024-04-09 之江实验室 Text abstract generation framework system and method based on example learning

Also Published As

Publication number Publication date
JP2021051709A (en) 2021-04-01


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210406