CN112732902A - Cross-language abstract generation method and device, electronic equipment and computer readable medium - Google Patents

Cross-language abstract generation method and device, electronic equipment and computer readable medium

Info

Publication number
CN112732902A
CN112732902A (application CN202110132074.0A)
Authority
CN
China
Prior art keywords
language
word sequence
abstract
generating
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110132074.0A
Other languages
Chinese (zh)
Inventor
Wang Yining (王亦宁)
Liu Shengping (刘升平)
Liang Jiaen (梁家恩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110132074.0A priority Critical patent/CN112732902A/en
Publication of CN112732902A publication Critical patent/CN112732902A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a cross-language abstract generation method and device, electronic equipment and a computer readable medium. The method comprises the following steps: determining language tags of a source language and a target language; generating a sub-word sequence based on the source language document; adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding; inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding; and generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding. Because only one decoder is used to generate the cross-language abstract, resource consumption for model training and deployment is greatly reduced while abstract quality is maintained, alleviating the difficulty of deploying a cross-language abstract system.

Description

Cross-language abstract generation method and device, electronic equipment and computer readable medium
Technical Field
The invention relates to the field of computer information processing, in particular to a method and a device for generating a cross-language abstract, electronic equipment and a computer readable medium.
Background
Cross-language automatic summarization is the task of condensing the core information of a source-language text and organizing the condensed information in the form of a target language. Cross-language summarization can generate, for a source language document (e.g., Chinese), a summary result in another language (e.g., Japanese). Research on cross-language automatic summarization methods is of great significance for application scenarios such as cross-border e-commerce, public opinion analysis and content recommendation. Due to the lack of parallel data, most existing cross-language automatic summarization methods can only be implemented as pipelines, which causes serious error propagation and greatly restricts summary quality.
To alleviate this problem, researchers have tried to build parallel data for cross-language automatic summarization. A typical approach is based on multi-task learning: within a multi-task learning framework, it uses monolingual automatic summarization and machine translation data to improve the performance of the cross-language automatic summarization model, and achieves quite good performance. However, existing methods usually employ multiple decoders to generate abstract results in different languages; the model structure of such multi-decoder methods is complex, and a specific training paradigm is required. Moreover, multi-decoder networks suffer from large parameter scale, high training resource consumption, difficult deployment and similar problems.
Therefore, a new method, an apparatus, an electronic device and a computer-readable medium for generating a cross-language abstract are needed.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention, and therefore it may contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present invention provides a cross-language abstract generation method, device, electronic device and computer-readable medium that use only one decoder to generate the cross-language abstract, greatly reducing the resource consumption of model training and deployment while maintaining abstract quality, thereby alleviating the difficulty of deploying a cross-language abstract system.
Additional features and advantages of the invention will be set forth in the detailed description which follows, or may be learned by practice of the invention.
According to an aspect of the present invention, a cross-language abstract generation method is provided, comprising: determining language tags of a source language and a target language; generating a sub-word sequence based on the source language document; adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding; inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding; and generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding.
In an exemplary embodiment of the invention, the method further comprises: acquiring document data in the source language; and preprocessing the document data to generate the source language document.
In an exemplary embodiment of the invention, generating a sub-word sequence based on the source language document comprises: acquiring the word sequence of the source language document; and performing sub-word segmentation on the word sequence to generate the sub-word sequence.
In an exemplary embodiment of the invention, adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding comprises: generating an input word vector matrix based on the input word sequence; adding the language tag to the input word sequence and the input word vector matrix; and inputting the tagged input word sequence and the input word vector matrix into the encoder to generate the input word sequence encoding.
In an exemplary embodiment of the invention, inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding comprises: inputting the input word sequence encoding into the decoder at its initial state; and the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism.
In an exemplary embodiment of the invention, the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism includes: the output hidden state of the decoder at time t being determined by the decoder input at time t, the hidden state of the encoder, and the n-th layer decoded representation of the t-th word.
In an exemplary embodiment of the invention, generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding comprises: inputting the abstract word sequence encoding into a logistic regression layer of the decoder for a linear transformation; and generating the source language abstract document and the target language abstract document based on the result of the linear transformation.
In an exemplary embodiment of the invention, inputting the abstract word sequence encoding into the logistic regression layer of the decoder for a linear transformation comprises: inputting the abstract word sequence encoding into the decoder to obtain a hidden state; and inputting the hidden state into the logistic regression layer of the decoder for a linear transformation, outputting the probability distribution over the entire vocabulary at time t.
In an exemplary embodiment of the invention, generating the source language abstract document and the target language abstract document based on the result of the linear transformation comprises: taking the word with the maximum probability as the word at time t; and generating the source language abstract document and the target language abstract document based on the words at all time steps.
According to an aspect of the present invention, a cross-language abstract generation apparatus is provided, comprising: a tag module for determining language tags of a source language and a target language; a sequence module for generating a sub-word sequence based on the source language document; an encoding module for adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding; a decoding module for inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding; and a document module for generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding.
According to an aspect of the present invention, an electronic device is provided, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to an aspect of the invention, a computer-readable medium is provided, on which a computer program is stored which, when executed by a processor, carries out the method described above.
According to the cross-language abstract generation method, device, electronic equipment and computer-readable medium of the invention, language tags of a source language and a target language are determined; a sub-word sequence is generated based on the source language document; the language tag is added to the sub-word sequence and the result is input into an encoder to generate an input word sequence encoding; the input word sequence encoding is input into a decoder to generate an abstract word sequence encoding; and a source language abstract document and a target language abstract document are generated based on the abstract word sequence encoding. In this way the cross-language abstract is generated using only one decoder, so resource consumption of model training and deployment is greatly reduced while abstract quality is maintained, and the difficulty of deploying a cross-language abstract system is alleviated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are only some embodiments of the invention and other drawings may be derived from those drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart illustrating a cross-language abstract generation method according to an exemplary embodiment.
FIG. 2 is a flowchart illustrating a cross-language abstract generation method according to another exemplary embodiment.
FIG. 3 is a flowchart illustrating a cross-language abstract generation method according to another exemplary embodiment.
FIG. 4 is a block diagram illustrating a cross-language abstract generation apparatus according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 6 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or flow charts in the drawings are not necessarily required to practice the present invention and are, therefore, not intended to limit the scope of the present invention.
Cross-language summarization refers to generating a summary in another language from a source-language document. Existing generative cross-language abstract methods mainly comprise the following steps: an encoder encodes the source-language document to generate continuous semantic vectors; the decoding end then decodes using these semantic vectors to obtain the abstract result in the target language. Current cross-language summarization generally adopts a framework based on multi-task learning, i.e., abstract results in the source and target languages are generated simultaneously at the decoding end.
Existing methods usually employ multiple decoders to generate abstract results in different languages; the model structure of such multi-decoder methods is complex, and a specific training paradigm is required. Moreover, multi-decoder networks suffer from large parameter scale, high training resource consumption, difficult deployment and similar problems.
In view of these difficulties in the prior art, the present invention provides a cross-language abstract generation method in which the different languages of a multi-task cross-language abstract are generated using only one decoder. By inputting a source language document (e.g., English) into the encoder, the decoder can generate abstract results in two languages (e.g., English and Chinese) respectively, where the English abstract can assist in generating a better Chinese abstract result. The invention is described in detail below with specific examples.
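As a concrete illustration of this single-decoder design, the following Python sketch shows how one trained model could be asked for both abstracts by swapping only the language tag; the names generate_summary, model.encode and model.decode are hypothetical stand-ins, not an API defined by the patent.

    from typing import List

    def generate_summary(model, subwords: List[str], lang_tag: str) -> List[str]:
        # Prepend the language tag and run a single encode-decode pass.
        tagged = [lang_tag] + subwords
        return model.decode(model.encode(tagged))

    # One model, one decoder, two abstracts: only the tag changes.
    # english_abstract = generate_summary(model, doc_subwords, "<2En>")
    # chinese_abstract = generate_summary(model, doc_subwords, "<2Zh>")

The tag values <2En> and <2Zh> follow the convention introduced in S208 below.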
FIG. 1 is a flow diagram illustrating a method for cross-language digest generation in accordance with an exemplary embodiment. The cross-language abstract generation method 10 at least includes steps S102 to S110.
As shown in FIG. 1, in S102, language tags of the source and target languages are determined. This may further comprise: acquiring document data in the source language; and preprocessing the document data to generate the source language document.
In S104, a sub-word sequence is generated based on the source language document. The preprocessed source language document is input into the encoder network, which encodes the document information to obtain an encoded representation of the input word sequence. Specifically, this comprises: acquiring the word sequence of the source language document; and performing sub-word segmentation on the word sequence to generate the sub-word sequence.
In S106, the language tag is added to the sub-word sequence and the result is input into an encoder to generate an input word sequence encoding. This comprises: generating an input word vector matrix based on the input word sequence; adding the language tag to the input word sequence and the input word vector matrix; and inputting the tagged input word sequence and the input word vector matrix into the encoder to generate the input word sequence encoding.
In S108, the input word sequence encoding is input into a decoder to generate an abstract word sequence encoding. This comprises: inputting the input word sequence encoding into the decoder at its initial state; and the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism.
Here, the output hidden state of the decoder at time t is determined by the decoder input at time t, the hidden state of the encoder, and the n-th layer decoded representation of the t-th word.
In S110, a source language abstract document and a target language abstract document are generated based on the abstract word sequence encoding. This comprises: inputting the abstract word sequence encoding into the logistic regression layer of the decoder for a linear transformation; and generating the source language abstract document and the target language abstract document based on the result of the linear transformation.
According to this cross-language abstract generation method, language tags of a source language and a target language are determined; a sub-word sequence is generated based on the source language document; the language tag is added to the sub-word sequence and the result is input into an encoder to generate an input word sequence encoding; the input word sequence encoding is input into a decoder to generate an abstract word sequence encoding; and a source language abstract document and a target language abstract document are generated based on the abstract word sequence encoding. Because the cross-language abstract is generated using only one decoder, resource consumption of model training and deployment is greatly reduced while abstract quality is maintained, alleviating the difficulty of deploying a cross-language abstract system.
It should be clearly understood that the present disclosure describes how to make and use particular examples, but the principles of the present disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 2 is a flowchart illustrating a cross-language abstract generation method according to another exemplary embodiment. The flow 20 shown in FIG. 2 is a detailed description of S106, "adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding", in the flow shown in FIG. 1.
As shown in fig. 2, in S202, a word sequence in a source language document is acquired.
In S204, sub-word segmentation is performed on the word sequence to generate the sub-word sequence. To reduce the influence of out-of-vocabulary words on abstract performance, the word sequence of the source language document is first subjected to sub-word segmentation, so that the input unit of the encoder network is the sub-word sequence.
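The sketch below illustrates this segmentation step, assuming a BPE-style model trained beforehand with the open-source sentencepiece library; the patent does not prescribe a particular sub-word tool, and the model path is a placeholder.

    import sentencepiece as spm

    # "bpe.model" is an illustrative path to a previously trained sub-word model.
    sp = spm.SentencePieceProcessor(model_file="bpe.model")

    def to_subwords(words):
        # Split the source-language word sequence into sub-word units, so that
        # rare words decompose into known pieces instead of out-of-vocabulary tokens.
        return sp.encode(" ".join(words), out_type=str)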
In S206, an input word vector matrix is generated based on the input word sequence. First, define x = [x_1, ..., x_n] as the input word sequence and X = [v_1, ..., v_n] as the input word sequence matrix after word-vector preprocessing, where v_i denotes the vector of the i-th word.
In S208, the language tag is added to the input word sequence and the input word vector matrix. A language-indicating tag l is prepended to the input word sequence: for example, <2En> when training to generate an abstract in the source language (English) and <2Zh> when generating an abstract in the target language (Chinese).
Thus, the tagged word sequence and word vector matrix are x' = [l, x_1, ..., x_n] and X' = [v_l, v_1, ..., v_n], respectively.
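In code, forming x' and X' amounts to prepending the tag id before the embedding lookup, as in this PyTorch sketch; the toy vocabulary and dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    vocab = {"<2En>": 0, "<2Zh>": 1, "the": 2, "cat": 3}  # toy vocabulary
    embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

    def tagged_input(subword_ids, tag_id):
        ids = torch.tensor([tag_id] + subword_ids)  # x' = [l, x_1, ..., x_n]
        return ids, embed(ids)                      # X' = [v_l, v_1, ..., v_n]

    ids, X_prime = tagged_input([2, 3], vocab["<2Zh>"])
    print(X_prime.shape)  # torch.Size([3, 8])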
In S210, the input word sequence with the language tag and the input word vector matrix are input into the encoder to generate the input word sequence encoding.
Define f_enc(·) as the encoder computation unit; the encoded representation of each word passing through the encoder can be calculated as

h_t^n = f_enc(h_t^(n-1), l)

where h_t^n denotes the encoded representation of the t-th word at the n-th layer and l is the language tag of the target language. Using the encoder, the top-layer encoded representation h^N is obtained.
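The patent does not fix a particular architecture for f_enc; the sketch below assumes a standard Transformer encoder, which matches the layered h_t^n notation.

    import torch
    import torch.nn as nn

    d_model = 8
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=2, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    X_prime = torch.randn(1, 3, d_model)  # [batch, tag + n words, d_model]
    h_N = encoder(X_prime)                # top-layer encoded representation h^N
    print(h_N.shape)                      # torch.Size([1, 3, 8])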
The decoder network then uses the h^N obtained in the first step and the attention mechanism module to obtain the encoded representation of the abstract word sequence, fusing the information of the two languages.
First, before decoding to generate the abstract, the initial input state of the decoder network is the hidden state u_l representing the language tag to be generated.
Define f_dec(·) as the decoder computation unit; the hidden state s_t^N output by the decoder at time t is calculated as

s_t^n = f_dec(u_t, h^N, s_t^(n-1))

where u_t denotes the decoder input at time t, h^N denotes the hidden state obtained by the encoder, and s_t^n is the decoded representation of the t-th word at the n-th layer.
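Again assuming a Transformer realization of f_dec (an assumption, not a statement of the patent), cross-attention over h^N supplies the encoder hidden state, and the tag embedding u_l seeds the decoder input.

    import torch
    import torch.nn as nn

    d_model = 8
    dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=2, batch_first=True)
    decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

    h_N = torch.randn(1, 3, d_model)    # encoder memory from the previous step
    u_l = torch.randn(1, 1, d_model)    # initial input: language-tag embedding u_l
    s_N = decoder(tgt=u_l, memory=h_N)  # top-layer decoder state s_t^N
    print(s_N.shape)                    # torch.Size([1, 1, 8])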
By fusing language tag information, the cross-language abstract generation method of the invention realizes cross-language summarization based on a multi-task learning framework using only one decoder. The method incorporates information about the different languages in both the encoder network and the decoder network, guiding the generation of abstracts in the different target languages. On the premise of maintaining abstract quality, it can greatly reduce the resource consumption of model training and deployment, and alleviates the difficulty of deploying a cross-language abstract system.
FIG. 3 is a flowchart illustrating a cross-language abstract generation method according to another exemplary embodiment. The flow 30 shown in FIG. 3 is a detailed description of S110, "generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding", in the flow shown in FIG. 1.
As shown in FIG. 3, in S302, the abstract word sequence encoding is input into the logistic regression layer of the decoder for a linear transformation. This comprises: inputting the abstract word sequence encoding into the decoder to obtain the hidden state; and inputting the hidden state into the logistic regression layer of the decoder for a linear transformation, outputting the probability distribution over the entire vocabulary at time t.
The top-layer hidden state s_t^N output by the decoder passes through a linear transformation f(·):

o_t = f(s_t^N)

where o_t is the input representation of the softmax layer.
In S304, the word corresponding to the maximum probability is taken as the word at time t.
In S306, the source language abstract document and the target language abstract document are generated based on the words at all time steps. The encoded representation of the abstract information is passed through the softmax layer to compute the word sequence of the final abstract result.
The representation o_t obtained by the linear transformation is passed through softmax to output the probability distribution over the entire vocabulary at each time t:

Prob_t = softmax(W o_t + b)
where W and b are trainable parameters of the model, and the output dimension of W equals the vocabulary size.
The word corresponding to the maximum probability is selected as the generation result at time t:

y_t = argmax(Prob_t)

Following the above steps, decoding proceeds step by step to produce the final abstract result Y = [y_1, ..., y_n].
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
FIG. 4 is a block diagram illustrating an apparatus for cross-language digest generation in accordance with an exemplary embodiment. As shown in fig. 4, the cross-language abstract generation apparatus 40 includes: a tag module 402, a sequence module 404, an encoding module 406, a decoding module 408, and a document module 410.
The tag module 402 is used to determine language tags of a source language and a target language. The tag module 402 is also used to acquire document data in the source language and to preprocess the document data to generate the source language document.
The sequence module 404 is configured to generate a sub-word sequence based on the source language document. The sequence module 404 is further configured to acquire the word sequence of the source language document, and to perform sub-word segmentation on the word sequence to generate the sub-word sequence.
The encoding module 406 is configured to add the language tag to the sub-word sequence and input the result into an encoder to generate an input word sequence encoding. The encoding module 406 is further configured to generate an input word vector matrix based on the input word sequence, add the language tag to the input word sequence and the input word vector matrix, and input the tagged input word sequence and the input word vector matrix into the encoder to generate the input word sequence encoding.
The decoding module 408 is configured to input the input word sequence encoding into a decoder to generate an abstract word sequence encoding. The decoding module 408 is further configured to input the input word sequence encoding into the decoder at its initial state, with the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism.
The document module 410 is used to generate a source language abstract document and a target language abstract document based on the abstract word sequence encoding. The document module 410 is further configured to input the abstract word sequence encoding into the logistic regression layer of the decoder for a linear transformation, and to generate the source language abstract document and the target language abstract document based on the result of the linear transformation.
According to this cross-language abstract generation apparatus, language tags of a source language and a target language are determined; a sub-word sequence is generated based on the source language document; the language tag is added to the sub-word sequence and the result is input into an encoder to generate an input word sequence encoding; the input word sequence encoding is input into a decoder to generate an abstract word sequence encoding; and a source language abstract document and a target language abstract document are generated based on the abstract word sequence encoding. Because the cross-language abstract is generated using only one decoder, resource consumption of model training and deployment is greatly reduced while abstract quality is maintained, alleviating the difficulty of deploying a cross-language abstract system.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 500 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one memory unit 520, a bus 530 that couples various system components including the memory unit 520 and the processing unit 510, a display unit 540, and the like.
Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention described in this specification. For example, the processing unit 510 may perform the steps as shown in fig. 1, 2, 3.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM)5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
The memory unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 500' (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) via the network adapter 560. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 6, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer-readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: determining language tags of a source language and a target language; generating a sub-word sequence based on the source language document; adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding; inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding; and generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly to reside in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
Exemplary embodiments of the present invention are specifically illustrated and described above. It is to be understood that the invention is not limited to the precise construction, arrangements, or instrumentalities described herein; on the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for generating a cross-language abstract, comprising:
determining language tags of a source language and a target language;
generating a sub-word sequence based on the source language document;
adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding;
inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding;
and generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding.
2. The generation method of claim 1, further comprising:
acquiring document data generated in the source language;
and preprocessing the document data to generate the source language document.
3. The generation method of claim 1, wherein generating a sequence of subwords based on a source language document comprises:
acquiring a word sequence in a source language document;
and performing sub-word segmentation on the word sequence to generate the sub-word sequence.
4. The generation method of claim 1, wherein adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding comprises:
generating an input word vector matrix based on the input word sequence;
adding the language tag to the input word sequence and the input word vector matrix;
and inputting the tagged input word sequence and the input word vector matrix into the encoder to generate the input word sequence encoding.
5. The generation method of claim 1, wherein inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding comprises:
inputting the input word sequence encoding into the decoder at its initial state;
and the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism.
6. The generation method of claim 5, wherein the decoder generating the abstract word sequence encoding from the input word sequence encoding based on an attention mechanism comprises:
the output hidden state of the decoder at time t being determined by the decoder input at time t, the hidden state of the encoder, and the n-th layer decoded representation of the t-th word.
7. The generation method of claim 1, wherein generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding comprises:
inputting the abstract word sequence encoding into a logistic regression layer of the decoder for a linear transformation;
and generating the source language abstract document and the target language abstract document based on the result of the linear transformation.
8. The generation method of claim 7, wherein inputting the abstract word sequence encoding into the logistic regression layer of the decoder for a linear transformation comprises:
inputting the abstract word sequence encoding into the decoder to obtain a hidden state;
and inputting the hidden state into the logistic regression layer of the decoder for a linear transformation, outputting the probability distribution over the entire vocabulary at time t.
9. The generation method of claim 7, wherein generating the source language abstract document and the target language abstract document based on the result of the linear transformation comprises:
taking the word with the maximum probability as the word at time t;
and generating the source language abstract document and the target language abstract document based on the words at all time steps.
10. An apparatus for generating a cross-language abstract, comprising:
the tag module is used for determining language tags of a source language and a target language;
the sequence module is used for generating a sub-word sequence based on the source language document;
the encoding module is used for adding the language tag to the sub-word sequence and inputting the result into an encoder to generate an input word sequence encoding;
the decoding module is used for inputting the input word sequence encoding into a decoder to generate an abstract word sequence encoding;
and the document module is used for generating a source language abstract document and a target language abstract document based on the abstract word sequence encoding.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN202110132074.0A 2021-01-31 2021-01-31 Cross-language abstract generation method and device, electronic equipment and computer readable medium Withdrawn CN112732902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110132074.0A CN112732902A (en) 2021-01-31 2021-01-31 Cross-language abstract generation method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110132074.0A CN112732902A (en) 2021-01-31 2021-01-31 Cross-language abstract generation method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN112732902A true 2021-04-30

Family

ID=75594957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110132074.0A Withdrawn CN112732902A (en) 2021-01-31 2021-01-31 Cross-language abstract generation method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN112732902A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679015B1 (en) * 2015-12-28 2020-06-09 Amazon Technologies, Inc. Utilizing artificial intelligence-based machine translation to augment document summarization
CN111382261A (en) * 2020-03-17 2020-07-07 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111400454A (en) * 2020-03-17 2020-07-10 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN112214996A (en) * 2020-10-13 2021-01-12 华中科技大学 Text abstract generation method and system for scientific and technological information text

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204944A (en) * 2021-06-02 2021-08-03 云知声智能科技股份有限公司 Text generation method, device, equipment and storage medium
CN113408272A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training abstract generation model
CN113408272B (en) * 2021-06-30 2023-08-18 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of abstract generation model
CN116187324A (en) * 2023-04-28 2023-05-30 西湖大学 Method, system and medium for generating cross-language abstract for long text of source language
CN116187324B (en) * 2023-04-28 2023-08-22 西湖大学 Method, system and medium for generating cross-language abstract for long text of source language

Similar Documents

Publication Publication Date Title
CN110287278B (en) Comment generation method, comment generation device, server and storage medium
CN112732902A (en) Cross-language abstract generation method and device, electronic equipment and computer readable medium
US11899699B2 (en) Keyword generating method, apparatus, device and storage medium
CN109635197B (en) Searching method, searching device, electronic equipment and storage medium
CN111401079A (en) Training method and device of neural network machine translation model and storage medium
CN111506725B (en) Method and device for generating abstract
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
CN112836040B (en) Method and device for generating multilingual abstract, electronic equipment and computer readable medium
CN112446211A (en) Text processing device, method, apparatus, and computer-readable storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
US20240078385A1 (en) Method and apparatus for generating text
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN111814479A (en) Enterprise short form generation and model training method and device
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN113434642B (en) Text abstract generation method and device and electronic equipment
CN115099233A (en) Semantic analysis model construction method and device, electronic equipment and storage medium
CN111241273A (en) Text data classification method and device, electronic equipment and computer readable medium
CN111475635B (en) Semantic completion method and device and electronic equipment
CN113761895A (en) Text abstract generation method and device, electronic equipment and storage medium
CN112711943A (en) Uygur language identification method, device and storage medium
CN115587184A (en) Method and device for training key information extraction model and storage medium thereof
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN114676247A (en) Complaint prediction method, model building method and device thereof and related equipment
CN113673247A (en) Entity identification method, device, medium and electronic equipment based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210430