CN113743089A - Multilingual text generation method, device, equipment and storage medium - Google Patents

Multilingual text generation method, device, equipment and storage medium

Info

Publication number
CN113743089A
Authority
CN
China
Prior art keywords
multilingual
text
word list
generation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111033454.5A
Other languages
Chinese (zh)
Inventor
陈梦楠
高丽
祖漪清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority: CN202111033454.5A
Publication: CN113743089A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The application provides a multilingual text generation method, apparatus, device, and storage medium. The method comprises: acquiring a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word; and generating multilingual text on the basis of the multilingual word list using a pre-established multilingual text generation model, wherein the model takes as its generation target multilingual text that conforms to the characteristics of real multilingual text. The method can generate multilingual text that is fluent, natural, and consistent with human expression habits.

Description

Multilingual text generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of text generation technology, and in particular to a method, apparatus, device, and storage medium for generating multilingual text.
Background
Text generation is a challenging research direction in natural language processing, with many wide-ranging application scenarios. In recent years, text generation has made major progress in information extraction, dialogue systems, novel writing, and advertising copy generation.
With the development of globalization, the mixing of different languages within a single text or utterance has become increasingly common in important text generation scenarios such as daily communication and informal messaging. In addition, fields such as language identification, multilingual speech synthesis, and multilingual speech recognition require large amounts of multilingual text corpora; in real life, however, multilingual text is usually interspersed within monolingual text at a low proportion, so large amounts of multilingual text are difficult to obtain.
In summary, to meet both the needs of text generation applications that must produce multilingual text and the demand of certain fields for large amounts of multilingual text, a multilingual text generation scheme is urgently needed.
Disclosure of Invention
In view of the above, the present application provides a method, apparatus, device, and storage medium for automatically generating multilingual text. The technical solution is as follows:
a multilingual text-generating method comprising:
acquiring a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
and generating multilingual text on the basis of the multilingual word list using a pre-established multilingual text generation model, wherein the multilingual text generation model takes as its generation target multilingual text that conforms to the characteristics of real multilingual text.
Optionally, the multilingual text generation model is the generator network of a generative adversarial network (GAN);
the training goal of the multilingual text generation model is that the discriminator network of the generative adversarial network cannot distinguish whether input multilingual text was generated by the generator network or is real text.
Optionally, the generating a multilingual text based on the multilingual word list includes:
randomly sampling a plurality of entries from the multilingual word list to form a target word list according to language information in the multilingual word list;
and generating a multilingual text according to the target word list.
Optionally, the generating a multilingual text based on the target word list includes:
determining a feature vector of each entry in the target word list and a feature vector of the target word list, wherein the feature vector of the target word list is the feature vector of the whole of all entries in the target word list;
determining a vector containing sentence grammar information as a global planning hidden variable based on the feature vector of the target word list;
and generating a multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list and the feature vector of the target word list.
Optionally, the determining, based on the feature vector of the target word list, a vector containing sentence grammar information as the global planning hidden variable includes:
determining the normal distribution obeyed by the feature vector of the target word list based on the feature vector of the target word list;
and sampling a plurality of values from the normal distribution to obtain the global planning hidden variable.
Optionally, the generating a multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list, and the feature vector of the target word list includes:
determining entries participating in text generation from the target word list as target entries based on the global planning hidden variables and the feature vectors of each entry in the target word list;
and generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable and the feature vector of the target entry.
Optionally, the determining, based on the global planning hidden variable and the feature vector of each entry in the target word list, an entry participating in text generation from the target word list as a target entry includes:
predicting the probability of each entry in the target word list participating in text generation based on the global planning hidden variable and the feature vector of each entry in the target word list;
and determining the entries participating in the text generation in the target word list as target entries based on the probability of each entry participating in the text generation in the target word list.
Optionally, the determining, based on the probability that each entry in the target word list participates in text generation, an entry that participates in text generation in the target word list includes:
if the target word list has entries of which the probability of participating in text generation is greater than a preset probability threshold, determining the entries of which the probability of participating in text generation is greater than the preset probability threshold as the entries participating in text generation;
and if the entry with the probability of participating in the text generation larger than the preset probability threshold does not exist in the target word list, determining the entry with the maximum probability of participating in the text generation as the entry participating in the text generation.
Optionally, the generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable, and the feature vector of the target entry includes:
calculating the mean value of the feature vectors of all target entries;
and decoding the mean value of the feature vectors of all the target entries, the feature vectors of the target word list and the global planning hidden variable to obtain a multilingual text.
Optionally, the decoding the mean of the feature vectors of all target entries, the feature vectors of the target word list, and the global planning hidden variable to obtain a multilingual text includes:
at each decoding moment, determining a text prediction vector at the current decoding moment according to the mean value of the feature vectors of all target entries, the feature vector of the target word list, the global planning hidden variable and the text prediction vector at the previous decoding moment;
predicting, with the text prediction vector of the current decoding moment as the prediction basis, the probability that the word generated at the current decoding moment is each word in the multilingual dictionary;
and determining the word generated at the current decoding moment according to the probability that it is each word in the multilingual dictionary.
Optionally, the process of establishing the multilingual text generation model includes:
generating multilingual text on the basis of the multilingual word list using the generator network of a generative adversarial network as the multilingual text generation model, the result being the multilingual generated text;
inputting the multilingual generated text into the discriminator network of the generative adversarial network to obtain the probability that the multilingual generated text is real text;
determining a language diversity indication value corresponding to the multilingual generated text, wherein the language diversity indication value characterizes the language diversity of the corresponding text;
and updating the parameters of the multilingual text generation model according to the probability that the multilingual generated text is real text and the language diversity indication value corresponding to the multilingual generated text.
Optionally, the determining the language diversity indication value corresponding to the multilingual generated text includes:
calculating the average of the representation vectors of the languages to which the target words in the multilingual generated text belong, as the average language representation vector, wherein the target words are the words of the entries in the multilingual word list that participate in text generation;
and determining the language diversity indication value corresponding to the multilingual generated text according to the number of target words in the multilingual generated text, the representation vectors of the languages to which those target words belong, and the average language representation vector.
A multilingual text-generating apparatus comprising: the system comprises a multilingual word list acquisition module and a multilingual text generation module;
the multilingual word list acquisition module is used for acquiring a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
the multilingual text generation module is used for generating multilingual text on the basis of the multilingual word list using a pre-established multilingual text generation model, wherein the multilingual text generation model performs text generation with multilingual text that conforms to the characteristics of real multilingual text as its generation target.
A multilingual text-generating apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the multilingual text generation method according to any one of the above-mentioned embodiments.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the multilingual text generation method of any one of the preceding claims.
According to the above scheme, the multilingual text generation method, apparatus, device, and storage medium provided by the present application first obtain a multilingual word list, and then generate multilingual text on the basis of that word list using a pre-established multilingual text generation model. The method can generate multilingual text automatically. Because the multilingual text generation model takes as its generation target multilingual text that conforms to the characteristics of real multilingual text, the text it generates conforms to those characteristics as well; that is, the method provided by the present application can generate multilingual text that is fluent, natural, and consistent with human expression habits.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flowchart illustrating a multilingual text-generating method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a process for generating a multilingual text based on a target word list according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating the generation of multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list, and the feature vector of the target word list according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a possible structure of a multilingual text-generating model according to an embodiment of the present application;
FIG. 5 is a diagram illustrating the process by which the input encoding module in the multilingual text generation model processes an input target word list according to an embodiment of the present application;
FIG. 6 is a diagram illustrating the process by which the computation processing module in the multilingual text generation model obtains a plan vector according to an embodiment of the present application;
FIG. 7 is a diagram illustrating the sentence generation module in the multilingual text generation model generating multilingual text according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating the process of establishing the multilingual text generation model according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of training a discriminant network according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a multilingual text-generating apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a multilingual text-generating device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In order to generate multilingual text, the applicant first studied the following idea: adopt a rule-based multilingual text generation scheme, that is, define a large number of multilingual text generation rules and generate multilingual text according to them.
However, while researching rule-based multilingual text generation, the applicant discovered that such schemes have many defects: they require substantial expert knowledge, they struggle to generate large amounts of varied multilingual text, and the generated text is often not fluent or natural enough to match human expression habits.
In view of the problems of the rule-based scheme, the inventors continued their research and finally devised an effective multilingual text generation method that requires little expert knowledge, can automatically generate fluent, natural multilingual text conforming to human expression habits, and can generate a wide variety of multilingual texts. The method can be applied to any scenario in which multilingual text needs to be generated.
The multilingual text generation method provided by the present application can be applied to any electronic device with data processing capability. The electronic device may be a network-side server (a single server, multiple servers, or a server cluster) or a user-side terminal such as a PC, notebook computer, tablet, or smartphone. The following embodiments describe the multilingual text generation method provided by the present application.
First embodiment
Referring to fig. 1, a flow diagram of a multilingual text generation method provided in an embodiment of the present application is shown, where the method may include:
step S101: a multilingual word list is obtained.
The multilingual word list includes a plurality of entries, each entry including a word and language information of the word. An example of a multilingual word list is shown below:
Language    Word
Chinese     I
English     plan
Chinese     hello
English     document
In the table above, <Chinese, I> is an entry, <English, plan> is an entry, <Chinese, hello> is an entry, <English, document> is an entry, and so on.
The multilingual word list in this embodiment may be a word list of two languages (for example, Chinese and English) or a word list of three or more languages. It should be noted that the words in the multilingual word list are words that often appear in text of other languages; for example, the English word "plan" often appears in Chinese text.
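The entry structure described above (a word paired with its language information) can be illustrated with a minimal sketch. The dictionary-based representation and helper function below are purely illustrative assumptions, not part of the disclosed method:

```python
# Illustrative in-memory form of a multilingual word list: each entry
# pairs a word with its language tag, mirroring the <language, word>
# entries described above.
multilingual_word_list = [
    {"lang": "zh", "word": "你好"},      # <Chinese, hello>
    {"lang": "zh", "word": "世界"},      # <Chinese, world>
    {"lang": "en", "word": "plan"},      # <English, plan>
    {"lang": "en", "word": "document"},  # <English, document>
]

def words_by_language(word_list):
    """Group the words of a multilingual word list by language tag."""
    grouped = {}
    for entry in word_list:
        grouped.setdefault(entry["lang"], []).append(entry["word"])
    return grouped
```

Grouping entries by language in this way is convenient for the per-language sampling described later in step S102.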
Step S102: and generating the multilingual text by using a pre-established multilingual text generation model and taking the multilingual word list as a basis.
In this embodiment, generating multilingual text on the basis of a multilingual word list is implemented with a pre-established multilingual text generation model, which performs text generation with multilingual text that conforms to the characteristics of real multilingual text as its generation target. Because of this generation target, the model produces multilingual text that is fluent, natural, and consistent with human expression habits.
In one possible implementation, the multilingual text generation model may be the generator network of a generative adversarial network (GAN), obtained through training. The training goal of the multilingual text generation model is that the discriminator network of the GAN cannot distinguish whether input multilingual text was generated by the generator network or is real text; that is, after text generated by the generator is input to the discriminator, the discriminator judges it to be real text.
The method for generating the multilingual text by utilizing the pre-established multilingual text generation model and taking the multilingual word list as the basis has various implementation modes:
in one possible implementation, the multilingual word list may be directly input to a pre-established multilingual text generation model, and the multilingual text generation model generates the multilingual text according to the input multilingual word list.
Considering that multilingual text generated by directly inputting the multilingual word list into the multilingual text generation model is limited in diversity, this embodiment provides another, preferred implementation to improve the diversity of the generated text:
according to language information in the multilingual word list, randomly sampling a plurality of entries from the multilingual word list to form a target word list, inputting the target word list into a preset multilingual text generation model, and generating the multilingual text according to the input target word list by the multilingual text generation model.
The process of randomly sampling a plurality of entries from the multilingual word list to form the target word list according to the language information in the multilingual word list may include: a plurality of words are sampled from each language word in the multi-language word list, and the entry where the sampled words are located constitutes the target word list.
Illustratively, suppose the multilingual word list obtained in step S101 is {Chinese: "hello", Chinese: "world", English: "hello", English: "world", English: "hi"}. A sampling number can be set for each language (it can be set according to the specific situation and may be the same or different across languages). Assuming the sampling number for each language is 1, 1 word is randomly drawn from the two Chinese words "hello" and "world", and 1 word is randomly drawn from the three English words "hello", "world", and "hi"; one possible result is {Chinese: "world", English: "hello"}.
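The per-language random sampling described above can be sketched as follows. The function name and entry format are assumptions for illustration, not the patent's implementation:

```python
import random

def sample_target_word_list(word_list, per_language=1, seed=None):
    """Randomly draw `per_language` entries from each language group of
    the multilingual word list to form the target word list."""
    rng = random.Random(seed)
    # Group entries by their language tag.
    grouped = {}
    for entry in word_list:
        grouped.setdefault(entry["lang"], []).append(entry)
    # Sample independently within each language group.
    target = []
    for entries in grouped.values():
        k = min(per_language, len(entries))  # never over-sample a language
        target.extend(rng.sample(entries, k))
    return target
```

With the five-entry example above and a per-language sampling number of 1, the result is one Chinese entry and one English entry, chosen at random.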
The multilingual text generation method provided by this embodiment of the present application first obtains a multilingual word list and then generates multilingual text on the basis of that word list using a pre-established multilingual text generation model. The method can generate multilingual text automatically. Because the model takes as its generation target multilingual text that conforms to the characteristics of real multilingual text, the generated text conforms to those characteristics: it is fluent, natural, and consistent with human expression habits.
In addition, when generating multilingual text from the multilingual word list, the method randomly samples a plurality of entries from the word list according to its language information to form a target word list, and then generates the text from the target word list, which makes the generated multilingual text more diverse.
Second embodiment
The above embodiment mentioned that multilingual text may be generated from a multilingual word list, either directly, or by first sampling entries from the word list according to language information to form a target word list and then generating the multilingual text from the target word list.
Referring to fig. 2, a flow diagram illustrating the generation of multilingual text based on a target word list is shown, which may include:
step S201: and determining the characteristic vector of each entry in the target word list and the characteristic vector of the target word list.
In this embodiment, a word embedding vector (i.e., a representation vector) of each entry in the target word list may be determined first. The feature vector of each entry is then determined from its word embedding vector, and after the feature vectors of all entries are obtained, the feature vector of the target word list is determined from them. It should be noted that the feature vector of the target word list is the feature vector of all entries in the target word list taken as a whole.
The process of determining the word embedding vector of any entry in the target word list may include: determining the word embedding vector of the word in the entry and the language embedding vector of the language information in the entry; and concatenating the two, taking the concatenated vector as the word embedding vector of the entry.
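The concatenation of the word embedding and the language embedding described above can be sketched minimally, with vectors as plain Python lists and the dimensions chosen purely for illustration:

```python
def entry_embedding(word_vec, lang_vec):
    """Concatenate a word embedding with the language embedding of its
    entry; the concatenated vector serves as the entry's embedding."""
    return list(word_vec) + list(lang_vec)

# e.g. a 3-dim word vector joined with a 2-dim language vector
# yields a 5-dim entry embedding.
```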
Step S202: and determining a vector containing sentence grammar information as a global plan hidden variable based on the feature vector of the target word list.
The sentence grammar information is syntactic information, including sentence structure information (or sentence constituent information), dependency information between the words of a sentence, and the like.
Specifically, the process of determining a vector containing sentence grammar information as the global planning hidden variable, based on the feature vector of the target word list, may include:
step S2021, determining normal distribution which the feature vector of the target word list accords with based on the feature vector of the target word list.
In particular, a mean μp and a variance σp may be determined based on the feature vector of the target word list, so as to obtain the normal distribution N(μp, σp) with mean μp and variance σp.
Step S2022, sampling a plurality of values from the normal distribution to obtain global planning hidden variables.
A plurality of values sampled from the normal distribution are combined into the global planning hidden variable containing sentence grammar information.
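Sampling the global planning hidden variable from the per-dimension normal distribution can be sketched as follows. This is a stdlib-only illustration; in the actual model the sampling would happen inside a neural network (e.g. as a reparameterized draw), which this sketch does not attempt to reproduce:

```python
import random

def sample_global_planning_latent(mu, sigma, seed=None):
    """Draw one value per dimension from N(mu_i, sigma_i) and stack the
    draws into the global planning hidden variable."""
    rng = random.Random(seed)
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]
```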
Step S203: and generating a multilingual text based on the global planning hidden variables, the feature vector of each entry in the target word list and the feature vector of the target word list.
Referring to fig. 3, a schematic diagram of the process of generating multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list, and the feature vector of the target word list is shown, which may include:
step S301: and determining the entries participating in text generation from the target word list as target entries based on the global planning hidden variables and the feature vectors of each entry in the target word list, and forming a planning vector by the determined feature vectors of the target entries.
Wherein the plan vector characterizes information of terms subsequently participating in text generation.
Specifically, the process of determining the entries participating in text generation from the target word list based on the global planning hidden variables and the feature vector of each entry in the target word list includes:
step S3011, predicting the probability of each entry in the target word list participating in text generation based on the global planning hidden variables and the feature vector of each entry in the target word list.
Predicting, based on the global planning hidden variable and the feature vector of each entry in the target word list, the probability that each entry participates in text generation is essentially a binary classification over each entry in the target word list.
Step S3012, determining the entries participating in the text generation in the target word list based on the probability of each entry participating in the text generation in the target word list.
Specifically, the process of determining the entry participating in the text generation in the target word list based on the probability of each entry participating in the text generation in the target word list may include: if the target word list has entries of which the probability of participating in text generation is greater than a preset probability threshold, determining the entries of which the probability of participating in text generation is greater than the preset probability threshold as the entries participating in text generation; and if the target word list does not have the entry with the probability of participating in the text generation larger than the preset probability threshold, determining the entry with the maximum probability of participating in the text generation as the entry participating in the text generation.
Illustratively, suppose the target word list includes 3 entries d1, d2, and d3, whose probabilities of participating in text generation are P1, P2, and P3 respectively, and suppose the preset probability threshold is Pth. If P1 > Pth, P2 < Pth, and P3 > Pth, then d1 and d3 are determined as the entries that subsequently participate in text generation. If P1, P2, and P3 are all less than Pth, the entry with the highest probability of participating in text generation is chosen; assuming P3 is the largest, d3 is determined as the entry participating in text generation.
The above-mentioned manner of determining the entries participating in text generation in the target word list based on the probability of each entry participating in text generation ensures that, in any case, at least one entry participates in the subsequent text generation.
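The selection rule above can be sketched as follows (an illustrative sketch, not the patent's own code; `select_entries` and its interface are assumed names): keep every entry whose predicted probability exceeds the threshold, and if none qualifies, fall back to the single most probable entry so at least one entry always participates.

```python
def select_entries(probs, threshold):
    # Keep indices of entries whose participation probability exceeds the threshold.
    selected = [i for i, p in enumerate(probs) if p > threshold]
    if not selected:
        # No entry passed the threshold: fall back to the most probable entry,
        # so text generation always has at least one entry to work with.
        selected = [max(range(len(probs)), key=probs.__getitem__)]
    return selected

print(select_entries([0.8, 0.2, 0.9], 0.5))  # d1 and d3 pass the threshold -> [0, 2]
print(select_entries([0.1, 0.2, 0.3], 0.5))  # fallback to the argmax entry -> [2]
```

This mirrors the P1/P2/P3 example above: with probabilities [0.8, 0.2, 0.9] and threshold 0.5, entries d1 and d3 are selected; with all probabilities below the threshold, only the argmax entry is selected.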
Step S302: and generating a multilingual text based on the feature vector, the global planning hidden variable and the planning vector of the target word list.
Specifically, the process of generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable and the plan vector may include: calculating the mean of the feature vectors of all target entries in the plan vector, and decoding the mean of the feature vectors of all target entries, the feature vector of the target word list and the global planning hidden variable to obtain the multilingual text.
More specifically, the process of decoding the mean of the feature vectors of all target entries, the feature vectors of the target word lists, and the global planning hidden variables includes: at each decoding moment, determining a text prediction vector at the current decoding moment according to the mean value of the feature vectors of the target entries, the feature vectors of the target word lists, the global planning hidden variable and a text prediction vector (a vector for predicting a text) determined at the previous decoding moment, and determining words generated at the current decoding moment from a pre-constructed multi-language word dictionary based on the text prediction vector at the current decoding moment.
Third embodiment
The first embodiment mentions that the generation of the multilingual text based on the target word list can be realized by a pre-established multilingual text generation model, and on the basis of the second embodiment, this embodiment focuses on the structure of the multilingual text generation model and the establishment process of the multilingual text generation model.
The present embodiment first introduces the structure of the multilingual text generation model.
Referring to fig. 4, a schematic diagram of a possible structure of the multilingual text generation model provided in this embodiment is shown, which may include: an input encoding module 401, a plan processing module 402, and a sentence generation module 403. Wherein:
the input of the input encoding module 401 is a target word list; after the target word list is input into the input encoding module 401, the input encoding module 401 encodes the target word list and outputs the feature vector of each entry in the target word list, the feature vector of the target word list, and a global planning hidden variable containing sentence grammar information. FIG. 5 shows a schematic diagram of the input encoding module 401 processing the input target word list and outputting the feature vector of each entry in the target word list (h1 to hN in FIG. 5), the feature vector of the target word list, and the global planning hidden variable containing sentence grammar information (zP in FIG. 5).
In one possible implementation, the input encoding module 401 may include: the system comprises a word embedding vector determination sub-module, a feature vector determination sub-module and a global plan hidden variable determination sub-module. The word embedding vector determining submodule is configured to determine a word embedding vector for each entry in the target word list (the specific determination manner of the word embedding vector for each entry may be referred to in the relevant part in the foregoing embodiment); the feature vector determining submodule is used for determining a feature vector of each entry in the target word list and a feature vector of the target word list according to the word embedding vector of each entry in the target word list; the global planning hidden variable determining submodule is configured to determine a global planning hidden variable including sentence grammar information according to the feature vector of the target word list (the specific determination manner of the global planning hidden variable may be referred to in relevant parts in the above embodiments).
Optionally, the feature vector determination sub-module may include a recurrent neural network, such as a bidirectional GRU. The word embedding vector of each entry in the target word list is passed through the recurrent neural network, such as the bidirectional GRU, to obtain the feature vector H of the target word list:

H = [←h1; →hN]    (1)

where x represents the target word list, x = (d1, d2, ..., dN), d1 denotes the 1st entry in the target word list and dN denotes the Nth entry in the target word list; ←h1 denotes the output of the recurrent neural network for the 1st entry when encoding from back to front, and →hN denotes the output of the recurrent neural network for the Nth entry when encoding from front to back.

The feature vector hi of the ith entry di in the target word list is expressed as:

hi = [→hi; ←hi]    (2)
Optionally, the global planning hidden variable determining submodule may include a normal distribution determining submodule and a sampling submodule. The normal distribution determining submodule may be a multilayer perceptron, which is used to determine the mean μp and variance σp according to the feature vector H of the target word list; that is, the feature vector H of the target word list is input into the multilayer perceptron to obtain μp and log σp:

[μp; log σp] = MLPθ(H)    (3)

where MLP represents the multilayer perceptron and θ represents the training parameters of the multilayer perceptron.

After μp and log σp are obtained, σp can be obtained from log σp, so that the mean μp and variance σp are finally obtained, thereby obtaining the normal distribution N(μp, σp²) that the feature vector of the target word list conforms to.
After the normal distribution is obtained, the sampling submodule may sample several values from the normal distribution to obtain the global planning hidden variable. It should be noted that determining the normal distribution that the feature vector of the target word list conforms to, and sampling from that distribution, can be regarded as a process of compressing and reconstructing information to obtain its most condensed form (corresponding to the several values sampled from the normal distribution). Since the reconstructed information is the most essential, most abstract, and takes a global perspective, it can be considered to contain sentence grammar information. It should also be noted that, in the training stage, since the discrimination network is used to supervise the generation network serving as the multilingual text generation model so that it generates text of better quality, the multilingual text generation model can, through continuous learning, finally reconstruct this most essential, most abstract, globally-oriented information, that is, the above global planning hidden variable.
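The sampling step above can be sketched numerically (an illustrative sketch only; the function name and vector values are assumptions, not from the patent): the multilayer perceptron predicts μp and log σp, σp is recovered by exponentiation, and the global planning hidden variable is drawn from N(μp, σp²) via the usual reparameterization z = μ + σ·ε with ε ~ N(0, 1).

```python
import math
import random

def sample_hidden(mu, log_sigma, rng):
    # Reparameterized sample from N(mu, sigma^2), element-wise:
    # sigma is exp(log_sigma); epsilon is standard normal noise.
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

rng = random.Random(0)
# Small log_sigma (-2.0 => sigma ~ 0.135) keeps samples near the mean.
z_p = sample_hidden([0.5, -0.5], [-2.0, -2.0], rng)
```

With a small predicted σp the sampled hidden variable stays close to μp, which is why the sampled values can be viewed as a compact reconstruction of the word-list feature vector.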
The plan processing module 402 is configured to determine, from the target word list, entries participating in text generation as target entries based on the global planning hidden variable and the feature vector of each entry in the target word list, and to form a plan vector from the feature vectors of the determined target entries. Fig. 6 is a schematic diagram illustrating the process of the plan processing module 402 obtaining the plan vector.
Optionally, the plan processing module 402 may include a probability prediction sub-module and a plan vector acquisition sub-module. Optionally, the probability prediction sub-module may be a fully-connected network, which performs a binary classification prediction for each entry in the target word list. Specifically, for an entry di in the target word list, the probability prediction sub-module may predict the probability that entry di participates in text generation by:

P(di ∈ g) = σ(vp tanh(Wp[hi; zP] + bp))    (4)

where hi represents the feature vector of entry di, zP represents the global planning hidden variable, g represents the plan vector, and vp, Wp, bp are trainable parameters.
After the probability prediction submodule predicts the probability of each entry in the target word list participating in text generation, the plan vector acquisition submodule can determine the entry participating in text generation, namely the target entry according to the probability predicted by the probability prediction submodule, and then acquire a plan vector g consisting of the determined feature vectors of the target entry.
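Equation (4) can be sketched with toy two-dimensional vectors (all parameter values below are invented for illustration; this is not the patent's implementation): the entry feature vector hi is concatenated with zP, passed through tanh(W·x + b), and projected to a scalar that a sigmoid turns into a participation probability.

```python
import math

def participation_prob(h_i, z_p, W, b, v):
    x = h_i + z_p  # concatenation [h_i; z_P]
    # Hidden layer: tanh(W x + b), computed row by row.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    # Scalar projection v . hidden, squashed by a sigmoid into (0, 1).
    score = sum(vi * hi for vi, hi in zip(v, hidden))
    return 1.0 / (1.0 + math.exp(-score))

# Toy trainable parameters (2 hidden units over a 4-dim concatenation).
W = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
b = [0.0, 0.0]
v = [1.0, -1.0]
p = participation_prob([0.5, 0.5], [0.1, 0.2], W, b, v)
```

Because the final activation is a sigmoid, the output is always a valid probability in (0, 1), which is what makes equation (4) a binary classification over each entry.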
A sentence generation module 403, configured to generate a multilingual text based on the feature vector H of the target word list, the global planning hidden variable zP and the plan vector g. Fig. 7 shows a schematic diagram of the sentence generation module 403 generating a multilingual text.
Optionally, the sentence generation module 403 may include an average pooling sub-module and a decoding sub-module. The average pooling sub-module is used to perform average pooling on the plan vector g, that is, to calculate the mean g' of the feature vectors of the target entries in the plan vector; the decoding sub-module is used to decode the feature vector H of the target word list, the global planning hidden variable zP and the mean g' of the feature vectors of the target entries in the plan vector to obtain the multilingual text.

When the decoding sub-module decodes, at each decoding time it determines the text prediction vector ht at the current decoding time according to the feature vector H of the target word list, the global planning hidden variable zP, the mean g' of the feature vectors of the target entries in the plan vector, and the text prediction vector ht-1 (a vector used for predicting text) determined at the previous decoding time, and then determines the text generated at the current decoding time according to the text prediction vector at the current decoding time.
When determining the text generated at the current decoding moment according to the text prediction vector at the current decoding moment, determining the words generated at the current decoding moment from a pre-constructed multi-language word dictionary based on the text prediction vector at the current decoding moment. Specifically, the probability that the word generated at the current decoding time is each word in a multilingual word dictionary (all words in several languages) is predicted based on the text prediction vector at the current decoding time, the word generated at the current decoding time is determined according to the predicted probability, and more specifically, the word corresponding to the maximum probability in the determined probabilities is determined as the word generated at the current decoding time.
Optionally, the decoding sub-module may be a Transformer module or a recurrent neural network (e.g., a GRU). When the decoding sub-module is a GRU, the text prediction vector ht at the current decoding time can be expressed as:

ht = GRU([H; zP; g'], ht-1)    (5)

At each decoding time t, the text prediction vector ht-1 of the previous decoding time t-1 is taken as input, and the output text prediction vector ht is fed into a fully-connected layer and a Softmax layer to select the word generated at decoding time t.
It should be noted that the words in the multilingual word list are only words that frequently appear in multilingual text; a complete sentence cannot be formed from these words alone. For example, if a multilingual word list contains "hello" and "China", a complete sentence can only be formed by inserting some other words, such as "this" and "is", to connect them.
In view of this, in the embodiment, the output dimension of the fully-connected layer is set to be the size of the whole multi-language word dictionary, not the size of the multi-language word list, and after the output of the fully-connected layer passes through the Softmax layer, the probability that the word generated at the decoding time t is each word in the multi-language word dictionary can be obtained, and the word corresponding to the maximum probability is determined as the word generated at the decoding time t.
Illustratively, the multilingual word dictionary contains three words, "I", "love" and "China", so the output dimension of the fully-connected layer is set to 3. After the text prediction vector ht at decoding time t passes through the fully-connected layer and the Softmax layer, the probabilities [0.1, 0.3, 0.6] that the word generated at decoding time t is "I", "love" or "China" respectively can be obtained. Since the probability that the word generated at decoding time t is "China" is the largest (0.6), "China" is determined as the word generated at decoding time t.
It should be noted that the output dimension of the fully-connected layer is set to the size of the whole multilingual word dictionary so that every word in the multilingual word dictionary has the possibility of being selected for sentence construction; as for which word is selected at each step, the model learns and adjusts this by itself.
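The word-selection step above can be sketched as follows (an illustrative sketch; the logit values and function name are invented): the fully-connected layer produces one logit per word in the whole multilingual word dictionary, Softmax turns the logits into probabilities, and the argmax word is emitted.

```python
import math

def pick_word(logits, dictionary):
    # Numerically stable Softmax over the full dictionary.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Greedy selection: the word with the largest probability is generated.
    best = max(range(len(probs)), key=probs.__getitem__)
    return dictionary[best], probs

word, probs = pick_word([0.2, 1.3, 2.0], ["I", "love", "China"])
print(word)  # "China" receives the largest Softmax probability
```

This matches the "I"/"love"/"China" example above: whichever dictionary word gets the largest Softmax probability is emitted at that decoding step.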
In addition, when the sentence generation module outputs a special sentence end word, the entire text generation is ended. Alternatively, a sentence end word, such as "EOS", may be set in the multilingual word dictionary, and at a certain decoding time, if the probability that the generated word is the sentence end word is the largest, the sentence end word is determined as the generated word, and the text generation is ended.
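The EOS-based stopping rule can be sketched as a toy greedy decoding loop (the `step_fn` below is a stand-in for the GRU plus Softmax of equation (5); names are illustrative assumptions): decoding continues until the most probable word is the sentence-end token "EOS" or a maximum length is reached.

```python
def greedy_decode(step_fn, max_len=10):
    # step_fn(state) -> (word, new_state); here it stands in for one
    # GRU + fully-connected + Softmax decoding step.
    words, state = [], None
    for _ in range(max_len):
        word, state = step_fn(state)
        if word == "EOS":
            break  # sentence-end word: stop generating
        words.append(word)
    return words

# A scripted decoder that emits two words and then the end token.
script = iter(["hello", "China", "EOS"])
out = greedy_decode(lambda state: (next(script), state))
print(out)  # ['hello', 'China']
```

The `max_len` guard is an added safety bound for the sketch; the patent's stopping condition is the sentence-end word itself.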
The above contents provide a structure of a multilingual text generation model and a process for generating multilingual texts under the structure, it should be noted that the present embodiment does not limit the structure of the multilingual text generation model to the above structure, and the above structure is only an example, and all structures capable of generating multilingual texts in the multilingual text generation manner provided in the second embodiment belong to the protection scope of the present application.
Next, the process of establishing the multilingual text generation model will be described.
Referring to FIG. 8, a flow diagram illustrating a process for creating a multilingual text-generating model is shown, which may include:
Step S801: and generating a multilingual text, as a multilingual generated text, based on the multilingual word list by using the generation network in the countermeasure generation network as the multilingual text generation model.
The process of generating a multilingual text based on the multilingual word list by using the generation network in the countermeasure generation network as the multilingual text generation model is similar to the implementation process of the above-described step S102 of "generating a multilingual text based on the multilingual word list using a pre-established multilingual text generation model", for which reference may be made to the above embodiments; details are not repeated here.
Step S802a: and inputting the multilingual generated text into the discrimination network in the countermeasure generation network to obtain the probability that the multilingual generated text is a real text.

Step S802b: and determining a language diversity indication value corresponding to the multilingual generated text.

The language diversity indication value can represent the language diversity of the corresponding text.
Specifically, the process of determining the language diversity indication value corresponding to the multilingual generated text may include: calculating the average value of the expression vectors of the language to which the target word belongs in the multilingual generation text as an average language expression vector; and determining a language diversity indicated value corresponding to the multilingual generated text according to the number of the target words in the multilingual generated text, the expression vector of the language to which the target words belong in the multilingual generated text and the average language expression vector. The target words are words in entries participating in text generation in the multilingual word list.
More specifically, the language diversity indication value Llang corresponding to the multilingual generated text can be calculated according to the following formula:

Llang = (1/M) Σm=1..M ‖lm − μl‖    (6)

where M represents the number of target words in the multilingual generated text, lm represents the representation vector of the language to which the mth target word in the multilingual generated text belongs, and μl represents the average language representation vector. It should be noted that the larger the language diversity indication value, the more languages participate in text generation; introducing the language diversity indication value during training can prevent the multilingual text generation model from tending to generate single-language sentences.
It should be noted that this embodiment does not limit the calculation of the language diversity indication value to the method shown in equation (6); other methods may also be adopted. For example, the average of the representation vectors of the languages to which the target words in the multilingual generated text belong may be calculated as the average language representation vector; then, for each target word, the absolute value of the difference between the representation vector of the language to which it belongs and the average language representation vector is calculated, the average of all these absolute values is obtained, and this average is used as the language diversity indication value corresponding to the multilingual generated text.
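The diversity indicator can be sketched as follows (a hedged sketch under the assumption that equation (6) averages a distance from the mean language vector; the function name and vectors are illustrative): compute the average language representation vector, then average each target word's language-vector distance from that mean.

```python
def language_diversity(lang_vecs):
    # lang_vecs: one language representation vector per target word.
    m = len(lang_vecs)
    dim = len(lang_vecs[0])
    # Average language representation vector mu_l.
    mu = [sum(v[d] for v in lang_vecs) / m for d in range(dim)]
    # Euclidean distance of each word's language vector from the mean.
    dist = lambda v: sum((vd - md) ** 2 for vd, md in zip(v, mu)) ** 0.5
    return sum(dist(v) for v in lang_vecs) / m

# All target words from one language => zero diversity; mixed => positive.
same = language_diversity([[1.0, 0.0]] * 3)
mixed = language_diversity([[1.0, 0.0], [0.0, 1.0]])
```

The key property is the one the text relies on: a single-language sentence scores zero, and mixing languages raises the indicator, so maximizing it during training discourages single-language output.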
Step S803: and updating parameters of the multilingual text generation model according to the probability that the multilingual generated text is a real text and the language diversity indication value corresponding to the multilingual generated text.
Specifically, the prediction loss of the multilingual text generation model is determined according to the probability that the multilingual generated text is a real text and the language diversity indication value corresponding to the multilingual generated text, and the parameters of the multilingual text generation model are updated according to the prediction loss of the multilingual text generation model.
More specifically, the prediction loss LG of the multilingual text generation model can be calculated as follows:

LG = −E[log D(G(y))] − λLlang    (7)

where y represents the multilingual text generated by the multilingual text generation model, D(G(y)) represents the probability that the multilingual text y is a real text, Llang is the language diversity indication value corresponding to the multilingual text generated by the multilingual text generation model, λ is a hyperparameter balancing the loss terms, and E denotes the expectation.
The multilingual text generation model is iteratively trained multiple times according to the above process until a training end condition is met (for example, a preset number of training iterations is reached, or the performance of the multilingual text generation model meets requirements). The model obtained after training is the established multilingual text generation model.
It should be noted that the generation network as the multilingual text generation model and the discrimination network may be trained separately, and when the discrimination network is trained, the parameters of the generation network as the multilingual text generation model are fixed, and when the generation network as the multilingual text generation model is trained, the parameters of the discrimination network are fixed. The above training process is a process of training a generation network as a multilingual text generation model by fixing parameters of a discrimination network, and then a process of training a discrimination network by fixing parameters of a generation network as a multilingual text generation model is introduced.
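The alternating schedule described above can be sketched as a simple loop (an illustrative sketch; the update callables stand in for real optimization steps and their names are assumptions): each round first updates the discrimination network with the generation network's parameters fixed, then updates the generation network with the discrimination network's parameters fixed.

```python
def train_gan(rounds, update_discriminator, update_generator):
    # Record the alternating update order for inspection.
    log = []
    for _ in range(rounds):
        update_discriminator()  # generation-network parameters held fixed
        log.append("D")
        update_generator()      # discrimination-network parameters held fixed
        log.append("G")
    return log

order = train_gan(2, lambda: None, lambda: None)
print(order)  # ['D', 'G', 'D', 'G']
```

In a real implementation each callable would run one or more gradient steps on losses (7) and (8) respectively; the sketch only captures the fix-one-train-the-other alternation.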
Referring to fig. 9, a schematic flow chart of training a discriminant network is shown, which may include:
step S901: the multilingual text generated by the generation network as the multilingual text generation model is acquired as the multilingual generation text, and the multilingual real text is acquired.
The multilingual real text is collected in the real scene.
Step S902: and inputting the multilingual generated text and the multilingual real text into a discrimination network to obtain the probability that the multilingual generated text is the real text and the probability that the multilingual real text is the real text.
The method utilizes the idea of adversarial generation: the discrimination network is used to push the generation network serving as the multilingual text generation model to generate text that is natural and conforms to human habits. The input of the discrimination network is the multilingual text generated by the generation network serving as the multilingual text generation model and multilingual text collected in real scenarios. Optionally, the discrimination network may employ multilingual BERT combined with a bidirectional LSTM and a fully-connected layer.
Step S903: and updating parameters of the discrimination network based on the probability that the multilingual generated text is the real text and the probability that the multilingual real text is the real text.
Specifically, the prediction loss of the discrimination network is determined according to the probability that the multilingual generated text is the real text and the probability that the multilingual real text is the real text, and the parameters of the discrimination network are updated according to the prediction loss of the discrimination network.
More specifically, the prediction loss LD of the discrimination network may be determined according to the following formula:

LD = −E[log D(x)] − E[log(1 − D(G(y)))]    (8)

where x represents a multilingual real text, y represents a multilingual text generated by the multilingual text generation model, D(x) represents the probability that x is a real text, D(G(y)) represents the probability that y is a real text, and E denotes the expectation.
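The two adversarial losses can be sketched numerically (all D(·) outputs below are invented stand-ins for the discrimination network's probabilities; the sketch assumes the standard log-loss form): LD rewards the discriminator for scoring real text near 1 and generated text near 0, while LG rewards the generator when its text fools the discriminator, minus the λ-weighted diversity term.

```python
import math

def d_loss(d_real, d_fake):
    # Per-sample discriminator loss: -log D(x) - log(1 - D(G(y))).
    return -math.log(d_real) - math.log(1.0 - d_fake)

def g_loss(d_fake, l_lang, lam=0.1):
    # Per-sample generator loss: -log D(G(y)) - lambda * L_lang.
    return -math.log(d_fake) - lam * l_lang

# A discriminator that separates well (0.9 real vs 0.1 fake) incurs a
# smaller loss than an undecided one (0.5 vs 0.5).
print(d_loss(0.9, 0.1), d_loss(0.5, 0.5))
# The generator's loss falls as its text becomes more convincing to D.
print(g_loss(0.8, 1.0), g_loss(0.2, 1.0))
```

The opposing directions of these two losses are what drives the alternating training: improving either network raises the other's loss until the generated multilingual text becomes hard to distinguish from real text.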
The multilingual text generation model provided by the embodiment can generate multilingual texts which are smooth and natural and accord with human expression habits.
Fourth embodiment
The following describes the multilingual text generation apparatus provided in the embodiment of the present application, and the multilingual text generation apparatus described below and the multilingual text generation method described above may be referred to in correspondence with each other.
Referring to fig. 10, a schematic structural diagram of a multilingual text generation apparatus provided in an embodiment of the present application is shown, where the multilingual text generation apparatus includes: a multilingual word list acquisition module 1001 and a multilingual text generation module 1002.
A multilingual word list obtaining module 1001 configured to obtain a multilingual word list, where the multilingual word list includes a plurality of entries, and each entry includes a word and language information of the word;
a multilingual text generation module 1002, configured to generate a multilingual text based on the multilingual word list by using a pre-established multilingual text generation model, where the multilingual text generation model performs text generation with generating a multilingual text that meets the characteristics of real multilingual text as a generation target.
Optionally, the multilingual text generation model adopts a generation network in a countermeasure generation network;
the training goal of the multilingual text generation model is to make the discrimination network in the confrontation generation network unable to distinguish whether the input multilingual text is the text generated by the generation network or the real text.
Optionally, the multilingual text generation module 1002 includes: the system comprises an entry sampling module and a text generation module.
The entry sampling module is used for randomly sampling a plurality of entries from the multilingual word list to form a target word list according to the language information in the multilingual word list;
and the text generation module is used for generating a multilingual text according to the target word list.
Optionally, when the text generation module generates a multilingual text based on the target word list, the text generation module is specifically configured to:
determining a feature vector of each entry in the target word list and a feature vector of the target word list, wherein the feature vector of the target word list is the feature vector of the whole of all entries in the target word list; determining a vector containing sentence grammar information as a global plan hidden variable based on the feature vector of the target word list; and generating a multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list and the feature vector of the target word list.
Optionally, the text generating module, when determining a vector containing sentence grammar information based on the feature vector of the target word list and serving as a global planning hidden variable, is specifically configured to:
determining normal distribution obeyed by the feature vector of the target word list based on the feature vector of the target word list; and sampling a plurality of values from the normal distribution to obtain the global plan hidden variable.
Optionally, the text generating module is specifically configured to, when generating a multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list, and the feature vector of the target word list:
determining entries participating in text generation from the target word list as target entries based on the global planning hidden variables and the feature vectors of each entry in the target word list; and generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable and the feature vector of the target entry.
Optionally, the text generation module, when determining, from the target word list, entries participating in text generation as target entries based on the global planning hidden variable and the feature vector of each entry in the target word list, is specifically configured to:
predicting the probability of each entry in the target word list participating in text generation based on the global planning hidden variable and the feature vector of each entry in the target word list; and determining the entries participating in the text generation in the target word list as target entries based on the probability of each entry participating in the text generation in the target word list.
Optionally, when determining the entry participating in the text generation in the target word list based on the probability that each entry participates in the text generation in the target word list, the text generation module is specifically configured to:
if the target word list has entries of which the probability of participating in text generation is greater than a preset probability threshold, determining the entries of which the probability of participating in text generation is greater than the preset probability threshold as the entries participating in text generation; and if the entry with the probability of participating in the text generation larger than the preset probability threshold does not exist in the target word list, determining the entry with the maximum probability of participating in the text generation as the entry participating in the text generation.
Optionally, the text generating module is specifically configured to, when generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable, and the feature vector of the target entry:
calculating the mean value of the feature vectors of all target entries; and decoding the mean value of the feature vectors of all the target entries, the feature vectors of the target word list and the global planning hidden variable to obtain a multilingual text.
Optionally, the text generation module is specifically configured to, when decoding the mean of the feature vectors of all the target entries, the feature vectors of the target word list, and the global planning hidden variable to obtain a multilingual text:
at each decoding moment, determining a text prediction vector at the current decoding moment according to the mean value of the feature vectors of all target entries, the feature vector of the target word list, the global planning hidden variable and the text prediction vector at the previous decoding moment; predicting the probability that the words generated at the current decoding moment are all words in the multilingual dictionary by taking the text prediction vector at the current decoding moment as a prediction basis; and determining the words generated at the current decoding moment according to the probability that the words generated at the current decoding moment are all words in the multilingual dictionary.
Optionally, the multilingual text generating apparatus provided in the embodiment of the present application may further include: and a model building module. The model building module comprises: the system comprises a multilingual generation text acquisition module, a text discrimination module, a language diversity indicated value determination module and a model parameter updating module.
A multilingual-language-generated-text acquisition module for generating a multilingual text based on the multilingual word list by using a generation network as a multilingual-text generation model in the countermeasure generation network as a multilingual-language-generated text;
the text discrimination module is used for inputting the multilingual generation text into a discrimination network in the countermeasure generation network so as to obtain the probability that the multilingual generation text is a real text;
a language diversity indicated value determining module, configured to determine a language diversity indicated value corresponding to the multilingual generated text, where the language diversity indicated value can represent the language diversity of the corresponding text;
and the model parameter updating module is used for updating the parameters of the multilingual text generation model according to the probability that the multilingual generation text is the real text and the indicated value of the language diversity corresponding to the multilingual generation text.
Optionally, when determining the language diversity indicator value corresponding to the multilingual generated text, the language diversity indicator determination module is specifically configured to:
calculate the mean of the representation vectors of the languages to which the target words in the multilingual generated text belong as an average language representation vector, where the target words are words in entries of the multilingual word list that participate in text generation; and determine the language diversity indicator value corresponding to the multilingual generated text according to the number of target words in the multilingual generated text, the representation vectors of the languages to which those target words belong, and the average language representation vector.
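One plausible reading of this computation is sketched below. The exact formula is not given in the text; this sketch takes the average distance of each target word's language representation vector from the average language representation vector, so that monolingual text scores zero and text mixing several languages scores higher. The function name and the distance choice are assumptions.

```python
import numpy as np

def language_diversity(lang_vecs):
    """lang_vecs: (n, d) array with one language representation vector per
    target word in the generated text. Returns a diversity indicator value."""
    lang_vecs = np.asarray(lang_vecs, dtype=float)
    mean_vec = lang_vecs.mean(axis=0)    # average language representation vector
    # Average distance of each word's language vector from the mean:
    # identical languages give 0, mixed languages give a positive value.
    return float(np.linalg.norm(lang_vecs - mean_vec, axis=1).mean())

# All target words share one language -> indicator is 0 (no diversity).
same = language_diversity([[1.0, 0.0], [1.0, 0.0]])
# Two different languages -> indicator is positive.
mixed = language_diversity([[1.0, 0.0], [0.0, 1.0]])
```

Normalizing by the number of target words (here implicit in the mean) matches the text's statement that the indicator depends on the word count, the language vectors, and their average.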
The multilingual text generation device provided by the embodiment of the present application first obtains a multilingual word list and then generates multilingual text based on that word list using a pre-established multilingual text generation model. The device can generate multilingual texts automatically; and because the model is trained with multilingual text that conforms to the characteristics of real multilingual text as its generation target, the texts it produces conform to those characteristics, that is, they are smooth, natural, and consistent with human expression habits.
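The overall pipeline of the device can be illustrated with a toy interface. The class name `MultilingualTextGenerator`, the `Entry` structure, and the random-sampling stand-in for the trained model are all invented for illustration; the real model decodes a fluent sentence rather than joining sampled words.

```python
import random
from dataclasses import dataclass

@dataclass
class Entry:
    word: str
    language: str   # each entry stores a word plus its language information

class MultilingualTextGenerator:
    """Hypothetical wrapper around a pre-established generation model."""
    def __init__(self, word_list):
        self.word_list = word_list

    def generate(self, k=3):
        # Stand-in for the trained model: sample k entries as the target
        # word list and join their words into a text.
        target = random.sample(self.word_list, k)
        return " ".join(e.word for e in target)

word_list = [Entry("hello", "en"), Entry("你好", "zh"),
             Entry("bonjour", "fr"), Entry("world", "en")]
text = MultilingualTextGenerator(word_list).generate(k=2)
```

The two-step shape — obtain a multilingual word list, then generate text conditioned on it — mirrors the device's module structure described above.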
Fifth embodiment
An embodiment of the present application further provides a multilingual text generating device; please refer to fig. 11, which shows a schematic structural diagram of the device. The multilingual text generating device may include: at least one processor 1101, at least one communication interface 1102, at least one memory 1103, and at least one communication bus 1104;
in the embodiment of the present application, there is at least one each of the processor 1101, the communication interface 1102, the memory 1103, and the communication bus 1104, and the processor 1101, the communication interface 1102, and the memory 1103 communicate with each other through the communication bus 1104;
the processor 1101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;
the memory 1103 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program, and the processor may call the program stored in the memory, the program being configured to:
acquire a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
and generate multilingual text based on the multilingual word list by using a pre-established multilingual text generation model, wherein the multilingual text generation model takes generating multilingual text that conforms to the characteristics of real multilingual text as its generation target.
Optionally, the detailed functions and extended functions of the program may be as described above.
Sixth embodiment
Embodiments of the present application further provide a readable storage medium storing a program suitable for execution by a processor, the program being configured to:
acquire a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
and generate multilingual text based on the multilingual word list by using a pre-established multilingual text generation model, wherein the multilingual text generation model takes generating multilingual text that conforms to the characteristics of real multilingual text as its generation target.
Optionally, the detailed functions and extended functions of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element defined by the phrase "comprising a ..." does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A multilingual text generating method, comprising:
acquiring a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
and generating multilingual text based on the multilingual word list by using a pre-established multilingual text generation model, wherein the multilingual text generation model takes generating multilingual text that conforms to the characteristics of real multilingual text as its generation target.
2. The multilingual text generation method of claim 1, wherein the multilingual text generation model is a generation network in a generative adversarial network;
the training goal of the multilingual text generation model is to make the discrimination network in the generative adversarial network unable to distinguish whether an input multilingual text is text produced by the generation network or real text.
3. The method of claim 1, wherein the generating multilingual text based on the multilingual word list comprises:
randomly sampling, according to the language information in the multilingual word list, a plurality of entries from the multilingual word list to form a target word list;
and generating a multilingual text according to the target word list.
4. The method of claim 3, wherein the generating multilingual text based on the target word list comprises:
determining a feature vector of each entry in the target word list and a feature vector of the target word list, wherein the feature vector of the target word list is the feature vector of the whole of all entries in the target word list;
determining a vector containing sentence grammar information as a global plan hidden variable based on the feature vector of the target word list;
and generating a multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list and the feature vector of the target word list.
5. The method of claim 4, wherein the determining a vector containing sentence grammar information as a global planning hidden variable based on the feature vector of the target word list comprises:
determining, based on the feature vector of the target word list, the normal distribution that the feature vector of the target word list obeys;
and sampling a plurality of values from the normal distribution to obtain the global plan hidden variable.
6. The method of claim 4, wherein the generating multilingual text based on the global planning hidden variable, the feature vector of each entry in the target word list, and the feature vector of the target word list comprises:
determining entries participating in text generation from the target word list as target entries based on the global planning hidden variables and the feature vectors of each entry in the target word list;
and generating a multilingual text based on the feature vector of the target word list, the global planning hidden variable and the feature vector of the target entry.
7. The method of claim 6, wherein the determining entries participating in text generation from the target word list as target entries based on the global planning hidden variable and the feature vector of each entry in the target word list comprises:
predicting the probability of each entry in the target word list participating in text generation based on the global planning hidden variable and the feature vector of each entry in the target word list;
and determining the entries participating in the text generation in the target word list as target entries based on the probability of each entry participating in the text generation in the target word list.
8. The method of claim 7, wherein the determining the entries of the target word list participating in text generation based on the probability of each entry of the target word list participating in text generation comprises:
if the target word list contains entries whose probability of participating in text generation is greater than a preset probability threshold, determining those entries as the entries participating in text generation;
and if no entry in the target word list has a probability of participating in text generation greater than the preset probability threshold, determining the entry with the highest probability of participating in text generation as the entry participating in text generation.
9. The method of claim 6, wherein the generating multilingual text based on the feature vector of the target word list, the global planning hidden variable, and the feature vector of the target entry comprises:
calculating the mean value of the feature vectors of all target entries;
and decoding the mean value of the feature vectors of all the target entries, the feature vectors of the target word list and the global planning hidden variable to obtain a multilingual text.
10. The method of claim 9, wherein the decoding the mean of the feature vectors of all target entries, the feature vector of the target word list, and the global planning hidden variable to obtain multilingual text comprises:
at each decoding moment, determining a text prediction vector at the current decoding moment according to the mean of the feature vectors of all target entries, the feature vector of the target word list, the global planning hidden variable, and the text prediction vector at the previous decoding moment;
predicting, based on the text prediction vector at the current decoding moment, the probability of each word in the multilingual dictionary being the word generated at the current decoding moment;
and determining the word generated at the current decoding moment according to the predicted probabilities.
11. The multilingual text-generating method of claim 1, wherein the process of creating the multilingual text-generating model comprises:
generating a multilingual text based on the multilingual word list by using the generation network in a generative adversarial network as the multilingual text generation model, the generated text serving as the multilingual generated text;
inputting the multilingual generated text into the discrimination network in the generative adversarial network to obtain the probability that the multilingual generated text is real text;
determining a language diversity indicator value corresponding to the multilingual generated text, wherein the language diversity indicator value represents the language diversity of the corresponding text;
and updating parameters of the multilingual text generation model according to the probability that the multilingual generated text is real text and the language diversity indicator value corresponding to the multilingual generated text.
12. The method of claim 11, wherein the determining the language diversity indicator value corresponding to the multilingual generated text comprises:
calculating the mean of the representation vectors of the languages to which the target words in the multilingual generated text belong as an average language representation vector, wherein the target words are words in entries of the multilingual word list that participate in text generation;
and determining the language diversity indicator value corresponding to the multilingual generated text according to the number of target words in the multilingual generated text, the representation vectors of the languages to which the target words belong, and the average language representation vector.
13. A multilingual text-generating apparatus, comprising: the system comprises a multilingual word list acquisition module and a multilingual text generation module;
the multilingual word list acquisition module is used for acquiring a multilingual word list, wherein the multilingual word list comprises a plurality of entries, and each entry comprises a word and language information of the word;
the multilingual text generation module is used for generating multilingual text based on the multilingual word list by using a pre-established multilingual text generation model, wherein the multilingual text generation model takes generating multilingual text that conforms to the characteristics of real multilingual text as its generation target.
14. A multilingual text generating apparatus, comprising: a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, implementing the steps of the multilingual text generation method according to any one of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the multilingual text generation method according to any one of claims 1 to 12.
CN202111033454.5A 2021-09-03 2021-09-03 Multilingual text generation method, device, equipment and storage medium Pending CN113743089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111033454.5A CN113743089A (en) 2021-09-03 2021-09-03 Multilingual text generation method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113743089A true CN113743089A (en) 2021-12-03

Family

ID=78735610


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130041652A1 (en) * 2006-10-10 2013-02-14 Abbyy Infopoisk Llc Cross-language text clustering
CN108536670A (en) * 2017-03-02 2018-09-14 公立大学法人首都大学东京 Output statement generating means, methods and procedures
CN110378409A (en) * 2019-07-15 2019-10-25 昆明理工大学 It is a kind of based on element association attention mechanism the Chinese get over news documents abstraction generating method
US20200117712A1 (en) * 2018-10-12 2020-04-16 Siemens Healthcare Gmbh Sentence generation
CN111026319A (en) * 2019-12-26 2020-04-17 腾讯科技(深圳)有限公司 Intelligent text processing method and device, electronic equipment and storage medium
CN112464667A (en) * 2020-11-18 2021-03-09 北京华彬立成科技有限公司 Text entity identification method and device, electronic equipment and storage medium
CN112699688A (en) * 2021-01-08 2021-04-23 北京理工大学 Text generation method and system with controllable discourse relation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SRUTI SRUBA BHARALI et al.: "Speaker identification using vector quantization and I-vector with reference to Assamese language", 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pages 164-168 *
GUO Fei: "Research on key technologies of cross-language text similarity detection based on word vectors", China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 1, pages 138-304 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination