CN111680494B - Similar text generation method and device

Similar text generation method and device

Info

Publication number
CN111680494B
Authority
CN
China
Prior art keywords
vector
text
word
preset
spliced
Prior art date
Legal status
Active
Application number
CN202010341544.XA
Other languages
Chinese (zh)
Other versions
CN111680494A
Inventor
骆加维
吴信朝
龚连银
周宝
陈远旭
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
2020-04-27
Filing date
2020-04-27
Publication date
2023-05-12
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010341544.XA
Publication of CN111680494A
Priority to PCT/CN2020/117946 (WO2021218015A1)
Application granted
Publication of CN111680494B

Classifications

    • G06F40/216 — Natural language analysis; parsing using statistical methods
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Natural language analysis; semantic analysis
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods


Abstract

The invention discloses a method and a device for generating similar text, relates to the technical field of semantic analysis, and aims to solve the prior-art problem that the actual semantics of a generated similar text often differ from those of the initial text. The method mainly comprises the following steps: acquiring the text word segmentation obtained by segmenting the initial text; searching the text word vector of each text word according to a preset word vector algorithm; splicing the text word vector with the relative position vector of the text word vector to generate a spliced vector; inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text; and inputting the characterization word vector set into a preset decoder to resolve the similar text of the initial text. The invention is mainly applied to natural language processing. In addition, the invention also relates to blockchain technology; the spliced vector can be stored in a blockchain node.

Description

Similar text generation method and device
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a method and a device for generating similar texts.
Background
With the continuous development of artificial intelligence, man-machine interaction systems are being applied more and more widely. When such a system is used, the text information input by a user, or the text information obtained by voice conversion, may not convey the meaning the user actually intends. To avoid misinterpreting user input, the input is often converted into several accurate expressions by training in a bilingual or multilingual environment. Bilingual translation models, however, suffer from grammatical and semantic bias and from text alignment problems.
In the prior art, a current similar text of the initial text is calculated by a first neural network model, and the discrimination probability of the initial text against the current similar text is then calculated by a second neural network model. Whether this discrimination probability equals a preset probability value is then judged; if not, the first neural network model is optimized according to a preset model optimization strategy and the current similar text is recalculated with the optimized model. The judgment is repeated in a loop, and once the calculated discrimination probability equals the preset value, the current similar text is taken as the target similar text.
In the course of this research, the inventor found that when the prior art calculates similar text with this neural network method, the discrimination rests mainly on the model parameters of the first and second neural network models. Those parameters are obtained from training data, so the calculated similar text depends heavily on the training data and only weakly on the initial text itself, which easily causes the actual semantics of the similar text to diverge from those of the initial text.
Disclosure of Invention
In view of this, the invention provides a method and a device for generating similar text, with the main aim of solving the prior-art problem that the actual semantics of a generated similar text differ from those of the initial text.
According to one aspect of the present invention, there is provided a method for generating a similar text, including:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and inputting the characterization word vector set into a preset decoder, and resolving the similar text of the initial text.
According to another aspect of the present invention, there is provided a similar text generating apparatus including:
the acquisition module is used for acquiring text segmentation of the initial text;
the searching module is used for searching the text word vector of the text word according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and the resolving module is used for inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
According to still another aspect of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the method of generating similar text as described above.
According to still another aspect of the present invention, there is provided a computer apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the generation method of the similar text.
By means of the technical scheme, the technical scheme provided by the embodiment of the invention has at least the following advantages:
the invention provides a method and a device for generating a similar text, which are characterized in that firstly, text word segmentation of an initial text is obtained, then text word vectors of the text word segmentation are searched according to a preset word vector algorithm, then relative position vectors of the text word vectors and the text word vectors are spliced to generate spliced vectors, the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text, and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention generates the characteristic word vector combination of the initial text by taking the spliced vector of the relative position vector and the text word vector as input and presetting the encoder, wherein each text word has a context relation by the relative position vector, so that the position information contained in words of different segments in the same long sentence is the same, the relevance of the context is improved, and the semantic similarity of the similar text and the initial text is further improved.
The foregoing is merely an overview of the technical solution of the invention. In order that the technical means of the invention may be understood more clearly and implemented according to the contents of the description, and in order to make the above and other objects, features and advantages of the invention more apparent, preferred embodiments are described below in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flowchart of a method for generating similar text according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for generating similar text according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the constitution of a device for generating a similar text according to an embodiment of the present invention;
FIG. 4 is a block diagram showing another similar text generating apparatus according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for generating similar texts, which is shown in fig. 1 and comprises the following steps:
101. Acquiring text word segmentation of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the input are usually what is needed for question answering, recommendation or search. The initial text refers to the text input by the user or the text obtained after voice conversion. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. The initial text may be segmented with a character-string-matching method, an understanding-based method or a statistical method; the embodiment of the invention does not limit which word segmentation method is adopted.
102. According to a preset word vector algorithm, searching a text word vector of the text word segmentation.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vec algorithm, or the like. word2vec is a method that trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; training is accelerated by two means, hierarchical softmax and negative sampling. The preset word2vec algorithm here is an already-trained model, so the text word vectors of the text words can be looked up directly.
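As a minimal sketch of this lookup step (assuming a gensim word2vec model already trained and saved; the file name and tokens below are hypothetical, not taken from the patent):

```python
# Hypothetical sketch: direct lookup of text word vectors in a trained word2vec model.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec_zh.kv")           # illustrative path only
tokens = ["人机", "交互", "系统"]                   # text word segmentation of the initial text
word_vectors = [wv[t] for t in tokens if t in wv]  # the "search" is a direct table lookup
```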
103. Splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector.
Each text word vector can be identified by its relative position or its absolute position in the initial text. With absolute positions, words in different segments of the same long sentence would carry the same position information, whereas in practice that information should differ; the present application therefore uses relative positions to distinguish the text word vectors effectively. The relative position vectors form a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so splicing is performed by direct addition under the ordinary rules of matrix arithmetic.
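A minimal sketch of the splicing step, assuming a learned embedding table for clipped relative distances and element-wise addition; reducing the i-by-j relative-position matrix to one vector per word by averaging is an assumption made here for illustration:

```python
import numpy as np

def splice(word_vecs: np.ndarray, max_dist: int = 8, seed: int = 0) -> np.ndarray:
    """word_vecs: (T, d) text word vectors. Returns the (T, d) spliced vectors."""
    T, d = word_vecs.shape
    rel_table = np.random.default_rng(seed).normal(0.0, 0.02, size=(2 * max_dist + 1, d))
    # Entry (i, j) identifies the relative position between the i-th and j-th words.
    dist = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None], -max_dist, max_dist)
    rel = rel_table[dist + max_dist]            # (T, T, d) relative position vectors
    # Same dimension as the word vectors, so splicing is direct addition.
    return word_vecs + rel.mean(axis=1)

spliced = splice(np.zeros((5, 16)))             # toy input: 5 words, dimension 16
```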
104. Inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text.
The function of the preset encoder is to transform an input sequence of indefinite length into a variable of definite length; this is usually implemented with a recurrent neural network. In other words, the spliced vector is converted into a synonymous characterization word vector set, where the characterization word vector set refers to a set of word vector tensors in different high-dimensional spaces that express the same intention as the initial text words. The preset encoder may adopt a deep neural network, a recursive variational network, a sum-product network, or the like; the embodiment of the present application does not limit the specific method adopted by the preset encoder.
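As an illustrative sketch only (the patent does not fix the network type), a recurrent encoder that maps a variable-length spliced sequence to a fixed-length state could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Sketch: a GRU as the "preset encoder", mapping an indefinite-length input
# sequence of spliced vectors to a definite-length hidden state.
encoder = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
spliced = torch.randn(1, 5, 16)     # (batch, sequence length T, vector dimension)
outputs, h_n = encoder(spliced)     # h_n: (1, 1, 32) fixed-length variable
```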
The aim of the invention is to output a rich and varied set of texts without changing the meaning of the text, so as to complete a re-description of the initial text and to collect a large amount of similar text data that can be used in natural language processing tasks requiring supervised learning, such as text summarization and machine translation.
105. Inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
The function of the preset decoder is the inverse of that of the preset encoder: it converts the fixed-length variable back into an output sequence of indefinite length. The preset decoder is designed according to the downstream task, and downstream tasks fall into two categories, generative tasks and sequence tasks. Illustratively, machine translation is a generative task and synonym determination is a sequence task. Taking the characterization word vector set as input, the similar text is output through the resolving of the preset decoder.
The invention provides a method for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
The embodiment of the invention provides another method for generating similar texts, which is shown in fig. 2 and comprises the following steps:
201. Acquiring text word segmentation of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the input are usually what is needed for question answering, recommendation or search. The initial text refers to the text input by the user or the text obtained after voice conversion. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification, and the initial text may be segmented as follows: input the initial text into a preset Jieba word segmentation model, and obtain the text word segmentation output by the Jieba word segmentation model.
Chinese word segmentation of a sentence in Jieba comprises: efficient word-graph scanning based on a Trie structure, generating a directed acyclic graph of all possible word formations of the Chinese characters in the sentence; dynamic programming to search the maximum-probability path and find the maximum segmentation combination based on word frequency; and, for unregistered words, an HMM model based on the word-forming capability of Chinese characters, solved with the Viterbi algorithm. The initial text is segmented after loading and adjusting the dictionary, and keywords can then be extracted with the TF-IDF algorithm or the TextRank algorithm.
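A minimal usage sketch of the Jieba steps just described (assuming the jieba package; the sample sentence and user dictionary file are illustrative):

```python
import jieba
import jieba.analyse

jieba.load_userdict("user_dict.txt")     # load/adjust the dictionary (illustrative file)
sentence = "人机交互系统的应用越来越广泛"
tokens = jieba.lcut(sentence)            # DAG + max-probability path, HMM for unregistered words
kw_tfidf = jieba.analyse.extract_tags(sentence, topK=3)   # keywords via TF-IDF
kw_rank = jieba.analyse.textrank(sentence, topK=3)        # keywords via TextRank
print(tokens, kw_tfidf, kw_rank)
```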
202. According to a preset word vector algorithm, searching a text word vector of the text word segmentation.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vec algorithm, or the like. word2vec is a method that trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; training is accelerated by two means, hierarchical softmax and negative sampling. The preset word2vec algorithm here is an already-trained model, so the text word vectors of the text words can be looked up directly.
203. Splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector.
Each text word vector can be identified by its relative position or its absolute position in the initial text. With absolute positions, words in different segments of the same long sentence would carry the same position information, whereas in practice that information should differ; the present application therefore uses relative positions to distinguish the text word vectors effectively. The relative position vectors form a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so splicing is performed by direct addition under the ordinary rules of matrix arithmetic.
204. Calculating the factorization vector of the spliced vector according to the word order probability of the spliced vector.
For a better understanding of the scheme, word order probability is illustrated here. For a given sequence x of length T, there are T! possible arrangements in total, corresponding to T! chain decompositions. Assuming the spliced vector x = x1 x2 x3, there are 3! = 6 decompositions in total, where p(x2|x1 x3) refers to the probability that the second word is x2 under the condition that the first word is x1 and the third word is x3, that is, the original word order is maintained. The T! decompositions are traversed with shared model parameters, so that context can be learned while the factorization vector is extracted. An ordinary left-to-right or right-to-left language model can only learn dependency in one direction, for example "guessing" the first word, then "guessing" the second word from the first word, and "guessing" the third word from the first two. A permutation language model, by contrast, learns word order probabilities of various orders, such as p(x) = p(x1|x3) p(x2|x1 x3) p(x3), which corresponds to the order 3→1→2: the third word is "guessed" first, then the first word is guessed from the third word, and finally the second word is guessed from the first and third words. If the context dependency relationship is consistent with the text order, text in the same order has a unique meaning, and the likelihood that its similar text can be obtained from that unique meaning is great; the factorization vector of the spliced vector is therefore calculated with the word order probability.
Calculating the factorization vector of the spliced vector specifically comprises: calculating the word order probability of the initial text according to the spliced vector, where the word order probability refers to the conditional probability of each arrangement of the text words within the full permutation, the condition being that all words arranged before the current word in that arrangement have already occurred; determining the arrangement of the text words corresponding to the maximum word order probability as the word segmentation semantic order; and merging adjacent word segmentation vectors (vector elements of the spliced vector corresponding to text words that are sequentially adjacent in the word segmentation semantic order) to generate the factorization vector of the spliced vector.
Assume that the initial text includes 5 text words x1, x2, x3, x4, x5 and that the corresponding spliced vector includes 5 vector elements A1, A2, A3, A4, A5. The full permutation of the text words includes 5! = 120 arrangements, among which the order with the highest word order probability is x3, x1, x2, x4, x5, calculated as p = p(x1|x3) p(x2|x1 x3) p(x3) p(x4|x1 x2 x3) p(x5|x1 x2 x3 x4). The word segmentation semantic order is therefore x3, x1, x2, x4, x5, in which x1 and x2, and likewise x4 and x5, are sequentially adjacent word texts, so the corresponding vector elements A1 and A2 of the spliced vector are adjacent word segmentation vectors, as are A4 and A5. A1 and A2 are merged into B1, A4 and A5 are merged into B2, and the factorization vector of the spliced vector is A3, B1, B2. This reduces the dimension of the spliced vector, which shrinks the data size and improves training and calculation speed. If the elements of the spliced vector are numbered sequentially, adjacent word segmentation vectors can be searched as follows: obtain, in a preset order, the position identifier of a first element at any position of the word segmentation semantic order within the spliced vector, together with the position identifier of the second element adjacent to that first element; perform a self-increasing step operation on the first position identifier to obtain a predicted position identifier, the self-increasing step being the numbering interval of the sequential numbering of the spliced vector; if the predicted position identifier differs from the second position identifier, acquire a new first element; if the predicted position identifier is identical to the second position identifier, determine that the first element and the second element are adjacent word segmentation vectors, redefine the second position identifier as the first position identifier, take the next element of the word segmentation semantic order as the second element, and repeat the above steps until all adjacent word segmentation vectors in the spliced vector have been found. An adjacent word segmentation vector may comprise two, three, four or more elements; the embodiment of the invention does not limit the number of elements an adjacent word segmentation vector contains.
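The merging step can be sketched as follows; runs of words whose original indices are consecutive within the semantic order are merged, and averaging is used as the (assumed) merge operation, since the patent does not fix one:

```python
import numpy as np

def factorization_vector(spliced: np.ndarray, semantic_order: list) -> list:
    """spliced: (T, d) spliced vector; semantic_order: 0-based token indices in
    semantic order, e.g. [2, 0, 1, 3, 4] for x3, x1, x2, x4, x5."""
    runs, run = [], [semantic_order[0]]
    for i in semantic_order[1:]:
        if i == run[-1] + 1:        # sequentially adjacent in the original text
            run.append(i)
        else:
            runs.append(run)
            run = [i]
    runs.append(run)
    # Merge each run into one element (mean is an illustrative choice):
    # [2], [0, 1], [3, 4]  ->  A3, B1 (A1 and A2 merged), B2 (A4 and A5 merged)
    return [spliced[r].mean(axis=0) for r in runs]

fac = factorization_vector(np.arange(10.0).reshape(5, 2), [2, 0, 1, 3, 4])
```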
205. The attention features of the factorized vector are extracted according to a preset self-attention mechanism.
The extraction of the self-attention feature proceeds as follows: the similarity between the query and each key is calculated to obtain a weight, and a softmax function then normalizes the weights; finally, the weights and the corresponding values are weighted and summed to obtain the attention feature, where key and value are the same here, i.e. Key = Value. In this way the intention of the spliced vector is extracted through the factorization vector and the preset self-attention mechanism, yielding distinct text encodings with the same intention.
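A compact sketch of this self-attention step with Key = Value, omitting learned projection matrices (an assumption made for brevity):

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (T, d) factorization vectors; returns (T, d) attention features."""
    q, k, v = x, x, x                                  # key and value are the same
    scores = q @ k.T / np.sqrt(x.shape[1])             # query-key similarity weights
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax normalization
    return w @ v                                       # weighted sum -> attention feature

features = self_attention(np.random.default_rng(1).normal(size=(3, 8)))
```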
206. The factorization vector is randomly sampled based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample.
The method adopts a vector-quantized variational mechanism, and this step obtains a randomly sampled sample of lower dimension. In the prior art, the input is converted into a vector encoding whose latent space may be discontinuous or allow only simple interpolation. In a bilingual machine translation task, the encoder outputs an explicit multidimensional feature tensor, and because of the particularity of the translation task, latent semantic features, grammatical features and text length all affect translation accuracy and repeatability. Here the hidden features are not a fixed multidimensional tensor but obey a certain random distribution; random sampling from that distribution guarantees the richness and diversity of the language, thereby improving translation accuracy and repeatability.
Generating the sampling sample specifically comprises: counting the vector average vector and the vector standard deviation vector of the factorization vector; and randomly sampling the factorization vector according to the vector average vector and the vector standard deviation vector to generate the sampling sample. The data distribution characteristics of the factorization vector are counted and then generalized to output two vectors of the same size, namely the vector average vector and the vector standard deviation vector. Data subject to this constraint is then randomly sampled based on the two vectors; the latent space of the randomly sampled samples is continuous and allows interpolation.
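A sketch of the statistics-then-sampling step; drawing from a Gaussian parameterized by the two statistic vectors is an assumption (the patent only specifies a mean vector and a standard deviation vector):

```python
import numpy as np

def sample_latent(fac: np.ndarray, seed: int = 2) -> np.ndarray:
    """fac: (n, d) factorization vectors; returns one (d,) sampling sample."""
    mu = fac.mean(axis=0)                  # vector average vector
    sigma = fac.std(axis=0) + 1e-8         # vector standard deviation vector
    eps = np.random.default_rng(seed).normal(size=mu.shape)
    return mu + sigma * eps                # random sample obeying the statistics

q_h = sample_latent(np.random.default_rng(3).normal(size=(4, 8)))
```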
Counting the vector average vector and the vector standard deviation vector of the factorization vector specifically comprises: counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and a second probability distribution function according to a second preset probability distribution formula, where the dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and those of the second comprise a second average vector and a second standard deviation vector; calculating the KL divergence of the first and second probability distribution functions; if the KL divergence equals 0, determining that the factorization vector obeys the first or the second probability distribution function, taking the vector average vector to be the first or second average vector and the vector standard deviation vector to be the first or second standard deviation vector; and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
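For illustration, if the two preset probability distributions are taken to be diagonal Gaussians (an assumption, since the patent does not name the formulas), the KL divergence in the procedure above has a closed form:

```python
import numpy as np

def kl_diag_gaussians(mu1, s1, mu2, s2):
    """KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ), all arguments (d,) vectors."""
    return float(np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5))

mu1, s1 = np.zeros(4), np.ones(4)
mu2, s2 = np.full(4, 0.1), np.full(4, 1.2)
kl = kl_diag_gaussians(mu1, s1, mu2, s2)
# kl == 0 means the two fitted distributions coincide, so either mean/std pair
# can be used; otherwise mean and std are refined by minimizing the KL divergence.
```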
After the sampling sample is generated, a residual neural network can further be combined to avoid gradient explosion and gradient vanishing during back propagation: the upper-layer input is added in before the activation of the second linear layer. This reduces the cross entropy of the abstract representation during the decoder's gradient updates and accelerates convergence.
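A sketch of the residual connection described here, with the upper-layer input added before the second linear layer's activation (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(d, d)
        self.fc2 = nn.Linear(d, d)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))
        return self.act(self.fc2(h) + x)   # add the upper-layer input before activation

y = ResidualBlock()(torch.randn(1, 32))
```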
207. Generating a characterization word vector set of the initial text according to the sampling sample and the attention feature.
The characterization word vector set is a set of text encodings that is similar, but not identical, to the sampling sample, generated on the basis of the sampling sample. Specifically, the characterization vector set of the initial text is generated according to a preset dimension adjustment rule, which is characterized as z_h = α·e_h + (1-α)·q_h, where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
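A one-line sketch of the preset dimension adjustment rule (α is learned during training; a fixed value is used here purely for illustration):

```python
import numpy as np

alpha = 0.7                              # learning parameter (illustrative value)
e_h = np.ones(8)                         # attention feature from step 205
q_h = np.zeros(8)                        # random sampling result from step 206
z_h = alpha * e_h + (1 - alpha) * q_h    # characterization word vector set element
```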
Steps 204 to 207 together correspond to step 104 shown in fig. 1, i.e. inputting the spliced vector into the preset encoder to generate the characterization word vector set of the initial text, and may be regarded as an encoding process comprising a factorization layer, a self-attention layer, a vector-quantized variational layer and a fully connected layer. The characterization word vector set of the initial text is obtained through these four layers of calculation. The characterization word vector set refers to a set of word vector tensors in different high-dimensional spaces that express the same intention as the initial text words. The aim of the invention is to output a rich and varied set of texts without changing the meaning of the text, so as to complete a re-description of the initial text and to collect a large amount of similar text data for natural language processing tasks requiring supervised learning, such as text summarization and machine translation.
208. Inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
The function of the preset decoder is the inverse of that of the preset encoder: it converts the fixed-length variable back into an output sequence of indefinite length. The preset decoder is designed according to the downstream task, and downstream tasks fall into two categories, generative tasks and sequence tasks. Illustratively, machine translation is a generative task and synonym determination is a sequence task. Taking the characterization word vector set as input, the similar text is output through the resolving of the preset decoder.
The invention provides a method for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for generating a similar text, as shown in fig. 3, where the device includes:
an obtaining module 31, configured to obtain a text word of an initial text;
the searching module 32 is configured to search a text word vector of the text word according to a preset word vector algorithm;
a first generating module 33, configured to splice the text word vector and a relative position vector of the text word vector, to generate a spliced vector;
a second generating module 34, configured to input the spliced vector into a preset encoder, and generate a token vector set of the initial text;
and a resolving module 35, configured to input the token vector set into a preset decoder, and resolve the similar text of the initial text.
The invention provides a device for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides another apparatus for generating similar text, as shown in fig. 4, where the apparatus includes:
an obtaining module 41, configured to obtain a text word of an initial text;
a searching module 42, configured to search a text word vector of the text word according to a preset word vector algorithm;
a first generating module 43, configured to splice the text word vector and a relative position vector of the text word vector, to generate a spliced vector;
a second generating module 44, configured to input the spliced vector into a preset encoder, and generate a token vector set of the initial text;
and the resolving module 45 is used for inputting the token vector set into a preset decoder to resolve the similar text of the initial text.
Further, the obtaining module 41 includes:
an input unit 411, configured to input the initial text into a preset Jieba word segmentation model;
and an obtaining unit 412, configured to obtain the text word segmentation output by the Jieba word segmentation model.
Further, the second generating module 44 includes:
a calculating unit 441, configured to calculate an factorized vector of the concatenated vector according to a word order probability of the concatenated vector, where the concatenated vector is stored in a blockchain;
it should be emphasized that, to further ensure the privacy and security of the splice vector, the splice vector may also be stored in a blockchain node.
An extracting unit 442, configured to extract an attention feature of the factorization vector according to a preset self-attention mechanism;
a sampling unit 443, configured to randomly sample the factorized vector based on a vector average vector and a vector standard deviation vector of the factorized vector to generate a sampling sample;
and the generating unit 444 is configured to generate a token vector set of the initial text according to the sampling sample and the attention feature.
Further, the computing unit 441 includes:
a calculating subunit 4411, configured to calculate, according to the concatenation vector, a word order probability of the initial text, where the word order probability refers to a conditional probability of each arrangement mode of the text word segmentation in a full arrangement, and an occurrence condition of the conditional probability is that all word segments arranged before a current word segmentation in the arrangement mode all occur;
a determining subunit 4412, configured to determine that an arrangement order of the text word segmentation corresponding to the maximum value of the word order probability is a word segmentation semantic order;
and the generation subunit 4413 is configured to combine adjacent word segmentation vectors, and generate a factorization vector of the stitched vector, where the adjacent word segmentation vectors refer to vector elements in the stitched vector corresponding to text words that are sequentially adjacent in the word segmentation semantic sequence.
Further, the sampling unit 443 includes:
a statistics subunit 4431 for counting vector average vectors and vector standard deviation vectors of the factorization vectors;
and a sampling subunit 4432, configured to randomly sample the factorized vector according to the vector average vector and the vector standard deviation vector to generate a sampling sample.
Further, the statistics subunit 4431 is configured to:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and dependent variables of the second probability distribution function comprise a second average vector and a second standard deviation vector;
calculating KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence is equal to 0, determining that the factorized vector obeys the first probability distribution function or the second probability distribution function, determining that the vector average vector is the first average vector or the second average vector, and determining that the vector standard deviation vector is a first standard deviation vector or a second standard deviation vector;
and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector according to the factorization vector by taking the minimum value of the KL divergence as a target.
Further, the generating unit 444 is configured to:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the characteristics of the preset dimension adjustment rule are described as follows:
z_h = α·e_h + (1-α)·q_h
wherein z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
The invention provides a device for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
According to one embodiment of the present invention, there is provided a computer storage medium storing at least one executable instruction for performing the method of generating similar text in any of the above-described method embodiments.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the specific embodiment of the present invention is not limited to the specific implementation of the computer device.
As shown in fig. 5, the computer device may include: a processor 502, a communication interface (Communications Interface) 504, a memory 506, and a communication bus 508.
Wherein: processor 502, communication interface 504, and memory 506 communicate with each other via communication bus 508.
A communication interface 504 for communicating with network elements of other devices, such as clients or other servers.
The processor 502 is configured to execute the program 510, and may specifically perform relevant steps in the above-described embodiment of the method for generating similar text.
In particular, program 510 may include program code including computer-operating instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computer device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
A memory 506 is used for storing the program 510. The memory 506 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 510 may be specifically operable to cause the processor 502 to:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and inputting the characterization word vector set into a preset decoder, and resolving the similar text of the initial text.
The blockchain referred to in the invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by computing devices, so that they can be stored in a storage device and executed by the computing devices (in some cases the steps shown or described may be performed in a different order than here); or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for generating a similar text, comprising:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
inputting the characterization word vector set into a preset decoder, and resolving similar texts of the initial text;
the step of inputting the spliced vector into a preset encoder to generate the characterization word vector set of the initial text comprises the following steps:
according to the word sequence probability of the spliced vector, calculating the factorization vector of the spliced vector, wherein the spliced vector is stored in a blockchain;
extracting the attention characteristic of the factorization vector according to a preset self-attention mechanism;
randomly sampling the factorization vector based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample;
generating a characterization word vector set of the initial text according to the sampling sample and the attention characteristic;
the calculating the factorization vector of the splicing vector specifically comprises the following steps:
calculating word sequence probability of the initial text according to the splicing vector, wherein the word sequence probability refers to conditional probability of each arrangement mode of the text word segmentation in full arrangement, and the occurrence condition of the conditional probability is that all word segmentation arranged before the current word segmentation in the arrangement mode is completely generated; determining the arrangement sequence of the text word segmentation corresponding to the maximum value of the word sequence probability as a word segmentation semantic sequence; and merging adjacent word segmentation vectors, which refer to vector elements in the spliced vector corresponding to text segmentation which is sequentially adjacent in the word segmentation semantic sequence, to generate a factorization vector of the spliced vector.
2. The method of claim 1, wherein the obtaining text segmentation of the initial text comprises:
inputting the initial text into a preset Jieba word segmentation model;
and obtaining the text word segmentation output by the Jieba word segmentation model.
3. The method of claim 1, wherein the calculating the factorized vector of the concatenated vector based on the word order probability of the concatenated vector comprises:
calculating word sequence probability of the initial text according to the splicing vector, wherein the word sequence probability refers to conditional probability of each arrangement mode of the text word segmentation in full arrangement, and the occurrence condition of the conditional probability is that all word segmentation arranged before the current word segmentation in the arrangement mode is completely generated;
determining the arrangement sequence of the text word segmentation corresponding to the maximum value of the word sequence probability as a word segmentation semantic sequence;
and merging adjacent word segmentation vectors, which refer to vector elements in the spliced vector corresponding to text segmentation which is sequentially adjacent in the word segmentation semantic sequence, to generate a factorization vector of the spliced vector.
4. The method of claim 1, wherein the randomly sampling the factorized vector based on a vector average vector and a vector standard deviation vector of the factorized vector to generate sampled samples comprises:
counting vector average vectors and vector standard deviation vectors of the factorization vectors;
and randomly sampling the factorization vector according to the vector average vector and the vector standard deviation vector to generate a sampling sample.
5. The method of claim 4, wherein said counting vector average vectors and vector standard deviation vectors of said factorized vectors comprises:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and dependent variables of the second probability distribution function comprise a second average vector and a second standard deviation vector;
calculating KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence is equal to 0, determining that the factorized vector obeys the first probability distribution function or the second probability distribution function, determining that the vector average vector is the first average vector or the second average vector, and determining that the vector standard deviation vector is a first standard deviation vector or a second standard deviation vector;
and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector according to the factorization vector by taking the minimum value of the KL divergence as a target.
6. The method of claim 1, wherein the generating the set of token vectors for the initial text from the sampled samples and attention features comprises:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the characteristics of the preset dimension adjustment rule are described as follows:
z_h = α·e_h + (1-α)·q_h
wherein z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
7. A similar text generating apparatus, comprising:
the acquisition module is used for acquiring text segmentation of the initial text;
the searching module is used for searching the text word vector of the text word according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
the resolving module is used for inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text;
wherein the second generating module includes:
a calculation unit, configured to calculate a factorization vector of the concatenation vector according to a word order probability of the concatenation vector, where the concatenation vector is stored in a blockchain;
the extraction unit is used for extracting the attention characteristic of the factorization vector according to a preset self-attention mechanism;
the sampling unit is used for randomly sampling the factorization vector based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample;
the generating unit is used for generating a characterization word vector set of the initial text according to the sampling sample and the attention characteristic;
wherein the computing unit includes:
a calculating subunit, configured to calculate, according to the spliced vector, the word order probability of the initial text, wherein the word order probability refers to the conditional probability of each arrangement of the text word segmentations in the full permutation, the condition being that all word segmentations arranged before the current word segmentation in that arrangement have already occurred;
the determining subunit is used for determining the arrangement of the text word segmentations corresponding to the maximum value of the word order probability as the word segmentation semantic sequence;
and the generating subunit is used for merging adjacent word segmentation vectors, namely the vector elements of the spliced vector corresponding to text word segmentations that are adjacent in the word segmentation semantic sequence, to generate the factorization vector of the spliced vector.
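Read together, the calculating, determining and generating subunits amount to: score every permutation of the segmented words by the product of prefix-conditioned probabilities, keep the best-scoring order, then merge the spliced vectors of now-adjacent words. In the minimal sketch below, `cond_prob` is a hypothetical stand-in for whatever model supplies the conditional probabilities, averaging is a placeholder for the unspecified merge operation, and the exhaustive permutation scoring is for clarity only (it is factorial in the word count).

```python
import itertools
import numpy as np

def best_word_order(words, cond_prob):
    """Score each permutation in the full arrangement by the product of
    conditional probabilities P(word | words placed before it) and return
    the highest-scoring order (the word segmentation semantic sequence)."""
    best_order, best_p = None, -1.0
    for perm in itertools.permutations(words):
        p = 1.0
        for i, w in enumerate(perm):
            p *= cond_prob(w, perm[:i])  # condition: the whole prefix occurred
        if p > best_p:
            best_order, best_p = list(perm), p
    return best_order

def factorize(order, spliced):
    """Merge the spliced vectors of words adjacent in the semantic order;
    averaging here is a placeholder for the patent's unspecified merge."""
    vecs = [spliced[w] for w in order]
    return [(vecs[i] + vecs[i + 1]) / 2.0 for i in range(len(vecs) - 1)]

# Toy usage with a hypothetical conditional-probability model.
probs = {("b", ("a",)): 0.9, ("c", ("a", "b")): 0.8}
cond_prob = lambda w, prefix: probs.get((w, prefix), 0.1)
spliced = {w: np.ones(4) * i for i, w in enumerate(["a", "b", "c"])}
order = best_word_order(["a", "b", "c"], cond_prob)   # -> ["a", "b", "c"]
factored = factorize(order, spliced)
```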
8. A computer storage medium having stored therein at least one executable instruction, the executable instruction causing a processor to perform operations corresponding to the similar text generation method according to any one of claims 1 to 6.
9. A computer device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the similar text generation method according to any one of claims 1 to 6.
CN202010341544.XA 2020-04-27 2020-04-27 Similar text generation method and device Active CN111680494B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device
PCT/CN2020/117946 WO2021218015A1 (en) 2020-04-27 2020-09-25 Method and device for generating similar text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Publications (2)

Publication Number Publication Date
CN111680494A CN111680494A (en) 2020-09-18
CN111680494B true CN111680494B (en) 2023-05-12

Family

ID=72452258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341544.XA Active CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Country Status (2)

Country Link
CN (1) CN111680494B (en)
WO (1) WO2021218015A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device
CN112395385B (en) * 2020-11-17 2023-07-25 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN114338129B (en) * 2021-12-24 2023-10-31 中汽创智科技有限公司 Message anomaly detection method, device, equipment and medium
CN114357974B (en) * 2021-12-28 2022-09-23 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium
CN114936548A (en) * 2022-03-22 2022-08-23 北京探境科技有限公司 Method, device, equipment and storage medium for generating similar command texts
CN114742029B (en) * 2022-04-20 2022-12-16 中国传媒大学 Chinese text comparison method, storage medium and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399454A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of text code representation method based on transformer model and more reference systems

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802888B (en) * 2017-01-12 2020-01-24 北京航空航天大学 Word vector training method and device
GB201706047D0 (en) * 2017-04-14 2017-05-31 Digital Genius Ltd Automated tagging of text
JP6976155B2 (en) * 2017-12-18 2021-12-08 ヤフー株式会社 Similar text extractor, automatic response system, similar text extraction method, and program
KR20200015418A (en) * 2018-08-02 2020-02-12 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
CN109145315B (en) * 2018-09-05 2022-03-18 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment
CN110147535A (en) * 2019-04-18 2019-08-20 平安科技(深圳)有限公司 Similar Text generation method, device, equipment and storage medium
CN110110045B (en) * 2019-04-26 2021-08-31 腾讯科技(深圳)有限公司 Method, device and storage medium for retrieving similar texts
CN110209801B (en) * 2019-05-15 2021-05-14 华南理工大学 Text abstract automatic generation method based on self-attention network
CN110135507A (en) * 2019-05-21 2019-08-16 西南石油大学 A kind of label distribution forecasting method and device
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
CN110362684B (en) * 2019-06-27 2022-10-25 腾讯科技(深圳)有限公司 Text classification method and device and computer equipment
CN110619127B (en) * 2019-08-29 2020-06-09 内蒙古工业大学 Mongolian Chinese machine translation method based on neural network turing machine
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device

Also Published As

Publication number Publication date
WO2021218015A1 (en) 2021-11-04
CN111680494A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111680494B (en) Similar text generation method and device
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN106502985B (en) neural network modeling method and device for generating titles
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN112464641A (en) BERT-based machine reading understanding method, device, equipment and storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN113691542A (en) Web attack detection method based on HTTP request text and related equipment
CN114064117A (en) Code clone detection method and system based on byte code and neural network
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN115587594A (en) Network security unstructured text data extraction model training method and system
WO2023155304A1 (en) Keyword recommendation model training method and apparatus, keyword recommendation method and apparatus, device, and medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN115374845A (en) Commodity information reasoning method and device
CN116661805A (en) Code representation generation method and device, storage medium and electronic equipment
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN113836929A (en) Named entity recognition method, device, equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant