CN111680494A - Similar text generation method and device - Google Patents

Similar text generation method and device

Info

Publication number
CN111680494A
Authority
CN
China
Prior art keywords
vector
text
word
preset
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010341544.XA
Other languages
Chinese (zh)
Other versions
CN111680494B (en)
Inventor
骆加维
吴信朝
龚连银
周宝
陈远旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010341544.XA priority Critical patent/CN111680494B/en
Publication of CN111680494A publication Critical patent/CN111680494A/en
Priority to PCT/CN2020/117946 priority patent/WO2021218015A1/en
Application granted granted Critical
Publication of CN111680494B publication Critical patent/CN111680494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method and a device for generating similar text, relates to the technical field of semantic analysis, and aims to solve the prior-art problem that the actual semantics of a similar text and its initial text are not completely the same. The method mainly comprises the following steps: acquiring the text participles obtained by segmenting an initial text; searching text word vectors of the text participles according to a preset word vector algorithm; splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector; inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text; and inputting the characteristic word vector set into a preset decoder to resolve the similar text of the initial text. The invention is mainly applied to natural language processing. In addition, the invention also relates to blockchain technology, and the spliced vector can be stored in a blockchain node.

Description

Similar text generation method and device
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a method and a device for generating similar texts.
Background
With the continuous development of artificial intelligence, human-computer interaction systems are applied more and more widely. When such a system is used, the text a user types, or the text obtained by converting the user's voice, may not express what the user actually means. To avoid misinterpreting the user's input, the input is often converted into several accurate expressions by training in a bilingual or multilingual environment; however, bilingual translation models encounter problems of semantic bias and text alignment.
In the prior art, a current similar text of an initial text is calculated by a first neural network model, and the current discrimination probability of the initial text and the current similar text is calculated by a second neural network model. Whether the current discrimination probability equals a preset probability value is then judged; if not, the first neural network model is optimized according to a preset model optimization strategy and the current similar text is recalculated with the optimized first neural network model. This loop repeats until the calculated discrimination probability equals the preset probability value, at which point the current similar text is taken as the target similar text.
The inventor found in research that the prior-art scheme calculates the similar text with neural networks, so the discrimination depends mainly on the model parameters of the first and second neural network models. Because those parameters are obtained from training data, the calculated similar text depends heavily on the training data and comparatively little on the initial text, which easily results in the actual semantics of the similar text and the initial text not being completely the same.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for generating similar text, mainly aiming to solve the prior-art problem that the actual semantics of the similar text and the initial text are not completely the same.
According to an aspect of the present invention, there is provided a method for generating similar texts, including:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
According to another aspect of the present invention, there is provided a similar text generation apparatus, including:
the acquisition module is used for acquiring text participles of the initial text;
the searching module is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the splicing vector into a preset encoder to generate a characteristic word vector set of the initial text;
and the resolving module is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
According to still another aspect of the present invention, a computer storage medium is provided, and at least one executable instruction is stored in the computer storage medium, and the executable instruction causes a processor to execute operations corresponding to the generation method of the similar text.
According to still another aspect of the present invention, there is provided a computer apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the generation method of the similar text.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
the invention provides a method and a device for generating similar texts, which comprises the steps of firstly obtaining text participles of an initial text, then searching text word vectors of the text participles according to a preset word vector algorithm, splicing the text word vectors and relative position vectors of the text word vectors to generate spliced vectors, inputting the spliced vectors into a preset encoder to generate a characteristic word vector set of the initial text, and finally inputting the characteristic word vector set into a preset decoder to solve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the splicing vector of the relative position vector and the text word vector as input, and generates the combination of the representation word vector of the initial text through the preset encoder, wherein the relative position vector enables each text participle to have a 'context' relationship, so that the position information contained in different segmented words in the same long sentence is the same, the relevance of the contexts is improved, and the semantic similarity of the similar text and the initial text is further improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart illustrating a method for generating a similar text according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for generating similar texts according to the embodiment of the present invention;
FIG. 3 is a block diagram illustrating a similar text generating apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of another similar text generation apparatus provided in the embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for generating a similar text, which comprises the following steps as shown in figure 1:
101. Acquiring text participles of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the text or voice are usually required for question answering, recommendation or search. The initial text refers to the text entered by the user or the text obtained by voice conversion. Word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard. The initial text may be segmented with a string-matching-based method, an understanding-based method or a statistics-based method; the embodiment of the invention does not limit the word segmentation method adopted.
102. Searching text word vectors of the text participles according to a preset word vector algorithm.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm and the like. The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; hierarchical softmax and negative sampling are adopted to accelerate the training of the word2vector algorithm. The preset word2vector algorithm is a trained model algorithm, through which the text word vectors of the text participles can be looked up directly.
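As a minimal illustration of this lookup, the sketch below trains a small word2vec model with the open-source gensim library and reads a text word vector directly from it; gensim, the toy corpus and all parameter values are assumptions for illustration, not part of the patent.

```python
# Minimal sketch: look up text word vectors from a trained word2vec model.
# gensim and the toy corpus are assumptions; the patent does not fix a library.
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"],
          ["天气", "预报", "说", "明天", "下雨"]]

# hs=1 enables hierarchical softmax and negative=5 enables negative sampling,
# the two training-acceleration modes mentioned above.
model = Word2Vec(corpus, vector_size=64, window=5, min_count=1,
                 sg=1, hs=1, negative=5, epochs=50)

vec = model.wv["天气"]   # direct lookup of a participle's text word vector
print(vec.shape)         # (64,)
```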
103. Splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector.
Each text word vector may be identified by its relative or absolute position in the initial text. With absolute positions, participles in different segments of the same long sentence can carry the same position information even though their actual positions differ, so the application adopts relative positions to distinguish each text word vector effectively. The relative position vector is a vector matrix whose element in row i and column j identifies the relative position from the i-th word to the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimensionality, so splicing is performed by adding the vectors directly according to the rules of matrix arithmetic.
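A minimal sketch of this splicing step follows; numpy and the toy sizes are assumptions, and since the two vectors share the same dimensionality, "splicing" here is the direct element-wise addition described above.

```python
# Minimal sketch: splice text word vectors with their relative position vectors.
# numpy and the toy sizes are assumptions for illustration.
import numpy as np

seq_len, dim = 5, 64
word_vectors = np.random.randn(seq_len, dim)      # one row per text participle

# One relative position vector per text word vector, with the same
# dimensionality (learned in practice; random here purely for illustration).
rel_pos_vectors = np.random.randn(seq_len, dim)

# "Splicing": direct addition under the ordinary rules of matrix arithmetic.
spliced = word_vectors + rel_pos_vectors
print(spliced.shape)  # (5, 64)
```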
104. Inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text.
The preset encoder converts an input sequence of indefinite length into a variable of definite length and is usually realized by a recurrent neural network. That is, the spliced vector is converted into a synonymous token word vector set, where the token word vector set refers to a set of token vector tensors that have the same intent as the original text words but express different high-dimensional spaces. The preset encoder may adopt a deep neural network, a recursive variational structure, a product network and the like; the embodiment of the application does not limit the specific method adopted by the preset encoder.
The invention aims to output a rich and diverse set of texts without changing the text meaning, so as to complete text paraphrase of the initial text and to collect a large amount of similar-text data for supervised-learning tasks in natural language processing such as text summarization and machine translation.
105. Inputting the token word vector set into a preset decoder and resolving the similar text of the initial text.
The role of the preset decoder is the inverse of the preset encoder: it converts fixed-length variables into an output sequence of indefinite length. The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. Illustratively, machine translation is a generative task, while synonym judgment is a sequence task. The token word vector set is taken as input, resolved by the preset decoder, and the similar text is output.
The invention provides a method for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
An embodiment of the present invention provides another method for generating a similar text, as shown in fig. 2, the method includes:
201. Acquiring text participles of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the text or voice are usually required for question answering, recommendation or search. The initial text refers to the text entered by the user or the text obtained by voice conversion. Word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard, and the initial text may be segmented by the following steps: inputting the initial text into a preset jieba word segmentation model; and acquiring the text participles output by the jieba word segmentation model.
The jieba Chinese word segmentation realizes efficient word-graph scanning based on a Trie tree structure and generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence; dynamic programming is adopted to search the maximum-probability path and find the maximum segmentation combination based on word frequency; and for unknown words, an HMM model based on the word-forming capability of Chinese characters is adopted together with the Viterbi algorithm. The initial text is segmented after loading and adjusting the dictionary, and keywords can then be extracted based on the TF-IDF algorithm or the TextRank algorithm.
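A minimal sketch of this segmentation step with the open-source jieba library follows; the example sentence and the added dictionary word are illustrative assumptions.

```python
# Minimal sketch: segment the initial text and extract keywords with jieba.
# The sentence and the added word are illustrative assumptions.
import jieba
import jieba.analyse

jieba.add_word("人机交互")   # adjust the dictionary (a user dictionary file
                             # could also be loaded via jieba.load_userdict)
text = "人机交互系统把用户输入的文本转换成多个准确的表达"

participles = jieba.lcut(text)   # text participles of the initial text
print(participles)

print(jieba.analyse.extract_tags(text, topK=3))  # TF-IDF keyword extraction
print(jieba.analyse.textrank(text, topK=3))      # TextRank keyword extraction
```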
202. Searching text word vectors of the text participles according to a preset word vector algorithm.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm and the like. The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; hierarchical softmax and negative sampling are adopted to accelerate the training of the word2vector algorithm. The preset word2vector algorithm is a trained model algorithm, through which the text word vectors of the text participles can be looked up directly.
203. Splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector.
Each text word vector may be identified by its relative or absolute position in the initial text. With absolute positions, participles in different segments of the same long sentence can carry the same position information even though their actual positions differ, so the application adopts relative positions to distinguish each text word vector effectively. The relative position vector is a vector matrix whose element in row i and column j identifies the relative position from the i-th word to the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimensionality, so splicing is performed by adding the vectors directly according to the rules of matrix arithmetic.
204. Calculating the factorization vector of the spliced vector according to the word order probability of the spliced vector.
For a better understanding of the present scheme, the word order probability is illustrated as follows. Given a sequence x of length T, there are T! permutations in total, corresponding to T! chain decompositions. Assuming the spliced vector is x = x1x2x3, there are 3! = 6 decompositions in total, where p(x2|x1x3) refers to the probability that the second word is x2 under the condition that the first word is x1 and the third word is x3; that is, the original word order is preserved. By traversing all T! decompositions with shared model parameters, the context relationship can be learned while extracting the factorization vector. An ordinary left-to-right or right-to-left language model can only learn dependency in one direction: for example, first "guess" a word, then "guess" the second word based on the first, and the third based on the first two. The permutation language model instead learns word order probabilities in various orders; for example, p(x) = p(x1|x3)p(x2|x1x3)p(x3) corresponds to the order 3 → 1 → 2: "guess" the third word first, then guess the first word based on the third, and finally guess the second word based on the first and third. If the context dependency is consistent with the text order, text in that order has a unique meaning, and a similar text obtained from that unique meaning is highly likely to be correct; the factorization vector of the spliced vector is therefore calculated according to the word order probability.
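The enumeration can be sketched as follows; the toy scorer standing in for p(word | context) is an assumption, since the patent does not specify how the conditional probabilities are computed.

```python
# Minimal sketch: traverse the T! chain decompositions and keep the
# arrangement with the maximum word order probability. The crc32-based
# scorer is a deterministic stand-in for a trained language model.
import itertools
import zlib

def cond_prob(word, context):
    # Stand-in for p(word | context); a trained model would go here.
    key = (word + "|" + ",".join(context)).encode("utf-8")
    return (zlib.crc32(key) % 999 + 1) / 1000.0

words = ["x1", "x2", "x3"]                       # T = 3, so 3! = 6 orders
best_order, best_p = None, -1.0
for order in itertools.permutations(range(len(words))):
    p, seen = 1.0, []
    for idx in order:                            # "guess" words in this order
        p *= cond_prob(words[idx], seen)
        seen.append(words[idx])
    if p > best_p:
        best_order, best_p = order, p

print(best_order, best_p)  # arrangement with the maximum word order probability
```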
Calculating the factorization vector of the spliced vector specifically comprises: calculating the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred; determining the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order; and combining adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
Assume the initial text includes 5 text participles x1, x2, x3, x4, x5, and the corresponding spliced vector includes 5 vector elements A1, A2, A3, A4, A5. The full permutation of the text participles includes 5! = 120 arrangements. Suppose the arrangement with the maximum word order probability is x3, x1, x2, x4, x5, with the formula P = p(x1|x3)p(x2|x1x3)p(x3)p(x4|x1x2x3)p(x5|x1x2x3x4); the participle semantic order is then x3, x1, x2, x4, x5, in which x1 and x2, as well as x4 and x5, are sequentially adjacent participle texts. The vector elements A1 and A2 in the corresponding spliced vector are therefore adjacent participle vectors, as are A4 and A5; A1 and A2 are combined into B1, A4 and A5 are combined into B2, and the spliced vector is factorized into B1, A3 and B2. This realizes dimension reduction of the spliced vector, reduces the data size, and improves training and calculation speed. If each element in the spliced vector carries a sequence number, adjacent participle vectors can be searched as follows: acquire the first element position identifier, in the spliced vector, of a first element at any position in the participle semantic order, and, in the preset order, the second element position identifier of the second element adjacent to the first element; then perform a self-increment step operation on the first element position identifier to obtain a predicted position identifier, the self-increment step being one sequence number of the spliced vector. If the predicted position identifier differs from the second element position identifier, a new first element position is acquired; if they are the same, the first element and the second element are determined to be adjacent participle vectors, the second element position identifier is redefined as the first element position identifier, the second element is taken as the first element at any position in the participle semantic order, and the above steps are repeated until all adjacent participle vectors in the spliced vector are found. An adjacent participle vector may comprise two elements, three elements, four elements and so on; the embodiment of the invention does not limit the number of elements an adjacent participle vector comprises.
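A minimal sketch of this merging step follows; numpy and summation as the combination operator are assumptions, since the patent does not specify how two adjacent participle vectors are combined.

```python
# Minimal sketch: factorize a spliced vector by merging vector elements whose
# text participles are sequentially adjacent in the participle semantic order.
import numpy as np

dim = 8
spliced = [np.random.randn(dim) for _ in range(5)]  # elements A1..A5
semantic_order = [3, 1, 2, 4, 5]                    # 1-indexed, max-probability order

def factorize(spliced, semantic_order):
    # Original positions whose semantic-order neighbour is the next original
    # position get merged into one factor (e.g. A1 and A2 -> B1).
    merge_next = {a for a, b in zip(semantic_order, semantic_order[1:])
                  if b == a + 1}
    factors, i = [], 1
    while i <= len(spliced):
        vec, j = spliced[i - 1], i
        while j in merge_next:          # extend a run of consecutive positions
            vec = vec + spliced[j]      # combine adjacent participle vectors
            j += 1
        factors.append(vec)
        i = j + 1
    return factors

factors = factorize(spliced, semantic_order)
print(len(factors))  # 3 -> B1 (A1+A2), A3, B2 (A4+A5)
```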
205. Extracting the attention features of the factorization vector according to a preset self-attention mechanism.
The extraction process of the self-attention feature comprises: calculating the similarity between the query and each key to obtain weights; normalizing the weights with a softmax function; and finally performing a weighted sum of the weights and the corresponding values to obtain the attention feature, where key and value are the same, i.e. key equals value. The factorization vector and the preset self-attention mechanism are used to extract the intent of the spliced vector, so as to obtain text codes with evidently identical intent.
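A minimal sketch of this step follows, assuming numpy and toy sizes; the 1/sqrt(d) scaling is a common convention added here, not something the patent specifies.

```python
# Minimal sketch of the self-attention step: similarity of query and key,
# softmax normalization, then a weighted sum over the values (key == value).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(factors):
    q = k = v = factors                 # query, key and value all equal here
    weights = softmax(q @ k.T / np.sqrt(factors.shape[1]))
    return weights @ v                  # weighted sum -> attention features

factors = np.random.randn(3, 8)          # e.g. B1, A3, B2 from the previous step
attention_features = self_attention(factors)
print(attention_features.shape)          # (3, 8)
```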
206. Randomly sampling the factorization vector based on the vector mean vector and the vector standard deviation vector of the factorization vector to generate a sampling sample.
A vector-quantized variational mechanism is adopted in this step to obtain randomly sampled samples of lower dimensionality. In the prior art, the input is converted into a vector code whose latent space may be discontinuous or allow only simple interpolation. In the bilingual translation task of machine translation, the encoder outputs a definite multidimensional feature tensor, and, owing to the particularity of the translation task, latent semantic features, grammatical features and text length all affect the accuracy and repeatability of translation. If, instead, the output of the encoder is not a definite multidimensional tensor but hidden features obeying a certain random distribution, randomly sampling those features to ensure the richness and diversity of language can improve the accuracy and repeatability of translation.
The process of generating the sampling sample specifically comprises: counting the vector mean vector and the vector standard deviation vector of the factorization vector; and randomly sampling the factorization vector according to the vector mean vector and the vector standard deviation vector to generate the sampling sample. The data distribution characteristics of the factorization vector are counted and summarized, and two vectors of the same size are output, namely the vector mean vector and the vector standard deviation vector. Data obeying this constraint are then randomly sampled based on the vector mean vector and the vector standard deviation vector; the latent space of samples drawn by such random sampling is continuous and allows interpolation.
Counting the vector mean vector and the vector standard deviation vector of the factorization vector specifically comprises: counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector; calculating the KL divergence of the first probability distribution function and the second probability distribution function; if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector; and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
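A minimal sketch of this statistics-and-sampling step follows, under the assumption that both preset probability distributions are diagonal Gaussians (the patent leaves the distribution families unspecified), for which the KL divergence has a closed form; sampling uses the standard reparameterization trick.

```python
# Minimal sketch, assuming diagonal Gaussians for both preset distributions.
import numpy as np

def kl_diag_gaussians(mu1, std1, mu2, std2):
    # Closed-form KL( N(mu1, std1^2) || N(mu2, std2^2) ), summed over dims.
    return np.sum(np.log(std2 / std1)
                  + (std1**2 + (mu1 - mu2)**2) / (2 * std2**2) - 0.5)

def sample(mu, std, rng=np.random.default_rng(0)):
    # Reparameterized random sampling from the vector mean / std vectors.
    return mu + std * rng.standard_normal(mu.shape)

factors = np.random.randn(3, 8)
mu, std = factors.mean(axis=0), factors.std(axis=0)   # mean / std vectors

kl = kl_diag_gaussians(mu, std, np.zeros_like(mu), np.ones_like(std))
print(kl >= 0)                  # KL divergence is non-negative; 0 iff identical
print(sample(mu, std).shape)    # (8,) -> one sampling sample
```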
After the sampling sample is generated, a residual neural network can be combined to avoid gradient explosion and gradient vanishing during back-propagation: the upper-layer input is added in before the activation of the second linear layer. This can reduce the cross entropy of the abstract representation while the decoder's gradient is updated and accelerate convergence.
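A minimal sketch of such a residual connection follows; numpy, ReLU and the toy sizes are assumptions.

```python
# Minimal sketch of the residual connection described above: the upper-layer
# input is added before the activation of the second linear layer.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

dim = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((dim, dim)) * 0.1
W2 = rng.standard_normal((dim, dim)) * 0.1

def residual_block(x):
    h = relu(x @ W1)           # first linear layer plus activation
    return relu(x + h @ W2)    # add the upper-layer input, then activate

x = rng.standard_normal(dim)
print(residual_block(x).shape)  # (8,)
```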
207. Generating the characteristic word vector set of the initial text according to the sampling sample and the attention feature.
The characteristic word vector set is a set of text codes that are similar to, but not identical with, the sampling sample, generated on the basis of the sampling sample. Specifically, the characterization vector set of the initial text is generated according to a preset dimension adjustment rule, the preset dimension adjustment rule being characterized by: z_h = αe_h + (1 - α)q_h, where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
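A minimal sketch of the dimension adjustment rule follows; numpy and the fixed α value are assumptions (α is a learning parameter in practice).

```python
# Minimal sketch of z_h = alpha * e_h + (1 - alpha) * q_h, combining the
# attention features with the random sampling result.
import numpy as np

alpha = 0.7                     # learning parameter, fixed here for illustration
e_h = np.random.randn(3, 8)     # attention features
q_h = np.random.randn(3, 8)     # random sampling results

z_h = alpha * e_h + (1 - alpha) * q_h   # characterization vector set
print(z_h.shape)  # (3, 8)
```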
Steps 204 to 207 constitute the working process of the preset encoder, which maps the initial text to the token word vector set through the four layers of calculation above. The token word vector set refers to a set of word vector tensors that have the same intent as the original text words but express different high-dimensional spaces. The invention aims to output a rich and diverse set of texts without changing the text meaning, so as to complete text paraphrase of the initial text and to collect a large amount of similar-text data for supervised-learning tasks in natural language processing such as text summarization and machine translation.
208. Inputting the token word vector set into a preset decoder and resolving the similar text of the initial text.
The role of the preset decoder is the inverse of the preset encoder: it converts fixed-length variables into an output sequence of indefinite length. The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. Illustratively, machine translation is a generative task, while synonym judgment is a sequence task. The token word vector set is taken as input, resolved by the preset decoder, and the similar text is output.
The invention provides a method for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for generating a similar text, as shown in fig. 3, where the device includes:
an obtaining module 31, configured to obtain text participles of an initial text;
the searching module 32 is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generating module 33 is configured to splice the text word vector and the relative position vector of the text word vector to generate a spliced vector;
a second generating module 34, configured to input the concatenation vector into a preset encoder, and generate a token word vector set of the initial text;
and the resolving module 35 is configured to input the set of token word vectors into a preset decoder, and resolve the similar text of the initial text.
The invention provides a device for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides another similar text generation apparatus, as shown in fig. 4, where the apparatus includes:
an obtaining module 41, configured to obtain text participles of an initial text;
the searching module 42 is configured to search a text word vector of the text word segmentation according to a preset word vector algorithm;
a first generating module 43, configured to splice the text word vector and the relative position vector of the text word vector to generate a spliced vector;
a second generating module 44, configured to input the concatenation vector into a preset encoder, and generate a token word vector set of the initial text;
and the resolving module 45 is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
Further, the obtaining module 41 includes:
an input unit 411, configured to input the initial text into a preset jieba word segmentation model;
an obtaining unit 412, configured to acquire the text participles output by the jieba word segmentation model.
Further, the second generating module 44 includes:
a calculating unit 441, configured to calculate the factorization vector of the spliced vector according to the word order probability of the spliced vector, wherein the spliced vector is stored in a blockchain;
It is emphasized that, in order to further ensure the privacy and security of the spliced vector, the spliced vector may also be stored in a node of a blockchain.
An extracting unit 442, configured to extract attention features of the factorized vector according to a preset self-attention mechanism;
a sampling unit 443 configured to randomly sample the factorized vector based on a vector mean vector and a vector standard deviation vector of the factorized vector to generate a sampling sample;
a generating unit 444, configured to generate the characteristic word vector set of the initial text according to the sampling sample and the attention feature.
Further, the calculation unit 441 includes:
a calculating subunit 4411, configured to calculate the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred;
a determining subunit 4412, configured to determine the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order;
a generating subunit 4413, configured to combine adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
Further, the sampling unit 443 includes:
a statistics subunit 4431, configured to count a vector mean vector and a vector standard deviation vector of the factorized vectors;
a sampling subunit 4432, configured to randomly sample the factorization vector according to the vector mean vector and the vector standard deviation vector to generate a sampling sample.
Further, the statistics subunit 4431 is configured to:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector;
calculating the KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector;
and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
Further, the generating unit 444 is configured to:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the preset dimension adjustment rule is characterized by:
z_h = αe_h + (1 - α)q_h
where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
The invention provides a device for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
According to an embodiment of the present invention, a computer storage medium is provided; the computer storage medium stores at least one executable instruction, and the executable instruction can cause a processor to execute the method for generating similar text in any of the above method embodiments.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computer device.
As shown in fig. 5, the computer apparatus may include: a processor (processor)502, a Communications Interface 504, a memory 506, and a communication bus 508.
Wherein: the processor 502, communication interface 504, and memory 506 communicate with one another via a communication bus 508.
A communication interface 504 for communicating with network elements of other devices, such as clients or other servers.
The processor 502 is configured to execute the program 510, and may specifically perform relevant steps in the above embodiment of the method for generating a similar text.
In particular, program 510 may include program code that includes computer operating instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the invention. The computer device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 506 for storing a program 510. The memory 506 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 510 may specifically be used to cause the processor 502 to perform the following operations:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for generating similar texts, comprising:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
2. The method of claim 1, wherein the obtaining text participles of the initial text comprises:
inputting the initial text into a preset jieba word segmentation model;
and acquiring the text participles output by the jieba word segmentation model.
3. The method of claim 1, wherein the inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text comprises:
calculating a factorization vector of the spliced vector according to the word order probability of the spliced vector, wherein the spliced vector is stored in a blockchain;
extracting attention features of the factorization vectors according to a preset self-attention mechanism;
randomly sampling the factorization vector based on the vector mean vector and the vector standard deviation vector of the factorization vector to generate a sampling sample;
and generating a set of characterization word vectors of the initial text according to the sampling samples and the attention features.
4. The method of claim 3, wherein the calculating a factorization vector of the spliced vector according to the word order probability of the spliced vector comprises:
calculating the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred;
determining the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order;
and combining adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
5. The method of claim 3, wherein the randomly sampling the factorization vector based on a vector mean vector and a vector standard deviation vector of the factorization vector to generate a sampling sample comprises:
counting the vector mean vector and the vector standard deviation vector of the factorization vector;
and randomly sampling the factorization vector according to the vector mean vector and the vector standard deviation vector to generate the sampling sample.
6. The method of claim 5, wherein the counting the vector mean vector and the vector standard deviation vector of the factorization vector comprises:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector;
calculating the KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector;
and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
7. The method of claim 3, wherein the generating a set of token word vectors of the initial text according to the sampling sample and the attention feature comprises:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the preset dimension adjustment rule is characterized by:
z_h = αe_h + (1 - α)q_h
where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
8. A device for generating similar text, comprising:
the acquisition module is used for acquiring text participles of the initial text;
the searching module is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the splicing vector into a preset encoder to generate a characteristic word vector set of the initial text;
and the resolving module is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
9. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the method for generating similar text according to any one of claims 1-7.
10. A computer device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the generation method of the similar text according to any one of claims 1-7.
CN202010341544.XA 2020-04-27 2020-04-27 Similar text generation method and device Active CN111680494B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device
PCT/CN2020/117946 WO2021218015A1 (en) 2020-04-27 2020-09-25 Method and device for generating similar text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Publications (2)

Publication Number Publication Date
CN111680494A true CN111680494A (en) 2020-09-18
CN111680494B CN111680494B (en) 2023-05-12

Family

ID=72452258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341544.XA Active CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Country Status (2)

Country Link
CN (1) CN111680494B (en)
WO (1) WO2021218015A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338129B (en) * 2021-12-24 2023-10-31 中汽创智科技有限公司 Message anomaly detection method, device, equipment and medium
CN114742029B (en) * 2022-04-20 2022-12-16 中国传媒大学 Chinese text comparison method, storage medium and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802888A (en) * 2017-01-12 2017-06-06 北京航空航天大学 Term vector training method and device
CN109145315A (en) * 2018-09-05 2019-01-04 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
JP2019109654A (en) * 2017-12-18 2019-07-04 ヤフー株式会社 Similar text extraction device, automatic response system, similar text extraction method, and program
CN110110045A (en) * 2019-04-26 2019-08-09 腾讯科技(深圳)有限公司 A kind of method, apparatus and storage medium for retrieving Similar Text
CN110147535A (en) * 2019-04-18 2019-08-20 平安科技(深圳)有限公司 Similar Text generation method, device, equipment and storage medium
CN110209801A (en) * 2019-05-15 2019-09-06 华南理工大学 A kind of text snippet automatic generation method based on from attention network
CN110362684A (en) * 2019-06-27 2019-10-22 腾讯科技(深圳)有限公司 A kind of file classification method, device and computer equipment
CN110399454A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of text code representation method based on transformer model and more reference systems
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
KR20200015418A (en) * 2018-08-02 2020-02-12 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201706047D0 (en) * 2017-04-14 2017-05-31 Digital Genius Ltd Automated tagging of text
CN110135507A (en) * 2019-05-21 2019-08-16 西南石油大学 A kind of label distribution forecasting method and device
CN110619127B (en) * 2019-08-29 2020-06-09 内蒙古工业大学 Mongolian Chinese machine translation method based on neural network turing machine
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Linfeng Song et al.: "A Graph-to-Sequence Model for AMR-to-Text Generation", arXiv:1805.02473v3 *
Sam Wiseman et al.: "Learning Neural Templates for Text Generation", arXiv:1808.10122v3 *
Wu Renshou et al.: "Short Text Summarization Method with Global Self-Matching Mechanism", Journal of Software *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021218015A1 (en) * 2020-04-27 2021-11-04 平安科技(深圳)有限公司 Method and device for generating similar text
CN112395385A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112395385B (en) * 2020-11-17 2023-07-25 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113822034A (en) * 2021-06-07 2021-12-21 腾讯科技(深圳)有限公司 Method and device for repeating text, computer equipment and storage medium
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN114357974A (en) * 2021-12-28 2022-04-15 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021218015A1 (en) 2021-11-04
CN111680494B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN111680494B (en) Similar text generation method and device
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN106502985B (en) neural network modeling method and device for generating titles
Zhang et al. Deep Neural Networks in Machine Translation: An Overview.
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN112464641A (en) BERT-based machine reading understanding method, device, equipment and storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN111428490B (en) Reference resolution weak supervised learning method using language model
CN112232087A (en) Transformer-based specific aspect emotion analysis method of multi-granularity attention model
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN114064117A (en) Code clone detection method and system based on byte code and neural network
Yu et al. Make it directly: Event extraction based on tree-LSTM and bi-GRU
CN112464655A (en) Word vector representation method, device and medium combining Chinese characters and pinyin
CN115687567A (en) Method for searching similar long text by short text without marking data
CN115437626A (en) OCL statement automatic generation method and device based on natural language
CN112528653B (en) Short text entity recognition method and system
CN112989829A (en) Named entity identification method, device, equipment and storage medium
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
Acharjee et al. Sequence-to-sequence learning-based conversion of pseudo-code to source code using neural translation approach
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant