CN111680494B - Similar text generation method and device

Similar text generation method and device

Info

Publication number
CN111680494B
Authority
CN
China
Prior art keywords
vector
text
word
preset
spliced
Prior art date
Legal status
Active
Application number
CN202010341544.XA
Other languages
Chinese (zh)
Other versions
CN111680494A
Inventor
骆加维
吴信朝
龚连银
周宝
陈远旭
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
2020-04-27
Filing date
2020-04-27
Publication date
2023-05-12
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010341544.XA
Publication of CN111680494A
Priority to PCT/CN2020/117946 (WO2021218015A1)
Application granted
Publication of CN111680494B

Classifications

    • G06F40/216 — Natural language analysis; parsing using statistical methods
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Natural language analysis; semantic analysis
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods


Abstract

The invention discloses a method and a device for generating similar text, relates to the technical field of semantic analysis, and aims to solve the prior-art problem that the actual semantics of a generated similar text often differ from those of the initial text. The method mainly comprises the following steps: acquiring the text word segmentation obtained by segmenting the initial text; searching the text word vector of each text word according to a preset word vector algorithm; splicing the text word vector with the relative position vector of the text word vector to generate a spliced vector; inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text; and inputting the characterization word vector set into a preset decoder to resolve the similar text of the initial text. The invention is mainly applied to natural language processing. In addition, the invention also relates to blockchain technology; the spliced vector can be stored in a blockchain node.

Description

Similar text generation method and device
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a method and a device for generating similar texts.
Background
With the continuous development of artificial intelligence, man-machine interaction systems are being applied more and more widely. When such a system is used, the text information input by a user, or the text information obtained by voice conversion, may not convey the meaning the user actually intends. To avoid misinterpreting user input, the input is often converted into several accurate expressions by training in a bilingual or multilingual environment. Bilingual translation models, however, suffer from grammatical and semantic bias and from text alignment problems.
In the prior art, a current similar text of the initial text is calculated by a first neural network model, and the discrimination probability of the initial text against the current similar text is then calculated by a second neural network model. Whether this discrimination probability equals a preset probability value is then judged; if not, the first neural network model is optimized according to a preset model optimization strategy and the current similar text is recalculated with the optimized model. The judgment is repeated in a loop, and once the calculated discrimination probability equals the preset value, the current similar text is taken as the target similar text.
In the course of this research, the inventor found that when the prior art calculates similar text with this neural network method, the discrimination rests mainly on the model parameters of the first and second neural network models. Those parameters are obtained from training data, so the calculated similar text depends heavily on the training data and only weakly on the initial text itself, which easily causes the actual semantics of the similar text to diverge from those of the initial text.
Disclosure of Invention
In view of this, the invention provides a method and a device for generating similar text, with the main aim of solving the prior-art problem that the actual semantics of a generated similar text differ from those of the initial text.
According to one aspect of the present invention, there is provided a method for generating a similar text, including:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and inputting the characterization word vector set into a preset decoder, and resolving the similar text of the initial text.
According to another aspect of the present invention, there is provided a similar text generating apparatus including:
the acquisition module is used for acquiring text segmentation of the initial text;
the searching module is used for searching the text word vector of the text word according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and the resolving module is used for inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
According to still another aspect of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the method of generating similar text as described above.
According to still another aspect of the present invention, there is provided a computer apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the generation method of the similar text.
By means of the technical scheme, the technical scheme provided by the embodiment of the invention has at least the following advantages:
the invention provides a method and a device for generating a similar text, which are characterized in that firstly, text word segmentation of an initial text is obtained, then text word vectors of the text word segmentation are searched according to a preset word vector algorithm, then relative position vectors of the text word vectors and the text word vectors are spliced to generate spliced vectors, the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text, and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention generates the characteristic word vector combination of the initial text by taking the spliced vector of the relative position vector and the text word vector as input and presetting the encoder, wherein each text word has a context relation by the relative position vector, so that the position information contained in words of different segments in the same long sentence is the same, the relevance of the context is improved, and the semantic similarity of the similar text and the initial text is further improved.
The foregoing is merely an overview of the technical solution of the invention. In order that the technical means of the invention may be understood more clearly and implemented according to the contents of the description, and in order to make the above and other objects, features and advantages of the invention more apparent, preferred embodiments are described below in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flowchart of a method for generating similar text according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for generating similar text according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the constitution of a device for generating a similar text according to an embodiment of the present invention;
FIG. 4 is a block diagram showing another similar text generating apparatus according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for generating similar texts, which is shown in fig. 1 and comprises the following steps:
101. Acquiring text word segmentation of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the input are usually what is needed for question answering, recommendation or search. The initial text refers to the text input by the user or the text obtained after voice conversion. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. The initial text may be segmented with a character-string-matching method, an understanding-based method or a statistical method; the embodiment of the invention does not limit which word segmentation method is adopted.
102. According to a preset word vector algorithm, searching a text word vector of the text word segmentation.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vec algorithm, or the like. word2vec is a method that trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; training is accelerated by two means, hierarchical softmax and negative sampling. The preset word2vec algorithm here is an already-trained model, so the text word vectors of the text words can be looked up directly.
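As a minimal sketch of this lookup step (assuming a gensim word2vec model already trained and saved; the file name and tokens below are hypothetical, not taken from the patent):

```python
# Hypothetical sketch: direct lookup of text word vectors in a trained word2vec model.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("word2vec_zh.kv")           # illustrative path only
tokens = ["人机", "交互", "系统"]                   # text word segmentation of the initial text
word_vectors = [wv[t] for t in tokens if t in wv]  # the "search" is a direct table lookup
```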
103. Splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector.
Each text word vector can be identified by its relative position or its absolute position in the initial text. With absolute positions, words in different segments of the same long sentence would carry the same position information, whereas in practice that information should differ; the present application therefore uses relative positions to distinguish the text word vectors effectively. The relative position vectors form a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so splicing is performed by direct addition under the ordinary rules of matrix arithmetic.
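A minimal sketch of the splicing step, assuming a learned embedding table for clipped relative distances and element-wise addition; reducing the i-by-j relative-position matrix to one vector per word by averaging is an assumption made here for illustration:

```python
import numpy as np

def splice(word_vecs: np.ndarray, max_dist: int = 8, seed: int = 0) -> np.ndarray:
    """word_vecs: (T, d) text word vectors. Returns the (T, d) spliced vectors."""
    T, d = word_vecs.shape
    rel_table = np.random.default_rng(seed).normal(0.0, 0.02, size=(2 * max_dist + 1, d))
    # Entry (i, j) identifies the relative position between the i-th and j-th words.
    dist = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None], -max_dist, max_dist)
    rel = rel_table[dist + max_dist]            # (T, T, d) relative position vectors
    # Same dimension as the word vectors, so splicing is direct addition.
    return word_vecs + rel.mean(axis=1)

spliced = splice(np.zeros((5, 16)))             # toy input: 5 words, dimension 16
```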
104. Inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text.
The function of the preset encoder is to transform an input sequence of indefinite length into a variable of definite length; this is usually implemented with a recurrent neural network. In other words, the spliced vector is converted into a synonymous characterization word vector set, where the characterization word vector set refers to a set of word vector tensors in different high-dimensional spaces that express the same intention as the initial text words. The preset encoder may adopt a deep neural network, a recursive variational network, a sum-product network, or the like; the embodiment of the present application does not limit the specific method adopted by the preset encoder.
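As an illustrative sketch only (the patent does not fix the network type), a recurrent encoder that maps a variable-length spliced sequence to a fixed-length state could look like this in PyTorch:

```python
import torch
import torch.nn as nn

# Sketch: a GRU as the "preset encoder", mapping an indefinite-length input
# sequence of spliced vectors to a definite-length hidden state.
encoder = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
spliced = torch.randn(1, 5, 16)     # (batch, sequence length T, vector dimension)
outputs, h_n = encoder(spliced)     # h_n: (1, 1, 32) fixed-length variable
```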
The aim of the invention is to output a rich and varied set of texts without changing the meaning of the text, so as to complete a re-description of the initial text and to collect a large amount of similar text data that can be used in natural language processing tasks requiring supervised learning, such as text summarization and machine translation.
105. Inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
The function of the preset decoder is the inverse of that of the preset encoder: it converts the fixed-length variable back into an output sequence of indefinite length. The preset decoder is designed according to the downstream task, and downstream tasks fall into two categories, generative tasks and sequence tasks. Illustratively, machine translation is a generative task and synonym determination is a sequence task. Taking the characterization word vector set as input, the similar text is output through the resolving of the preset decoder.
The invention provides a method for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
The embodiment of the invention provides another method for generating similar texts, which is shown in fig. 2 and comprises the following steps:
201. Acquiring text word segmentation of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the input are usually what is needed for question answering, recommendation or search. The initial text refers to the text input by the user or the text obtained after voice conversion. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification, and the initial text may be segmented as follows: input the initial text into a preset Jieba word segmentation model, and obtain the text word segmentation output by the Jieba word segmentation model.
Chinese word segmentation of a sentence in Jieba comprises: efficient word-graph scanning based on a Trie structure, generating a directed acyclic graph of all possible word formations of the Chinese characters in the sentence; dynamic programming to search the maximum-probability path and find the maximum segmentation combination based on word frequency; and, for unregistered words, an HMM model based on the word-forming capability of Chinese characters, solved with the Viterbi algorithm. The initial text is segmented after loading and adjusting the dictionary, and keywords can then be extracted with the TF-IDF algorithm or the TextRank algorithm.
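A minimal usage sketch of the Jieba steps just described (assuming the jieba package; the sample sentence and user dictionary file are illustrative):

```python
import jieba
import jieba.analyse

jieba.load_userdict("user_dict.txt")     # load/adjust the dictionary (illustrative file)
sentence = "人机交互系统的应用越来越广泛"
tokens = jieba.lcut(sentence)            # DAG + max-probability path, HMM for unregistered words
kw_tfidf = jieba.analyse.extract_tags(sentence, topK=3)   # keywords via TF-IDF
kw_rank = jieba.analyse.textrank(sentence, topK=3)        # keywords via TextRank
print(tokens, kw_tfidf, kw_rank)
```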
202. According to a preset word vector algorithm, searching a text word vector of the text word segmentation.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vec algorithm, or the like. word2vec is a method that trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; training is accelerated by two means, hierarchical softmax and negative sampling. The preset word2vec algorithm here is an already-trained model, so the text word vectors of the text words can be looked up directly.
203. Splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector.
Each text word vector can be identified by its relative position or its absolute position in the initial text. With absolute positions, words in different segments of the same long sentence would carry the same position information, whereas in practice that information should differ; the present application therefore uses relative positions to distinguish the text word vectors effectively. The relative position vectors form a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so splicing is performed by direct addition under the ordinary rules of matrix arithmetic.
204. Calculating the factorization vector of the spliced vector according to the word order probability of the spliced vector.
For a better understanding of the scheme, word order probability is illustrated here. For a given sequence x of length T, there are T! possible arrangements in total, corresponding to T! chain decompositions. Assuming the spliced vector x = x1 x2 x3, there are 3! = 6 decompositions in total, where p(x2|x1 x3) refers to the probability that the second word is x2 under the condition that the first word is x1 and the third word is x3, that is, the original word order is maintained. The T! decompositions are traversed with shared model parameters, so that context can be learned while the factorization vector is extracted. An ordinary left-to-right or right-to-left language model can only learn dependency in one direction, for example "guessing" the first word, then "guessing" the second word from the first word, and "guessing" the third word from the first two. A permutation language model, by contrast, learns word order probabilities of various orders, such as p(x) = p(x1|x3) p(x2|x1 x3) p(x3), which corresponds to the order 3→1→2: the third word is "guessed" first, then the first word is guessed from the third word, and finally the second word is guessed from the first and third words. If the context dependency relationship is consistent with the text order, text in the same order has a unique meaning, and the likelihood that its similar text can be obtained from that unique meaning is great; the factorization vector of the spliced vector is therefore calculated with the word order probability.
Calculating the factorization vector of the spliced vector specifically comprises: calculating the word order probability of the initial text according to the spliced vector, where the word order probability refers to the conditional probability of each arrangement of the text words within the full permutation, the condition being that all words arranged before the current word in that arrangement have already occurred; determining the arrangement of the text words corresponding to the maximum word order probability as the word segmentation semantic order; and merging adjacent word segmentation vectors (vector elements of the spliced vector corresponding to text words that are sequentially adjacent in the word segmentation semantic order) to generate the factorization vector of the spliced vector.
Assume that the initial text includes 5 text words x1, x2, x3, x4, x5 and that the corresponding spliced vector includes 5 vector elements A1, A2, A3, A4, A5. The full permutation of the text words includes 5! = 120 arrangements, among which the order with the highest word order probability is x3, x1, x2, x4, x5, calculated as p = p(x1|x3) p(x2|x1 x3) p(x3) p(x4|x1 x2 x3) p(x5|x1 x2 x3 x4). The word segmentation semantic order is therefore x3, x1, x2, x4, x5, in which x1 and x2, and likewise x4 and x5, are sequentially adjacent word texts, so the corresponding vector elements A1 and A2 of the spliced vector are adjacent word segmentation vectors, as are A4 and A5. A1 and A2 are merged into B1, A4 and A5 are merged into B2, and the factorization vector of the spliced vector is A3, B1, B2. This reduces the dimension of the spliced vector, which shrinks the data size and improves training and calculation speed. If the elements of the spliced vector are numbered sequentially, adjacent word segmentation vectors can be searched as follows: obtain, in a preset order, the position identifier of a first element at any position of the word segmentation semantic order within the spliced vector, together with the position identifier of the second element adjacent to that first element; perform a self-increasing step operation on the first position identifier to obtain a predicted position identifier, the self-increasing step being the numbering interval of the sequential numbering of the spliced vector; if the predicted position identifier differs from the second position identifier, acquire a new first element; if the predicted position identifier is identical to the second position identifier, determine that the first element and the second element are adjacent word segmentation vectors, redefine the second position identifier as the first position identifier, take the next element of the word segmentation semantic order as the second element, and repeat the above steps until all adjacent word segmentation vectors in the spliced vector have been found. An adjacent word segmentation vector may comprise two, three, four or more elements; the embodiment of the invention does not limit the number of elements an adjacent word segmentation vector contains.
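The merging step can be sketched as follows; runs of words whose original indices are consecutive within the semantic order are merged, and averaging is used as the (assumed) merge operation, since the patent does not fix one:

```python
import numpy as np

def factorization_vector(spliced: np.ndarray, semantic_order: list) -> list:
    """spliced: (T, d) spliced vector; semantic_order: 0-based token indices in
    semantic order, e.g. [2, 0, 1, 3, 4] for x3, x1, x2, x4, x5."""
    runs, run = [], [semantic_order[0]]
    for i in semantic_order[1:]:
        if i == run[-1] + 1:        # sequentially adjacent in the original text
            run.append(i)
        else:
            runs.append(run)
            run = [i]
    runs.append(run)
    # Merge each run into one element (mean is an illustrative choice):
    # [2], [0, 1], [3, 4]  ->  A3, B1 (A1 and A2 merged), B2 (A4 and A5 merged)
    return [spliced[r].mean(axis=0) for r in runs]

fac = factorization_vector(np.arange(10.0).reshape(5, 2), [2, 0, 1, 3, 4])
```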
205. The attention features of the factorized vector are extracted according to a preset self-attention mechanism.
The extraction of the self-attention feature proceeds as follows: the similarity between the query and each key is calculated to obtain a weight, and a softmax function then normalizes the weights; finally, the weights and the corresponding values are weighted and summed to obtain the attention feature, where key and value are the same here, i.e. Key = Value. In this way the intention of the spliced vector is extracted through the factorization vector and the preset self-attention mechanism, yielding distinct text encodings with the same intention.
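A compact sketch of this self-attention step with Key = Value, omitting learned projection matrices (an assumption made for brevity):

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (T, d) factorization vectors; returns (T, d) attention features."""
    q, k, v = x, x, x                                  # key and value are the same
    scores = q @ k.T / np.sqrt(x.shape[1])             # query-key similarity weights
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax normalization
    return w @ v                                       # weighted sum -> attention feature

features = self_attention(np.random.default_rng(1).normal(size=(3, 8)))
```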
206. The factorization vector is randomly sampled based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample.
The method adopts a vector-quantized variational mechanism, and this step obtains a randomly sampled sample of lower dimension. In the prior art, the input is converted into a vector encoding whose latent space may be discontinuous or allow only simple interpolation. In a bilingual machine translation task, the encoder outputs an explicit multidimensional feature tensor, and because of the particularity of the translation task, latent semantic features, grammatical features and text length all affect translation accuracy and repeatability. Here the hidden features are not a fixed multidimensional tensor but obey a certain random distribution; random sampling from that distribution guarantees the richness and diversity of the language, thereby improving translation accuracy and repeatability.
Generating the sampling sample specifically comprises: counting the vector average vector and the vector standard deviation vector of the factorization vector; and randomly sampling the factorization vector according to the vector average vector and the vector standard deviation vector to generate the sampling sample. The data distribution characteristics of the factorization vector are counted and then generalized to output two vectors of the same size, namely the vector average vector and the vector standard deviation vector. Data subject to this constraint is then randomly sampled based on the two vectors; the latent space of the randomly sampled samples is continuous and allows interpolation.
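A sketch of the statistics-then-sampling step; drawing from a Gaussian parameterized by the two statistic vectors is an assumption (the patent only specifies a mean vector and a standard deviation vector):

```python
import numpy as np

def sample_latent(fac: np.ndarray, seed: int = 2) -> np.ndarray:
    """fac: (n, d) factorization vectors; returns one (d,) sampling sample."""
    mu = fac.mean(axis=0)                  # vector average vector
    sigma = fac.std(axis=0) + 1e-8         # vector standard deviation vector
    eps = np.random.default_rng(seed).normal(size=mu.shape)
    return mu + sigma * eps                # random sample obeying the statistics

q_h = sample_latent(np.random.default_rng(3).normal(size=(4, 8)))
```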
Counting the vector average vector and the vector standard deviation vector of the factorization vector specifically comprises: counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and a second probability distribution function according to a second preset probability distribution formula, where the dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and those of the second comprise a second average vector and a second standard deviation vector; calculating the KL divergence of the first and second probability distribution functions; if the KL divergence equals 0, determining that the factorization vector obeys the first or the second probability distribution function, taking the vector average vector to be the first or second average vector and the vector standard deviation vector to be the first or second standard deviation vector; and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
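For illustration, if the two preset probability distributions are taken to be diagonal Gaussians (an assumption, since the patent does not name the formulas), the KL divergence in the procedure above has a closed form:

```python
import numpy as np

def kl_diag_gaussians(mu1, s1, mu2, s2):
    """KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ), all arguments (d,) vectors."""
    return float(np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5))

mu1, s1 = np.zeros(4), np.ones(4)
mu2, s2 = np.full(4, 0.1), np.full(4, 1.2)
kl = kl_diag_gaussians(mu1, s1, mu2, s2)
# kl == 0 means the two fitted distributions coincide, so either mean/std pair
# can be used; otherwise mean and std are refined by minimizing the KL divergence.
```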
After the sampling sample is generated, a residual neural network can further be combined to avoid gradient explosion and gradient vanishing during back propagation: the upper-layer input is added in before the activation of the second linear layer. This reduces the cross entropy of the abstract representation during the decoder's gradient updates and accelerates convergence.
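A sketch of the residual connection described here, with the upper-layer input added before the second linear layer's activation (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d: int = 32):
        super().__init__()
        self.fc1 = nn.Linear(d, d)
        self.fc2 = nn.Linear(d, d)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.fc1(x))
        return self.act(self.fc2(h) + x)   # add the upper-layer input before activation

y = ResidualBlock()(torch.randn(1, 32))
```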
207. Generating a characterization word vector set of the initial text according to the sampling sample and the attention feature.
The characterization word vector set is a set of text encodings that is similar, but not identical, to the sampling sample, generated on the basis of the sampling sample. Specifically, the characterization vector set of the initial text is generated according to a preset dimension adjustment rule, which is characterized as z_h = α·e_h + (1-α)·q_h, where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
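A one-line sketch of the preset dimension adjustment rule (α is learned during training; a fixed value is used here purely for illustration):

```python
import numpy as np

alpha = 0.7                              # learning parameter (illustrative value)
e_h = np.ones(8)                         # attention feature from step 205
q_h = np.zeros(8)                        # random sampling result from step 206
z_h = alpha * e_h + (1 - alpha) * q_h    # characterization word vector set element
```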
Steps 204 to 207 together correspond to step 104 shown in fig. 1, i.e. inputting the spliced vector into the preset encoder to generate the characterization word vector set of the initial text, and may be regarded as an encoding process comprising a factorization layer, a self-attention layer, a vector-quantized variational layer and a fully connected layer. The characterization word vector set of the initial text is obtained through these four layers of calculation. The characterization word vector set refers to a set of word vector tensors in different high-dimensional spaces that express the same intention as the initial text words. The aim of the invention is to output a rich and varied set of texts without changing the meaning of the text, so as to complete a re-description of the initial text and to collect a large amount of similar text data for natural language processing tasks requiring supervised learning, such as text summarization and machine translation.
208. Inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text.
The function of the preset decoder is the inverse of that of the preset encoder: it converts the fixed-length variable back into an output sequence of indefinite length. The preset decoder is designed according to the downstream task, and downstream tasks fall into two categories, generative tasks and sequence tasks. Illustratively, machine translation is a generative task and synonym determination is a sequence task. Taking the characterization word vector set as input, the similar text is output through the resolving of the preset decoder.
The invention provides a method for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for generating a similar text, as shown in fig. 3, where the device includes:
an obtaining module 31, configured to obtain a text word of an initial text;
the searching module 32 is configured to search a text word vector of the text word according to a preset word vector algorithm;
a first generating module 33, configured to splice the text word vector and a relative position vector of the text word vector, to generate a spliced vector;
a second generating module 34, configured to input the spliced vector into a preset encoder, and generate a token vector set of the initial text;
and a resolving module 35, configured to input the token vector set into a preset decoder, and resolve the similar text of the initial text.
The invention provides a device for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides another apparatus for generating similar text, as shown in fig. 4, where the apparatus includes:
an obtaining module 41, configured to obtain a text word of an initial text;
a searching module 42, configured to search a text word vector of the text word according to a preset word vector algorithm;
a first generating module 43, configured to splice the text word vector and a relative position vector of the text word vector, to generate a spliced vector;
a second generating module 44, configured to input the spliced vector into a preset encoder, and generate a token vector set of the initial text;
and the resolving module 45 is used for inputting the token vector set into a preset decoder to resolve the similar text of the initial text.
Further, the obtaining module 41 includes:
an input unit 411, configured to input the initial text into a preset Jieba word segmentation model;
and an obtaining unit 412, configured to obtain the text word segmentation output by the Jieba word segmentation model.
Further, the second generating module 44 includes:
a calculating unit 441, configured to calculate an factorized vector of the concatenated vector according to a word order probability of the concatenated vector, where the concatenated vector is stored in a blockchain;
it should be emphasized that, to further ensure the privacy and security of the splice vector, the splice vector may also be stored in a blockchain node.
An extracting unit 442, configured to extract an attention feature of the factorization vector according to a preset self-attention mechanism;
a sampling unit 443, configured to randomly sample the factorized vector based on a vector average vector and a vector standard deviation vector of the factorized vector to generate a sampling sample;
and the generating unit 444 is configured to generate a token vector set of the initial text according to the sampling sample and the attention feature.
Further, the computing unit 441 includes:
a calculating subunit 4411, configured to calculate, according to the concatenation vector, a word order probability of the initial text, where the word order probability refers to a conditional probability of each arrangement mode of the text word segmentation in a full arrangement, and an occurrence condition of the conditional probability is that all word segments arranged before a current word segmentation in the arrangement mode all occur;
a determining subunit 4412, configured to determine that an arrangement order of the text word segmentation corresponding to the maximum value of the word order probability is a word segmentation semantic order;
and the generation subunit 4413 is configured to combine adjacent word segmentation vectors, and generate a factorization vector of the stitched vector, where the adjacent word segmentation vectors refer to vector elements in the stitched vector corresponding to text words that are sequentially adjacent in the word segmentation semantic sequence.
Further, the sampling unit 443 includes:
a statistics subunit 4431 for counting vector average vectors and vector standard deviation vectors of the factorization vectors;
and a sampling subunit 4432, configured to randomly sample the factorized vector according to the vector average vector and the vector standard deviation vector to generate a sampling sample.
Further, the statistics subunit 4431 is configured to:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and dependent variables of the second probability distribution function comprise a second average vector and a second standard deviation vector;
calculating KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence is equal to 0, determining that the factorized vector obeys the first probability distribution function or the second probability distribution function, determining that the vector average vector is the first average vector or the second average vector, and determining that the vector standard deviation vector is a first standard deviation vector or a second standard deviation vector;
and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector according to the factorization vector by taking the minimum value of the KL divergence as a target.
Further, the generating unit 444 is configured to:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the characteristics of the preset dimension adjustment rule are described as follows:
z_h = α·e_h + (1-α)·q_h
wherein z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
The invention provides a device for generating similar text. Text word segmentation of an initial text is first acquired; text word vectors of the text words are searched according to a preset word vector algorithm; the text word vectors are spliced with their relative position vectors to generate spliced vectors; the spliced vectors are input into a preset encoder to generate a characterization word vector set of the initial text; and finally the characterization word vector set is input into a preset decoder to calculate the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vector and the text word vector as input and generates the characterization word vector set of the initial text through the preset encoder. Because the relative position vectors give every text word a contextual relation, words in different segments of the same long sentence no longer carry identical position information; the relevance of the context is improved, and the semantic similarity between the similar text and the initial text is improved in turn.
According to one embodiment of the present invention, there is provided a computer storage medium storing at least one executable instruction for performing the method of generating similar text in any of the above-described method embodiments.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the specific embodiment of the present invention is not limited to the specific implementation of the computer device.
As shown in fig. 5, the computer device may include: a processor 502, a communication interface (Communications Interface) 504, a memory 506, and a communication bus 508.
Wherein: processor 502, communication interface 504, and memory 506 communicate with each other via communication bus 508.
A communication interface 504 for communicating with network elements of other devices, such as clients or other servers.
The processor 502 is configured to execute the program 510, and may specifically perform relevant steps in the above-described embodiment of the method for generating similar text.
In particular, program 510 may include program code including computer-operating instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computer device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
A memory 506 is used for storing the program 510. The memory 506 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
The program 510 may be specifically operable to cause the processor 502 to:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
and inputting the characterization word vector set into a preset decoder, and resolving the similar text of the initial text.
The blockchain referred to in the invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by computing devices, so that they can be stored in a storage device and executed by the computing devices (in some cases the steps shown or described may be performed in a different order than here); or they may be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for generating a similar text, comprising:
acquiring text segmentation of an initial text;
searching a text word vector of the text word according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
inputting the characterization word vector set into a preset decoder, and resolving similar texts of the initial text;
the step of inputting the spliced vector into a preset encoder to generate the characterization word vector set of the initial text comprises the following steps:
according to the word sequence probability of the spliced vector, calculating the factorization vector of the spliced vector, wherein the spliced vector is stored in a blockchain;
extracting the attention characteristic of the factorization vector according to a preset self-attention mechanism;
randomly sampling the factorization vector based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample;
generating a characterization word vector set of the initial text according to the sampling sample and the attention characteristic;
the calculating the factorization vector of the splicing vector specifically comprises the following steps:
calculating word sequence probability of the initial text according to the splicing vector, wherein the word sequence probability refers to conditional probability of each arrangement mode of the text word segmentation in full arrangement, and the occurrence condition of the conditional probability is that all word segmentation arranged before the current word segmentation in the arrangement mode is completely generated; determining the arrangement sequence of the text word segmentation corresponding to the maximum value of the word sequence probability as a word segmentation semantic sequence; and merging adjacent word segmentation vectors, which refer to vector elements in the spliced vector corresponding to text segmentation which is sequentially adjacent in the word segmentation semantic sequence, to generate a factorization vector of the spliced vector.
2. The method of claim 1, wherein the obtaining text segmentation of the initial text comprises:
inputting the initial text into a preset Jieba word segmentation model;
and obtaining the text word segmentation output by the Jieba word segmentation model.
3. The method of claim 1, wherein the calculating the factorized vector of the concatenated vector based on the word order probability of the concatenated vector comprises:
calculating word sequence probability of the initial text according to the splicing vector, wherein the word sequence probability refers to conditional probability of each arrangement mode of the text word segmentation in full arrangement, and the occurrence condition of the conditional probability is that all word segmentation arranged before the current word segmentation in the arrangement mode is completely generated;
determining the arrangement sequence of the text word segmentation corresponding to the maximum value of the word sequence probability as a word segmentation semantic sequence;
and merging adjacent word segmentation vectors, which refer to vector elements in the spliced vector corresponding to text segmentation which is sequentially adjacent in the word segmentation semantic sequence, to generate a factorization vector of the spliced vector.
4. The method of claim 1, wherein the randomly sampling the factorized vector based on a vector average vector and a vector standard deviation vector of the factorized vector to generate sampled samples comprises:
counting vector average vectors and vector standard deviation vectors of the factorization vectors;
and randomly sampling the factorization vector according to the vector average vector and the vector standard deviation vector to generate a sampling sample.
5. The method of claim 4, wherein said counting vector average vectors and vector standard deviation vectors of said factorized vectors comprises:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein dependent variables of the first probability distribution function comprise a first average vector and a first standard deviation vector, and dependent variables of the second probability distribution function comprise a second average vector and a second standard deviation vector;
calculating KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence is equal to 0, determining that the factorized vector obeys the first probability distribution function or the second probability distribution function, determining that the vector average vector is the first average vector or the second average vector, and determining that the vector standard deviation vector is a first standard deviation vector or a second standard deviation vector;
and if the KL divergence is not equal to 0, calculating the vector average vector and the vector standard deviation vector according to the factorization vector by taking the minimum value of the KL divergence as a target.
6. The method of claim 1, wherein the generating the set of token vectors for the initial text from the sampled samples and attention features comprises:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the characteristics of the preset dimension adjustment rule are described as follows:
z_h = α·e_h + (1-α)·q_h
wherein z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
7. A similar text generating apparatus, comprising:
the acquisition module is used for acquiring text segmentation of the initial text;
the searching module is used for searching the text word vector of the text word according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the spliced vector into a preset encoder to generate a characterization word vector set of the initial text;
the resolving module is used for inputting the characterization word vector set into a preset decoder and resolving the similar text of the initial text;
wherein the second generating module includes:
a calculation unit, configured to calculate a factorization vector of the concatenation vector according to a word order probability of the concatenation vector, where the concatenation vector is stored in a blockchain;
the extraction unit is used for extracting the attention characteristic of the factorization vector according to a preset self-attention mechanism;
the sampling unit is used for randomly sampling the factorization vector based on a vector average vector and a vector standard deviation vector of the factorization vector to generate a sampling sample;
the generating unit is used for generating a characterization word vector set of the initial text according to the sampling sample and the attention characteristic;
wherein the computing unit includes:
a calculating subunit, configured to calculate, according to the spliced vector, the word order probability of the initial text, wherein the word order probability refers to the conditional probability of each arrangement of the text word segmentations in the full permutation, the condition being that all word segmentations arranged before the current word segmentation in that arrangement have already occurred;
the determining subunit is used for determining the arrangement of the text word segmentations corresponding to the maximum value of the word order probability as the word segmentation semantic sequence;
and the generating subunit is used for merging adjacent word segmentation vectors, namely the vector elements of the spliced vector corresponding to text word segmentations that are adjacent in the word segmentation semantic sequence, to generate the factorization vector of the spliced vector.
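Read together, the calculating, determining and generating subunits amount to: score every permutation of the segmented words by the product of prefix-conditioned probabilities, keep the best-scoring order, then merge the spliced vectors of now-adjacent words. In the minimal sketch below, `cond_prob` is a hypothetical stand-in for whatever model supplies the conditional probabilities, averaging is a placeholder for the unspecified merge operation, and the exhaustive permutation scoring is for clarity only (it is factorial in the word count).

```python
import itertools
import numpy as np

def best_word_order(words, cond_prob):
    """Score each permutation in the full arrangement by the product of
    conditional probabilities P(word | words placed before it) and return
    the highest-scoring order (the word segmentation semantic sequence)."""
    best_order, best_p = None, -1.0
    for perm in itertools.permutations(words):
        p = 1.0
        for i, w in enumerate(perm):
            p *= cond_prob(w, perm[:i])  # condition: the whole prefix occurred
        if p > best_p:
            best_order, best_p = list(perm), p
    return best_order

def factorize(order, spliced):
    """Merge the spliced vectors of words adjacent in the semantic order;
    averaging here is a placeholder for the patent's unspecified merge."""
    vecs = [spliced[w] for w in order]
    return [(vecs[i] + vecs[i + 1]) / 2.0 for i in range(len(vecs) - 1)]

# Toy usage with a hypothetical conditional-probability model.
probs = {("b", ("a",)): 0.9, ("c", ("a", "b")): 0.8}
cond_prob = lambda w, prefix: probs.get((w, prefix), 0.1)
spliced = {w: np.ones(4) * i for i, w in enumerate(["a", "b", "c"])}
order = best_word_order(["a", "b", "c"], cond_prob)   # -> ["a", "b", "c"]
factored = factorize(order, spliced)
```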
8. A computer storage medium having stored therein at least one executable instruction, the executable instruction causing a processor to perform operations corresponding to the similar text generation method according to any one of claims 1 to 6.
9. A computer device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the similar text generation method according to any one of claims 1 to 6.
CN202010341544.XA 2020-04-27 2020-04-27 Similar text generation method and device Active CN111680494B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device
PCT/CN2020/117946 WO2021218015A1 (en) 2020-04-27 2020-09-25 Method and device for generating similar text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Publications (2)

Publication Number Publication Date
CN111680494A CN111680494A (en) 2020-09-18
CN111680494B true CN111680494B (en) 2023-05-12

Family

ID=72452258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341544.XA Active CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Country Status (2)

Country Link
CN (1) CN111680494B (en)
WO (1) WO2021218015A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device
CN112395385B (en) * 2020-11-17 2023-07-25 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN114338129B (en) * 2021-12-24 2023-10-31 中汽创智科技有限公司 Message anomaly detection method, device, equipment and medium
CN114357974B (en) * 2021-12-28 2022-09-23 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium
CN114936548A (en) * 2022-03-22 2022-08-23 北京探境科技有限公司 Method, device, equipment and storage medium for generating similar command texts
CN114742029B (en) * 2022-04-20 2022-12-16 中国传媒大学 Chinese text comparison method, storage medium and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399454A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of text code representation method based on transformer model and more reference systems

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802888B (en) * 2017-01-12 2020-01-24 北京航空航天大学 Word vector training method and device
GB201706047D0 (en) * 2017-04-14 2017-05-31 Digital Genius Ltd Automated tagging of text
JP6976155B2 (en) * 2017-12-18 2021-12-08 ヤフー株式会社 Similar text extractor, automatic response system, similar text extraction method, and program
KR20200015418A (en) * 2018-08-02 2020-02-12 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
CN109145315B (en) * 2018-09-05 2022-03-18 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment
CN110147535A (en) * 2019-04-18 2019-08-20 平安科技(深圳)有限公司 Similar Text generation method, device, equipment and storage medium
CN110110045B (en) * 2019-04-26 2021-08-31 腾讯科技(深圳)有限公司 Method, device and storage medium for retrieving similar texts
CN110209801B (en) * 2019-05-15 2021-05-14 华南理工大学 Text abstract automatic generation method based on self-attention network
CN110135507A (en) * 2019-05-21 2019-08-16 西南石油大学 A kind of label distribution forecasting method and device
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
CN110362684B (en) * 2019-06-27 2022-10-25 腾讯科技(深圳)有限公司 Text classification method and device and computer equipment
CN110619127B (en) * 2019-08-29 2020-06-09 内蒙古工业大学 Mongolian Chinese machine translation method based on neural network turing machine
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device

Also Published As

Publication number Publication date
WO2021218015A1 (en) 2021-11-04
CN111680494A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111680494B (en) Similar text generation method and device
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN106502985B (en) neural network modeling method and device for generating titles
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN112464641A (en) BERT-based machine reading understanding method, device, equipment and storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN111191002A (en) Neural code searching method and device based on hierarchical embedding
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN113691542A (en) Web attack detection method based on HTTP request text and related equipment
CN114064117A (en) Code clone detection method and system based on byte code and neural network
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN115587594A (en) Network security unstructured text data extraction model training method and system
WO2023155304A1 (en) Keyword recommendation model training method and apparatus, keyword recommendation method and apparatus, device, and medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN115374845A (en) Commodity information reasoning method and device
CN116661805A (en) Code representation generation method and device, storage medium and electronic equipment
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN113836929A (en) Named entity recognition method, device, equipment and storage medium
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant