CN111680494A - Similar text generation method and device - Google Patents

Similar text generation method and device

Info

Publication number
CN111680494A
Authority
CN
China
Prior art keywords
vector
text
word
preset
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010341544.XA
Other languages
Chinese (zh)
Other versions
CN111680494B (en)
Inventor
骆加维
吴信朝
龚连银
周宝
陈远旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010341544.XA priority Critical patent/CN111680494B/en
Publication of CN111680494A publication Critical patent/CN111680494A/en
Priority to PCT/CN2020/117946 priority patent/WO2021218015A1/en
Application granted granted Critical
Publication of CN111680494B publication Critical patent/CN111680494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method and a device for generating similar text, relates to the technical field of semantic analysis, and aims to solve the prior-art problem that the actual semantics of a similar text and its initial text are not completely the same. The method mainly comprises the following steps: acquiring the text participles obtained by segmenting an initial text; searching text word vectors of the text participles according to a preset word vector algorithm; splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector; inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text; and inputting the characteristic word vector set into a preset decoder to resolve the similar text of the initial text. The invention is mainly applied to natural language processing. In addition, the invention also relates to blockchain technology, and the spliced vector can be stored in a blockchain node.

Description

Similar text generation method and device
Technical Field
The invention relates to the technical field of semantic analysis, in particular to a method and a device for generating similar texts.
Background
With the continuous development of artificial intelligence, human-computer interaction systems are applied more and more widely. When such a system is used, the text a user types, or the text obtained by converting the user's voice, may not express what the user actually means. To avoid misinterpreting the user's input, the input is often converted into several accurate expressions by training in a bilingual or multilingual environment; however, bilingual translation models encounter problems of semantic bias and text alignment.
In the prior art, a current similar text of an initial text is calculated by a first neural network model, and the current discrimination probability of the initial text and the current similar text is calculated by a second neural network model. Whether the current discrimination probability equals a preset probability value is then judged; if not, the first neural network model is optimized according to a preset model optimization strategy and the current similar text is recalculated with the optimized first neural network model. This loop repeats until the calculated discrimination probability equals the preset probability value, at which point the current similar text is taken as the target similar text.
The inventor found in research that the prior-art scheme calculates the similar text with neural networks, so the discrimination depends mainly on the model parameters of the first and second neural network models. Because those parameters are obtained from training data, the calculated similar text depends heavily on the training data and comparatively little on the initial text, which easily results in the actual semantics of the similar text and the initial text not being completely the same.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for generating similar text, mainly aiming to solve the prior-art problem that the actual semantics of the similar text and the initial text are not completely the same.
According to an aspect of the present invention, there is provided a method for generating similar texts, including:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
According to another aspect of the present invention, there is provided a similar text generation apparatus, including:
the acquisition module is used for acquiring text participles of the initial text;
the searching module is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the splicing vector into a preset encoder to generate a characteristic word vector set of the initial text;
and the resolving module is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
According to still another aspect of the present invention, a computer storage medium is provided, and at least one executable instruction is stored in the computer storage medium, and the executable instruction causes a processor to execute operations corresponding to the generation method of the similar text.
According to still another aspect of the present invention, there is provided a computer apparatus including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the generation method of the similar text.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
the invention provides a method and a device for generating similar texts, which comprises the steps of firstly obtaining text participles of an initial text, then searching text word vectors of the text participles according to a preset word vector algorithm, splicing the text word vectors and relative position vectors of the text word vectors to generate spliced vectors, inputting the spliced vectors into a preset encoder to generate a characteristic word vector set of the initial text, and finally inputting the characteristic word vector set into a preset decoder to solve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the splicing vector of the relative position vector and the text word vector as input, and generates the combination of the representation word vector of the initial text through the preset encoder, wherein the relative position vector enables each text participle to have a 'context' relationship, so that the position information contained in different segmented words in the same long sentence is the same, the relevance of the contexts is improved, and the semantic similarity of the similar text and the initial text is further improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart illustrating a method for generating a similar text according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for generating similar texts according to the embodiment of the present invention;
FIG. 3 is a block diagram illustrating a similar text generating apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of another similar text generation apparatus provided in the embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a method for generating a similar text, which comprises the following steps as shown in figure 1:
101. Acquiring text participles of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the text or voice are usually required for question answering, recommendation or search. The initial text refers to the text entered by the user or the text obtained by voice conversion. Word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard. The initial text may be segmented with a string-matching-based method, an understanding-based method or a statistics-based method; the embodiment of the invention does not limit the word segmentation method adopted.
102. Searching text word vectors of the text participles according to a preset word vector algorithm.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm and the like. The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; hierarchical softmax and negative sampling are adopted to accelerate the training of the word2vector algorithm. The preset word2vector algorithm is a trained model algorithm, through which the text word vectors of the text participles can be looked up directly.
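As a minimal illustration of this lookup, the sketch below trains a small word2vec model with the open-source gensim library and reads a text word vector directly from it; gensim, the toy corpus and all parameter values are assumptions for illustration, not part of the patent.

```python
# Minimal sketch: look up text word vectors from a trained word2vec model.
# gensim and the toy corpus are assumptions; the patent does not fix a library.
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"],
          ["天气", "预报", "说", "明天", "下雨"]]

# hs=1 enables hierarchical softmax and negative=5 enables negative sampling,
# the two training-acceleration modes mentioned above.
model = Word2Vec(corpus, vector_size=64, window=5, min_count=1,
                 sg=1, hs=1, negative=5, epochs=50)

vec = model.wv["天气"]   # direct lookup of a participle's text word vector
print(vec.shape)         # (64,)
```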
103. Splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector.
Each text word vector may be identified by its relative or absolute position in the initial text. With absolute positions, participles in different segments of the same long sentence can carry the same position information even though their actual positions differ, so the application adopts relative positions to distinguish each text word vector effectively. The relative position vector is a vector matrix whose element in row i and column j identifies the relative position from the i-th word to the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimensionality, so splicing is performed by adding the vectors directly according to the rules of matrix arithmetic.
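A minimal sketch of this splicing step follows; numpy and the toy sizes are assumptions, and since the two vectors share the same dimensionality, "splicing" here is the direct element-wise addition described above.

```python
# Minimal sketch: splice text word vectors with their relative position vectors.
# numpy and the toy sizes are assumptions for illustration.
import numpy as np

seq_len, dim = 5, 64
word_vectors = np.random.randn(seq_len, dim)      # one row per text participle

# One relative position vector per text word vector, with the same
# dimensionality (learned in practice; random here purely for illustration).
rel_pos_vectors = np.random.randn(seq_len, dim)

# "Splicing": direct addition under the ordinary rules of matrix arithmetic.
spliced = word_vectors + rel_pos_vectors
print(spliced.shape)  # (5, 64)
```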
104. Inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text.
The preset encoder converts an input sequence of indefinite length into a variable of definite length and is usually realized by a recurrent neural network. That is, the spliced vector is converted into a synonymous token word vector set, where the token word vector set refers to a set of token vector tensors that have the same intent as the original text words but express different high-dimensional spaces. The preset encoder may adopt a deep neural network, a recursive variational structure, a product network and the like; the embodiment of the application does not limit the specific method adopted by the preset encoder.
The invention aims to output a rich and diverse set of texts without changing the text meaning, so as to complete text paraphrase of the initial text and to collect a large amount of similar-text data for supervised-learning tasks in natural language processing such as text summarization and machine translation.
105. Inputting the token word vector set into a preset decoder and resolving the similar text of the initial text.
The role of the preset decoder is the inverse of the preset encoder: it converts fixed-length variables into an output sequence of indefinite length. The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. Illustratively, machine translation is a generative task, while synonym judgment is a sequence task. The token word vector set is taken as input, resolved by the preset decoder, and the similar text is output.
The invention provides a method for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
An embodiment of the present invention provides another method for generating a similar text, as shown in fig. 2, the method includes:
201. Acquiring text participles of the initial text.
When a user inputs text or voice through a terminal, the actual semantics of the text or voice are usually required for question answering, recommendation or search. The initial text refers to the text entered by the user or the text obtained by voice conversion. Word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard, and the initial text may be segmented by the following steps: inputting the initial text into a preset jieba word segmentation model; and acquiring the text participles output by the jieba word segmentation model.
The jieba Chinese word segmentation realizes efficient word-graph scanning based on a Trie tree structure and generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence; dynamic programming is adopted to search the maximum-probability path and find the maximum segmentation combination based on word frequency; and for unknown words, an HMM model based on the word-forming capability of Chinese characters is adopted together with the Viterbi algorithm. The initial text is segmented after loading and adjusting the dictionary, and keywords can then be extracted based on the TF-IDF algorithm or the TextRank algorithm.
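A minimal sketch of this segmentation step with the open-source jieba library follows; the example sentence and the added dictionary word are illustrative assumptions.

```python
# Minimal sketch: segment the initial text and extract keywords with jieba.
# The sentence and the added word are illustrative assumptions.
import jieba
import jieba.analyse

jieba.add_word("人机交互")   # adjust the dictionary (a user dictionary file
                             # could also be loaded via jieba.load_userdict)
text = "人机交互系统把用户输入的文本转换成多个准确的表达"

participles = jieba.lcut(text)   # text participles of the initial text
print(participles)

print(jieba.analyse.extract_tags(text, topK=3))  # TF-IDF keyword extraction
print(jieba.analyse.textrank(text, topK=3))      # TextRank keyword extraction
```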
202. Searching text word vectors of the text participles according to a preset word vector algorithm.
The preset word vector algorithm may be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm and the like. The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and solves for the vector corresponding to each word during training; hierarchical softmax and negative sampling are adopted to accelerate the training of the word2vector algorithm. The preset word2vector algorithm is a trained model algorithm, through which the text word vectors of the text participles can be looked up directly.
203. Splicing the text word vectors and the relative position vectors of the text word vectors to generate a spliced vector.
Each text word vector may be identified by its relative or absolute position in the initial text. With absolute positions, participles in different segments of the same long sentence can carry the same position information even though their actual positions differ, so the application adopts relative positions to distinguish each text word vector effectively. The relative position vector is a vector matrix whose element in row i and column j identifies the relative position from the i-th word to the j-th word. The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimensionality, so splicing is performed by adding the vectors directly according to the rules of matrix arithmetic.
204. Calculating the factorization vector of the spliced vector according to the word order probability of the spliced vector.
For a better understanding of the present scheme, the word order probability is illustrated as follows. Given a sequence x of length T, there are T! permutations in total, corresponding to T! chain decompositions. Assuming the spliced vector is x = x1x2x3, there are 3! = 6 decompositions in total, where p(x2|x1x3) refers to the probability that the second word is x2 under the condition that the first word is x1 and the third word is x3; that is, the original word order is preserved. By traversing all T! decompositions with shared model parameters, the context relationship can be learned while extracting the factorization vector. An ordinary left-to-right or right-to-left language model can only learn dependency in one direction: for example, first "guess" a word, then "guess" the second word based on the first, and the third based on the first two. The permutation language model instead learns word order probabilities in various orders; for example, p(x) = p(x1|x3)p(x2|x1x3)p(x3) corresponds to the order 3 → 1 → 2: "guess" the third word first, then guess the first word based on the third, and finally guess the second word based on the first and third. If the context dependency is consistent with the text order, text in that order has a unique meaning, and a similar text obtained from that unique meaning is highly likely to be correct; the factorization vector of the spliced vector is therefore calculated according to the word order probability.
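The enumeration can be sketched as follows; the toy scorer standing in for p(word | context) is an assumption, since the patent does not specify how the conditional probabilities are computed.

```python
# Minimal sketch: traverse the T! chain decompositions and keep the
# arrangement with the maximum word order probability. The crc32-based
# scorer is a deterministic stand-in for a trained language model.
import itertools
import zlib

def cond_prob(word, context):
    # Stand-in for p(word | context); a trained model would go here.
    key = (word + "|" + ",".join(context)).encode("utf-8")
    return (zlib.crc32(key) % 999 + 1) / 1000.0

words = ["x1", "x2", "x3"]                       # T = 3, so 3! = 6 orders
best_order, best_p = None, -1.0
for order in itertools.permutations(range(len(words))):
    p, seen = 1.0, []
    for idx in order:                            # "guess" words in this order
        p *= cond_prob(words[idx], seen)
        seen.append(words[idx])
    if p > best_p:
        best_order, best_p = order, p

print(best_order, best_p)  # arrangement with the maximum word order probability
```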
Calculating the factorization vector of the spliced vector specifically comprises: calculating the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred; determining the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order; and combining adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
Assume the initial text includes 5 text participles x1, x2, x3, x4, x5, and the corresponding spliced vector includes 5 vector elements A1, A2, A3, A4, A5. The full permutation of the text participles includes 5! = 120 arrangements. Suppose the arrangement with the maximum word order probability is x3, x1, x2, x4, x5, with the formula P = p(x1|x3)p(x2|x1x3)p(x3)p(x4|x1x2x3)p(x5|x1x2x3x4); the participle semantic order is then x3, x1, x2, x4, x5, in which x1 and x2, as well as x4 and x5, are sequentially adjacent participle texts. The vector elements A1 and A2 in the corresponding spliced vector are therefore adjacent participle vectors, as are A4 and A5; A1 and A2 are combined into B1, A4 and A5 are combined into B2, and the spliced vector is factorized into B1, A3 and B2. This realizes dimension reduction of the spliced vector, reduces the data size, and improves training and calculation speed. If each element in the spliced vector carries a sequence number, adjacent participle vectors can be searched as follows: acquire the first element position identifier, in the spliced vector, of a first element at any position in the participle semantic order, and, in the preset order, the second element position identifier of the second element adjacent to the first element; then perform a self-increment step operation on the first element position identifier to obtain a predicted position identifier, the self-increment step being one sequence number of the spliced vector. If the predicted position identifier differs from the second element position identifier, a new first element position is acquired; if they are the same, the first element and the second element are determined to be adjacent participle vectors, the second element position identifier is redefined as the first element position identifier, the second element is taken as the first element at any position in the participle semantic order, and the above steps are repeated until all adjacent participle vectors in the spliced vector are found. An adjacent participle vector may comprise two elements, three elements, four elements and so on; the embodiment of the invention does not limit the number of elements an adjacent participle vector comprises.
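A minimal sketch of this merging step follows; numpy and summation as the combination operator are assumptions, since the patent does not specify how two adjacent participle vectors are combined.

```python
# Minimal sketch: factorize a spliced vector by merging vector elements whose
# text participles are sequentially adjacent in the participle semantic order.
import numpy as np

dim = 8
spliced = [np.random.randn(dim) for _ in range(5)]  # elements A1..A5
semantic_order = [3, 1, 2, 4, 5]                    # 1-indexed, max-probability order

def factorize(spliced, semantic_order):
    # Original positions whose semantic-order neighbour is the next original
    # position get merged into one factor (e.g. A1 and A2 -> B1).
    merge_next = {a for a, b in zip(semantic_order, semantic_order[1:])
                  if b == a + 1}
    factors, i = [], 1
    while i <= len(spliced):
        vec, j = spliced[i - 1], i
        while j in merge_next:          # extend a run of consecutive positions
            vec = vec + spliced[j]      # combine adjacent participle vectors
            j += 1
        factors.append(vec)
        i = j + 1
    return factors

factors = factorize(spliced, semantic_order)
print(len(factors))  # 3 -> B1 (A1+A2), A3, B2 (A4+A5)
```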
205. Extracting the attention features of the factorization vector according to a preset self-attention mechanism.
The extraction process of the self-attention feature comprises: calculating the similarity between the query and each key to obtain weights; normalizing the weights with a softmax function; and finally performing a weighted sum of the weights and the corresponding values to obtain the attention feature, where key and value are the same, i.e. key equals value. The factorization vector and the preset self-attention mechanism are used to extract the intent of the spliced vector, so as to obtain text codes with evidently identical intent.
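A minimal sketch of this step follows, assuming numpy and toy sizes; the 1/sqrt(d) scaling is a common convention added here, not something the patent specifies.

```python
# Minimal sketch of the self-attention step: similarity of query and key,
# softmax normalization, then a weighted sum over the values (key == value).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(factors):
    q = k = v = factors                 # query, key and value all equal here
    weights = softmax(q @ k.T / np.sqrt(factors.shape[1]))
    return weights @ v                  # weighted sum -> attention features

factors = np.random.randn(3, 8)          # e.g. B1, A3, B2 from the previous step
attention_features = self_attention(factors)
print(attention_features.shape)          # (3, 8)
```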
206. Randomly sampling the factorization vector based on the vector mean vector and the vector standard deviation vector of the factorization vector to generate a sampling sample.
A vector-quantized variational mechanism is adopted in this step to obtain randomly sampled samples of lower dimensionality. In the prior art, the input is converted into a vector code whose latent space may be discontinuous or allow only simple interpolation. In the bilingual translation task of machine translation, the encoder outputs a definite multidimensional feature tensor, and, owing to the particularity of the translation task, latent semantic features, grammatical features and text length all affect the accuracy and repeatability of translation. If, instead, the output of the encoder is not a definite multidimensional tensor but hidden features obeying a certain random distribution, randomly sampling those features to ensure the richness and diversity of language can improve the accuracy and repeatability of translation.
The process of generating the sampling sample specifically comprises: counting the vector mean vector and the vector standard deviation vector of the factorization vector; and randomly sampling the factorization vector according to the vector mean vector and the vector standard deviation vector to generate the sampling sample. The data distribution characteristics of the factorization vector are counted and summarized, and two vectors of the same size are output, namely the vector mean vector and the vector standard deviation vector. Data obeying this constraint are then randomly sampled based on the vector mean vector and the vector standard deviation vector; the latent space of samples drawn by such random sampling is continuous and allows interpolation.
Counting the vector mean vector and the vector standard deviation vector of the factorization vector specifically comprises: counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector; calculating the KL divergence of the first probability distribution function and the second probability distribution function; if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector; and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
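A minimal sketch of this statistics-and-sampling step follows, under the assumption that both preset probability distributions are diagonal Gaussians (the patent leaves the distribution families unspecified), for which the KL divergence has a closed form; sampling uses the standard reparameterization trick.

```python
# Minimal sketch, assuming diagonal Gaussians for both preset distributions.
import numpy as np

def kl_diag_gaussians(mu1, std1, mu2, std2):
    # Closed-form KL( N(mu1, std1^2) || N(mu2, std2^2) ), summed over dims.
    return np.sum(np.log(std2 / std1)
                  + (std1**2 + (mu1 - mu2)**2) / (2 * std2**2) - 0.5)

def sample(mu, std, rng=np.random.default_rng(0)):
    # Reparameterized random sampling from the vector mean / std vectors.
    return mu + std * rng.standard_normal(mu.shape)

factors = np.random.randn(3, 8)
mu, std = factors.mean(axis=0), factors.std(axis=0)   # mean / std vectors

kl = kl_diag_gaussians(mu, std, np.zeros_like(mu), np.ones_like(std))
print(kl >= 0)                  # KL divergence is non-negative; 0 iff identical
print(sample(mu, std).shape)    # (8,) -> one sampling sample
```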
After the sampling sample is generated, a residual neural network can be combined to avoid gradient explosion and gradient vanishing during back-propagation: the upper-layer input is added in before the activation of the second linear layer. This can reduce the cross entropy of the abstract representation while the decoder's gradient is updated and accelerate convergence.
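A minimal sketch of such a residual connection follows; numpy, ReLU and the toy sizes are assumptions.

```python
# Minimal sketch of the residual connection described above: the upper-layer
# input is added before the activation of the second linear layer.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

dim = 8
rng = np.random.default_rng(0)
W1 = rng.standard_normal((dim, dim)) * 0.1
W2 = rng.standard_normal((dim, dim)) * 0.1

def residual_block(x):
    h = relu(x @ W1)           # first linear layer plus activation
    return relu(x + h @ W2)    # add the upper-layer input, then activate

x = rng.standard_normal(dim)
print(residual_block(x).shape)  # (8,)
```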
207. Generating the characteristic word vector set of the initial text according to the sampling sample and the attention feature.
The characteristic word vector set is a set of text codes that are similar to, but not identical with, the sampling sample, generated on the basis of the sampling sample. Specifically, the characterization vector set of the initial text is generated according to a preset dimension adjustment rule, the preset dimension adjustment rule being characterized by: z_h = αe_h + (1 - α)q_h, where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
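A minimal sketch of the dimension adjustment rule follows; numpy and the fixed α value are assumptions (α is a learning parameter in practice).

```python
# Minimal sketch of z_h = alpha * e_h + (1 - alpha) * q_h, combining the
# attention features with the random sampling result.
import numpy as np

alpha = 0.7                     # learning parameter, fixed here for illustration
e_h = np.random.randn(3, 8)     # attention features
q_h = np.random.randn(3, 8)     # random sampling results

z_h = alpha * e_h + (1 - alpha) * q_h   # characterization vector set
print(z_h.shape)  # (3, 8)
```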
Steps 204 to 207 constitute the working process of the preset encoder, which maps the initial text to the token word vector set through the four layers of calculation above. The token word vector set refers to a set of word vector tensors that have the same intent as the original text words but express different high-dimensional spaces. The invention aims to output a rich and diverse set of texts without changing the text meaning, so as to complete text paraphrase of the initial text and to collect a large amount of similar-text data for supervised-learning tasks in natural language processing such as text summarization and machine translation.
208. Inputting the token word vector set into a preset decoder and resolving the similar text of the initial text.
The role of the preset decoder is the inverse of the preset encoder: it converts fixed-length variables into an output sequence of indefinite length. The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. Illustratively, machine translation is a generative task, while synonym judgment is a sequence task. The token word vector set is taken as input, resolved by the preset decoder, and the similar text is output.
The invention provides a method for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention provides a device for generating a similar text, as shown in fig. 3, where the device includes:
an obtaining module 31, configured to obtain text participles of an initial text;
the searching module 32 is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generating module 33 is configured to splice the text word vector and the relative position vector of the text word vector to generate a spliced vector;
a second generating module 34, configured to input the concatenation vector into a preset encoder, and generate a token word vector set of the initial text;
and the resolving module 35 is configured to input the set of token word vectors into a preset decoder, and resolve the similar text of the initial text.
The invention provides a device for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention provides another similar text generation apparatus, as shown in fig. 4, where the apparatus includes:
an obtaining module 41, configured to obtain text participles of an initial text;
the searching module 42 is configured to search a text word vector of the text word segmentation according to a preset word vector algorithm;
a first generating module 43, configured to splice the text word vector and the relative position vector of the text word vector to generate a spliced vector;
a second generating module 44, configured to input the concatenation vector into a preset encoder, and generate a token word vector set of the initial text;
and the resolving module 45 is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
Further, the obtaining module 41 includes:
an input unit 411, configured to input the initial text into a preset jieba word segmentation model;
an obtaining unit 412, configured to acquire the text participles output by the jieba word segmentation model.
Further, the second generating module 44 includes:
a calculating unit 441, configured to calculate the factorization vector of the spliced vector according to the word order probability of the spliced vector, wherein the spliced vector is stored in a blockchain;
It is emphasized that, in order to further ensure the privacy and security of the spliced vector, the spliced vector may also be stored in a node of a blockchain.
An extracting unit 442, configured to extract attention features of the factorized vector according to a preset self-attention mechanism;
a sampling unit 443 configured to randomly sample the factorized vector based on a vector mean vector and a vector standard deviation vector of the factorized vector to generate a sampling sample;
a generating unit 444, configured to generate the characteristic word vector set of the initial text according to the sampling sample and the attention feature.
Further, the calculation unit 441 includes:
a calculating subunit 4411, configured to calculate the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred;
a determining subunit 4412, configured to determine the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order;
a generating subunit 4413, configured to combine adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
Further, the sampling unit 443 includes:
a statistics subunit 4431, configured to count a vector mean vector and a vector standard deviation vector of the factorized vectors;
a sampling subunit 4432, configured to randomly sample the factorization vector according to the vector mean vector and the vector standard deviation vector to generate a sampling sample.
Further, the statistics subunit 4431 is configured to:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector;
calculating the KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector;
and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
Further, the generating unit 444 is configured to:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the preset dimension adjustment rule is characterized by:
z_h = αe_h + (1 - α)q_h
where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
The invention provides a device for generating similar texts: firstly, the text participles of the initial text are acquired; then the text word vectors of the text participles are searched according to a preset word vector algorithm; the text word vectors and the relative position vectors of the text word vectors are spliced to generate a spliced vector; the spliced vector is input into a preset encoder to generate the characteristic word vector set of the initial text; and finally the characteristic word vector set is input into a preset decoder to resolve the similar text of the initial text. Compared with the prior art, the embodiment of the invention takes the spliced vector of the relative position vectors and the text word vectors as input and generates the token word vector set of the initial text through the preset encoder. The relative position vectors give each text participle a "context" relationship, so that the position information carried by participles in different segments of the same long sentence is effectively distinguished; this improves the relevance of contexts and further improves the semantic similarity between the similar text and the initial text.
According to an embodiment of the present invention, a computer storage medium is provided; the computer storage medium stores at least one executable instruction, and the executable instruction can cause a processor to execute the method for generating similar text in any of the above method embodiments.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computer device.
As shown in fig. 5, the computer apparatus may include: a processor (processor)502, a Communications Interface 504, a memory 506, and a communication bus 508.
Wherein: the processor 502, communication interface 504, and memory 506 communicate with one another via a communication bus 508.
A communication interface 504 for communicating with network elements of other devices, such as clients or other servers.
The processor 502 is configured to execute the program 510, and may specifically perform relevant steps in the above embodiment of the method for generating a similar text.
In particular, program 510 may include program code that includes computer operating instructions.
The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the invention. The computer device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And a memory 506 for storing a program 510. The memory 506 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 510 may specifically be used to cause the processor 502 to perform the following operations:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for generating similar texts, comprising:
acquiring text participles of an initial text;
searching text word vectors of the text word segmentation according to a preset word vector algorithm;
splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text;
and inputting the set of the token word vectors into a preset decoder, and resolving similar texts of the initial text.
2. The method of claim 1, wherein the obtaining text participles of the initial text comprises:
inputting the initial text into a preset jieba word segmentation model;
and acquiring the text participles output by the jieba word segmentation model.
3. The method of claim 1, wherein the inputting the spliced vector into a preset encoder to generate a characteristic word vector set of the initial text comprises:
calculating a factorization vector of the spliced vector according to the word order probability of the spliced vector, wherein the spliced vector is stored in a blockchain;
extracting attention features of the factorization vectors according to a preset self-attention mechanism;
randomly sampling the factorization vector based on the vector mean vector and the vector standard deviation vector of the factorization vector to generate a sampling sample;
and generating a set of characterization word vectors of the initial text according to the sampling samples and the attention features.
4. The method of claim 3, wherein the calculating a factorization vector of the spliced vector according to the word order probability of the spliced vector comprises:
calculating the word order probability of the initial text according to the spliced vector, wherein the word order probability refers to the conditional probability of each arrangement in the full permutation of the text participles, the condition being that all participles arranged before the current participle in that arrangement have occurred;
determining the arrangement of the text participles corresponding to the maximum word order probability as the participle semantic order;
and combining adjacent participle vectors to generate the factorization vector of the spliced vector, wherein adjacent participle vectors refer to vector elements in the spliced vector corresponding to text participles that are sequentially adjacent in the participle semantic order.
5. The method of claim 3, wherein the randomly sampling the factorization vector based on a vector mean vector and a vector standard deviation vector of the factorization vector to generate a sampling sample comprises:
counting the vector mean vector and the vector standard deviation vector of the factorization vector;
and randomly sampling the factorization vector according to the vector mean vector and the vector standard deviation vector to generate the sampling sample.
6. The method of claim 5, wherein the counting the vector mean vector and the vector standard deviation vector of the factorization vector comprises:
counting a first probability distribution function of the factorization vector according to a first preset probability distribution formula, and counting a second probability distribution function of the factorization vector according to a second preset probability distribution formula, wherein the dependent variables of the first probability distribution function comprise a first mean vector and a first standard deviation vector, and the dependent variables of the second probability distribution function comprise a second mean vector and a second standard deviation vector;
calculating the KL divergence of the first probability distribution function and the second probability distribution function;
if the KL divergence equals 0, determining that the factorization vector obeys the first or second probability distribution function, that the vector mean vector is the first or second mean vector, and that the vector standard deviation vector is the first or second standard deviation vector;
and if the KL divergence does not equal 0, calculating the vector mean vector and the vector standard deviation vector from the factorization vector with minimization of the KL divergence as the objective.
7. The method of claim 3, wherein the generating a set of token word vectors of the initial text according to the sampling sample and the attention feature comprises:
generating a characterization vector set of the initial text according to a preset dimension adjustment rule, wherein the preset dimension adjustment rule is characterized by:
z_h = αe_h + (1 - α)q_h
where z_h is the characterization vector set, α is a learning parameter, e_h is the attention feature, and q_h is the random sampling result.
8. A device for generating similar text, comprising:
the acquisition module is used for acquiring text participles of the initial text;
the searching module is used for searching the text word vector of the text word segmentation according to a preset word vector algorithm;
the first generation module is used for splicing the text word vector and the relative position vector of the text word vector to generate a spliced vector;
the second generation module is used for inputting the splicing vector into a preset encoder to generate a characteristic word vector set of the initial text;
and the resolving module is used for inputting the vector set of the characterization words into a preset decoder and resolving the similar text of the initial text.
9. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the method for generating similar text according to any one of claims 1-7.
10. A computer device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the generation method of the similar text according to any one of claims 1-7.
CN202010341544.XA 2020-04-27 2020-04-27 Similar text generation method and device Active CN111680494B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device
PCT/CN2020/117946 WO2021218015A1 (en) 2020-04-27 2020-09-25 Method and device for generating similar text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341544.XA CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Publications (2)

Publication Number Publication Date
CN111680494A true CN111680494A (en) 2020-09-18
CN111680494B CN111680494B (en) 2023-05-12

Family

ID=72452258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341544.XA Active CN111680494B (en) 2020-04-27 2020-04-27 Similar text generation method and device

Country Status (2)

Country Link
CN (1) CN111680494B (en)
WO (1) WO2021218015A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114338129B (en) * 2021-12-24 2023-10-31 中汽创智科技有限公司 Message anomaly detection method, device, equipment and medium
CN114742029B (en) * 2022-04-20 2022-12-16 中国传媒大学 Chinese text comparison method, storage medium and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106802888A (en) * 2017-01-12 2017-06-06 北京航空航天大学 Term vector training method and device
CN109145315A (en) * 2018-09-05 2019-01-04 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
JP2019109654A (en) * 2017-12-18 2019-07-04 ヤフー株式会社 Similar text extraction device, automatic response system, similar text extraction method, and program
CN110110045A (en) * 2019-04-26 2019-08-09 腾讯科技(深圳)有限公司 A kind of method, apparatus and storage medium for retrieving Similar Text
CN110147535A (en) * 2019-04-18 2019-08-20 平安科技(深圳)有限公司 Similar Text generation method, device, equipment and storage medium
CN110209801A (en) * 2019-05-15 2019-09-06 华南理工大学 A kind of text snippet automatic generation method based on from attention network
CN110362684A (en) * 2019-06-27 2019-10-22 腾讯科技(深圳)有限公司 A kind of file classification method, device and computer equipment
CN110399454A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of text code representation method based on transformer model and more reference systems
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
KR20200015418A (en) * 2018-08-02 2020-02-12 네오사피엔스 주식회사 Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201706047D0 (en) * 2017-04-14 2017-05-31 Digital Genius Ltd Automated tagging of text
CN110135507A (en) * 2019-05-21 2019-08-16 西南石油大学 A kind of label distribution forecasting method and device
CN110619127B (en) * 2019-08-29 2020-06-09 内蒙古工业大学 Mongolian Chinese machine translation method based on neural network turing machine
CN111680494B (en) * 2020-04-27 2023-05-12 平安科技(深圳)有限公司 Similar text generation method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Linfeng Song et al.: "A Graph-to-Sequence Model for AMR-to-Text Generation", arXiv:1805.02473v3 *
Sam Wiseman et al.: "Learning Neural Templates for Text Generation", arXiv:1808.10122v3 *
Wu Renshou et al.: "Short Text Summarization Method with Global Self-Matching Mechanism", Journal of Software *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021218015A1 (en) * 2020-04-27 2021-11-04 平安科技(深圳)有限公司 Method and device for generating similar text
CN112395385A (en) * 2020-11-17 2021-02-23 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112395385B (en) * 2020-11-17 2023-07-25 中国平安人寿保险股份有限公司 Text generation method and device based on artificial intelligence, computer equipment and medium
CN112580352A (en) * 2021-03-01 2021-03-30 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN112580352B (en) * 2021-03-01 2021-06-04 腾讯科技(深圳)有限公司 Keyword extraction method, device and equipment and computer storage medium
CN113822034A (en) * 2021-06-07 2021-12-21 腾讯科技(深圳)有限公司 Method and device for repeating text, computer equipment and storage medium
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN114357974A (en) * 2021-12-28 2022-04-15 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021218015A1 (en) 2021-11-04
CN111680494B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN111680494B (en) Similar text generation method and device
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN106502985B (en) neural network modeling method and device for generating titles
Zhang et al. Deep Neural Networks in Machine Translation: An Overview.
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN112464641A (en) BERT-based machine reading understanding method, device, equipment and storage medium
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN111428490B (en) Reference resolution weak supervised learning method using language model
CN112232087A (en) Transformer-based specific aspect emotion analysis method of multi-granularity attention model
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN114064117A (en) Code clone detection method and system based on byte code and neural network
Yu et al. Make it directly: Event extraction based on tree-LSTM and bi-GRU
CN112464655A (en) Word vector representation method, device and medium combining Chinese characters and pinyin
CN115687567A (en) Method for searching similar long text by short text without marking data
CN115437626A (en) OCL statement automatic generation method and device based on natural language
CN112528653B (en) Short text entity recognition method and system
CN112989829A (en) Named entity identification method, device, equipment and storage medium
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
Acharjee et al. Sequence-to-sequence learning-based conversion of pseudo-code to source code using neural translation approach
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant