WO2021218015A1 - Method and apparatus for generating similar text - Google Patents
Method and apparatus for generating similar text (相似文本的生成方法及装置)
- Publication number
- WO2021218015A1 (PCT application PCT/CN2020/117946)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vector
- text
- word
- splicing
- preset
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- This application relates to the technical field of semantic analysis, in particular to a method and device for generating similar text.
- The text information entered by a user, or obtained through speech-to-text conversion, may not convey the meaning the user actually intends to express.
- To avoid misinterpretation, the user input is often converted into a variety of accurate expressions by training in a bilingual or multilingual environment.
- However, bilingual translation models run into problems of syntactic and semantic deviation and of text alignment.
- The inventor realizes that, in the prior art, a current similar text of the initial text is calculated with a first neural network model, and a current discrimination probability of the initial text and the current similar text is then calculated with a second neural network model. It is then judged whether the current discrimination probability equals a preset probability value; if not, the first neural network model is optimized according to a preset model optimization strategy and the current similar text is recalculated with the optimized model. This judgment loops until the calculated discrimination probability equals the preset probability value, at which point the current similar text is taken as the target similar text.
- The inventor of this application found in research that this prior-art scheme calculates similar texts with neural networks whose judgment relies mainly on the model parameters of the first and second neural network models.
- Those model parameters are obtained from training data; the calculated similar text therefore depends heavily on the training data and only weakly on the initial text, so the actual semantics of the similar text and the initial text can easily end up differing.
- the present application provides a method and device for generating similar texts, the main purpose of which is to solve the problem in the prior art that the actual semantics of similar texts and initial texts are not exactly the same.
- a method for generating similar text, including: acquiring the text segmentation of an initial text; searching for the text word vectors of the text segmentation according to a preset word vector algorithm; splicing the text word vectors with the relative position vectors of the text word vectors to generate a splicing vector; inputting the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text; and inputting the set of characterizing word vectors into a preset decoder to solve for similar text of the initial text.
- an apparatus for generating similar texts including:
- the acquisition module is used to acquire the text segmentation of the initial text
- the search module is used to search for the text word vector of the text segmentation according to the preset word vector algorithm
- the first generating module is used for splicing the text word vector and the relative position vector of the text word vector to generate a splicing vector
- the second generating module is configured to input the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text;
- the solving module is configured to input the set of characterizing word vectors into a preset decoder to solve similar texts of the initial text.
- a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the following steps: acquiring the text segmentation of the initial text; searching for the text word vectors of the text segmentation according to a preset word vector algorithm; splicing the text word vectors with their relative position vectors to generate a splicing vector; inputting the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text; and inputting the set of characterizing word vectors into a preset decoder to solve for similar text of the initial text.
- a computer device including: a processor, a memory, a communication interface, and a communication bus.
- the processor, the memory, and the communication interface complete mutual communication through the communication bus.
- the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the same steps: acquiring the text segmentation of the initial text; searching for its text word vectors with a preset word vector algorithm; splicing them with their relative position vectors to generate a splicing vector; inputting the splicing vector into a preset encoder to generate a set of characterizing word vectors; and inputting that set into a preset decoder to solve for similar text of the initial text.
- The splicing vector of the relative position vector and the text word vector is taken as input, and the set of characterizing word vectors of the initial text is generated by a preset encoder. The relative position vector gives every text segment a "context" relationship, so that words in different segments of the same long sentence carry consistent position information; this improves contextual relevance and, in turn, the semantic similarity between the similar text and the initial text.
- Fig. 1 shows a flowchart of a method for generating similar text provided by an embodiment of the present application
- Figure 2 shows a flowchart of another method for generating similar text provided by an embodiment of the present application
- Fig. 3 shows a block diagram of a similar text generation device provided by an embodiment of the present application
- Figure 4 shows a block diagram of another device for generating similar text provided by an embodiment of the present application
- Fig. 5 shows a schematic structural diagram of a computer device provided by an embodiment of the present application.
- The technical solution of the present application can also be applied in the fields of artificial intelligence, blockchain, and/or big data technology; for example, it can be implemented through a data platform or other devices to improve the semantic similarity between similar texts and initial texts.
- the embodiment of the present application provides a method for generating similar text. As shown in FIG. 1, the method includes:
- the initial text refers to the text entered by the user or the text after voice conversion.
- Word segmentation refers to the process of recombining consecutive word sequences into word sequences according to certain specifications.
- The initial text is segmented; word segmentation methods based on string matching, on understanding, or on statistics can all be used, and the embodiment of this application does not limit which segmentation method is chosen.
- The preset word vector algorithm can be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm, and so on.
- The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and derives, during training, the vector corresponding to each word. Hierarchical softmax and negative sampling are both used to accelerate the training of the word2vector algorithm.
- The preset word2vector algorithm is a model that has already been trained, so the text word vector of each text segment can be looked up directly through it.
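- For illustration, a minimal sketch of such a lookup using the gensim library is shown below; gensim (with its 4.0+ parameter names) and the toy corpus are assumptions made here, since the patent only requires an already-trained word-vector model.

```python
# Train a tiny word2vec model and look up the text word vector of one segment.
from gensim.models import Word2Vec

corpus = [["获取", "初始", "文本"], ["生成", "初始", "文本", "的", "相似", "文本"]]  # toy corpus
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1,
                 sg=1, hs=1, negative=5)   # hierarchical softmax + negative sampling
text_word_vector = model.wv["文本"]         # text word vector of the segment "文本"
```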
- Each text word vector can be identified by its relative position or its absolute position in the initial text. Using absolute positions would cause words in different segments of the same long sentence to carry the same position information, when in fact their position information should differ; this application therefore uses relative positions to distinguish each text word vector effectively.
- The relative position vector is a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word.
- The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so the splicing is performed by direct addition according to the operation rules of matrices.
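- As a rough sketch of this splicing step, the snippet below adds a relative-position vector to each text word vector by plain matrix addition. The random embedding table and the pooling of each row of the relative-position matrix into one per-word vector are assumptions made for illustration; the text only states that the two vectors correspond one-to-one and are added.

```python
import numpy as np

def splice(word_vecs, max_dist=8, rng=np.random.default_rng(0)):
    """Splice text word vectors with relative position vectors by direct addition."""
    n, d = word_vecs.shape
    table = rng.normal(scale=0.02, size=(2 * max_dist + 1, d))  # stand-in for a trained table
    # offsets[i, j] identifies the relative position between word i and word j
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                      -max_dist, max_dist) + max_dist
    rel = table[offsets].mean(axis=1)   # (n, d): one relative position vector per word
    return word_vecs + rel              # splicing = element-wise (matrix) addition

splicing_vector = splice(np.random.default_rng(1).normal(size=(5, 16)))
```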
- The function of the preset encoder is to transform an input sequence of indefinite length into a fixed-length variable, which is usually realized with a recurrent neural network; that is, the splicing vector is converted into a set of synonymous characterizing word vectors.
- The set of characterizing word vectors refers to a collection of word vector tensors that carry the same intent as the words of the initial text but express it in different high-dimensional spaces.
- The preset encoder may adopt a deep neural network, a recursive variational network, a sum-product network, and so on; the specific method adopted by the preset encoder is not limited in the embodiments of this application.
- The purpose of this application is to output a rich and diverse collection of texts without changing the meaning of the text, so as to paraphrase the initial text and collect a large amount of similar-text data for natural language processing tasks that require supervised learning, such as text summarization and machine translation.
- The role of the preset decoder is the opposite of that of the preset encoder: it is the reverse process, used to convert fixed-length variables into output sequences of variable length.
- The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. For example, machine translation is a generative task, and judging synonyms is a sequence task. Taking the set of characterizing word vectors as input, the decoder solves for and outputs the similar text.
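- The toy sketch below only illustrates the decoder's role of turning the fixed-length characterizing representation back into a variable-length word sequence; the single-matrix recurrence, greedy word choice, and random parameters are all stand-ins, not the patent's actual decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 50, 16
emb = rng.normal(size=(vocab, dim))                     # toy output-word embeddings
W_state, W_out = rng.normal(size=(dim, dim)), rng.normal(size=(dim, vocab))

def greedy_decode(characterizing_vectors, max_len=20, eos_id=0):
    state = characterizing_vectors.mean(axis=0)         # start from the fixed-length code
    output_ids = []
    for _ in range(max_len):
        state = np.tanh(state @ W_state)                # recurrent state update
        next_id = int(np.argmax(state @ W_out))         # greedy choice of the next word
        if next_id == eos_id:
            break
        output_ids.append(next_id)
        state = state + emb[next_id]                    # feed the chosen word back in
    return output_ids                                   # variable-length similar-text ids

similar_text_ids = greedy_decode(rng.normal(size=(4, dim)))
```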
- This application provides a method for generating similar text: first the text segmentation of the initial text is obtained; then the text word vectors of the segments are looked up with a preset word vector algorithm; the text word vectors and their relative position vectors are spliced to generate a splicing vector; the splicing vector is input into a preset encoder to generate the set of characterizing word vectors of the initial text; and finally that set is input into a preset decoder to solve for the similar text of the initial text.
- Compared with the prior art, the embodiment of the present application takes the splicing vector of the relative position vector and the text word vector as input and generates the set of characterizing word vectors of the initial text through a preset encoder. The relative position vector gives every text segment a "context" relationship, so that words in different segments of the same long sentence carry consistent position information; this improves contextual relevance and, in turn, the semantic similarity between the similar text and the initial text.
- the embodiment of the present application provides another method for generating similar text. As shown in FIG. 2, the method includes:
- the initial text refers to the text entered by the user or the text after voice conversion.
- Word segmentation refers to the process of recombining continuous word sequences into word sequences according to certain specifications.
- The initial text is segmented by inputting it into a preset stuttering (jieba) word segmentation model and obtaining the text segmentation output by that model.
- Stuttering (jieba) Chinese word segmentation performs efficient word-graph scanning based on a Trie tree structure to build a directed acyclic graph of all possible word formations of the Chinese characters in the sentence; dynamic programming is then used to find the maximum-probability path and the maximum segmentation combination based on word frequency.
- For unregistered words, an HMM model based on the word-forming ability of Chinese characters is used, together with the Viterbi algorithm. By loading the dictionary, adjusting the dictionary, and then extracting keywords with the TF-IDF algorithm or the TextRank algorithm, the initial text is segmented.
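- A short sketch of this segmentation step with the jieba library is given below; the sample sentence and the commented-out user-dictionary path are placeholders.

```python
import jieba
import jieba.analyse

text = "获取初始文本的文本分词并生成相似文本"
# jieba.load_userdict("user_dict.txt")                       # optional: load/adjust the dictionary
segments = jieba.lcut(text)                                   # word graph + max-probability path (+ HMM for unregistered words)
tfidf_keywords = jieba.analyse.extract_tags(text, topK=3)     # TF-IDF keyword extraction
textrank_keywords = jieba.analyse.textrank(text, topK=3)      # TextRank keyword extraction
```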
- Then the text word vectors of the text segmentation are looked up.
- The preset word vector algorithm can be a matrix-factorization-based method, a shallow-window-based method, the word2vector algorithm, and so on.
- The word2vector algorithm trains an N-gram language model through a neural network machine learning algorithm and derives, during training, the vector corresponding to each word. Hierarchical softmax and negative sampling are both used to accelerate the training of the word2vector algorithm.
- The preset word2vector algorithm is a model that has already been trained, so the text word vector of each text segment can be looked up directly through it.
- Each text word vector can be identified by its relative position or its absolute position in the initial text. Using absolute positions would cause words in different segments of the same long sentence to carry the same position information, when in fact their position information should differ; this application therefore uses relative positions to distinguish each text word vector effectively.
- The relative position vector is a vector matrix whose element in row i and column j identifies the relative position between the i-th word and the j-th word.
- The relative position vectors correspond one-to-one with the text word vectors and are high-dimensional vectors of the same dimension, so the splicing is performed by direct addition according to the operation rules of matrices.
- the word order probability is used to calculate the factorization vector of the splicing vector
- Calculating the factorization vector of the splicing vector specifically includes: calculating the word order probability of the initial text according to the splicing vector, where the word order probability is the conditional probability of each arrangement in the full permutation of the text segments and the condition is that all segments arranged before the current segment in that arrangement have occurred; determining that the arrangement of text segments corresponding to the maximum word order probability is the semantic word order of the segments; and merging adjacent segment vectors to generate the factorization vector of the splicing vector, where adjacent segment vectors are the vector elements of the splicing vector corresponding to text segments that are sequentially adjacent in the semantic word order.
- For example, suppose the initial text includes 5 text segments x1, x2, x3, x4, x5 and the corresponding splicing vector includes 5 vector elements A1, A2, A3, A4, A5. Fully permuting the text segments gives 5! = 120 arrangements, and the arrangement with the maximum word order probability, computed as P = p(x1|x3)·p(x2|x1x3)·p(x3)·p(x4|x1x2x3)·p(x5|x1x2x3x4), is x3, x1, x2, x4, x5, which is taken as the semantic word order.
- In this order, x1 and x2, as well as x4 and x5, are sequentially adjacent segments, so the corresponding vector elements A1 and A2 are adjacent segment vectors, as are A4 and A5. Merging A1 and A2 into B1 and A4 and A5 into B2 gives the factorization vector A3, B1, B2. This reduces the dimensionality of the splicing vector, cuts the amount of data, and speeds up training and computation.
- If the elements of the splicing vector are numbered sequentially, adjacent segment vectors can be found as follows (see the sketch below): obtain the position identifier, within the splicing vector, of a first element taken from any position in the semantic word order, and the position identifier of the second element adjacent to it in the preset order; then apply a self-increment step to the first element's position identifier to obtain a predicted position identifier, where the self-increment step is the numbering interval of the splicing vector's sequential numbers; if the predicted position identifier differs from the second element's position identifier, re-acquire a first element position.
- If the predicted position identifier is the same as the second element's position identifier, the first and second elements are determined to be adjacent segment vectors; the second element's position identifier is then redefined as the first element's position identifier, the second element is taken as the new first element at an arbitrary position in the semantic word order, and the above steps are repeated until all adjacent segment vectors in the splicing vector have been found.
- An adjacent segment vector may comprise two elements, three elements, four elements, and so on; the number of elements it contains is not limited in the embodiments of this application.
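- A compact sketch of this factorization step under stated assumptions: the word-order scoring function is treated as a black box (in the patent it comes from the permutation language model), merging is done by simple addition, and only pairs of adjacent elements are merged.

```python
import itertools
import numpy as np

def factorization_vector(elements, order_log_prob):
    """elements: splicing-vector elements A1..An; order_log_prob: callable
    scoring one permutation of segment indices (assumed black box)."""
    n = len(elements)
    # Word order probability: pick the highest-scoring arrangement of the full permutation.
    semantic_order = max(itertools.permutations(range(n)), key=order_log_prob)
    merged, i = [], 0
    while i < n:
        # Adjacent segment vectors: consecutive in the semantic order and in the original numbering.
        if i + 1 < n and semantic_order[i] + 1 == semantic_order[i + 1]:
            merged.append(elements[semantic_order[i]] + elements[semantic_order[i + 1]])
            i += 2
        else:
            merged.append(elements[semantic_order[i]])
            i += 1
    return merged

# Toy usage with a dummy scoring function.
vecs = [np.full(4, float(k)) for k in range(1, 6)]
factorized = factorization_vector(vecs, lambda p: -sum(abs(a - b) for a, b in zip(p, p[1:])))
```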
- The extraction of self-attention features includes: computing the similarity between the query and each key to obtain weights; normalizing the weights with a softmax function; and finally taking the weighted sum of the weights and the corresponding values to obtain the attention feature, where the key and the value are the same (key = value). Through the factorization vector and the preset self-attention mechanism, the intent of the splicing vector is extracted so as to obtain text encodings with the same explicit intent.
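- The following sketch implements exactly that weighting scheme with key = value; the scaling by the square root of the dimension is a common convention assumed here rather than something stated in the text.

```python
import numpy as np

def self_attention(queries, keys):
    """queries, keys: (n, d) arrays derived from the factorization vector; value = key."""
    d = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax normalization
    return weights @ keys                                  # weighted sum over the values (= keys)

attention_feature = self_attention(np.ones((3, 8)), np.ones((3, 8)))
```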
- This step adopts a vector-quantized variational mechanism and obtains randomly sampled samples of lower dimension.
- In the prior art, the input is converted into a vector encoding whose latent space may be discontinuous or may permit only simple interpolation.
- In the bilingual translation task of machine translation, the encoder outputs a definite multi-dimensional feature tensor; owing to the particularity of translation tasks, latent semantic features, grammatical features, and text length all affect the accuracy and paraphrasability of the translation. If the encoder instead outputs a random feature obeying some distribution, rather than a fixed multi-dimensional tensor, and random sampling is performed through that feature to ensure the richness and diversity of the language, the accuracy of translation and paraphrasing improves.
- The process of generating the sampling samples specifically includes: computing the vector average vector and the vector standard deviation vector of the factorization vector; and, according to the vector average vector and the vector standard deviation vector, randomly sampling the factorization vector to generate the sampling samples.
- The data distribution characteristics of the factorization vector are computed and then generalized, outputting two vectors of the same size, the vector average vector and the vector standard deviation vector. Based on these two vectors, data obeying this constraint are randomly sampled.
- The latent space of the randomly drawn sampling samples is continuous, and interpolation is allowed.
- Computing the vector average vector and the vector standard deviation vector of the factorization vector specifically includes: according to a first preset probability distribution formula, computing a first probability distribution function of the factorization vector, and according to a second preset probability distribution formula, computing a second probability distribution function of the factorization vector, where the dependent variables of the first probability distribution function include a first average vector and a first standard deviation vector and those of the second include a second average vector and a second standard deviation vector; computing the KL divergence of the first and second probability distribution functions; if the KL divergence equals 0, determining that the factorization vector obeys the first or the second probability distribution function, that the vector average vector is the first or the second average vector, and that the vector standard deviation vector is the first or the second standard deviation vector; and, if the KL divergence is not equal to 0, computing the vector average vector and the vector standard deviation vector from the factorization vector with the objective of minimizing the KL divergence.
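- A small sketch of this sampling step is shown below. Treating the two preset probability distribution formulas as diagonal Gaussians, and drawing the sample by the usual reparameterization mu + sigma * epsilon, are assumptions; the text does not name the distributions.

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL divergence between diagonal Gaussians N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return float(np.sum(np.log(sigma2 / sigma1)
                        + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2) - 0.5))

def sample_from_factorization(fact_vec, rng=np.random.default_rng(0)):
    """Random sampling based on the vector average vector and the vector standard deviation vector."""
    mu = fact_vec.mean(axis=0)                 # vector average vector
    sigma = fact_vec.std(axis=0) + 1e-8        # vector standard deviation vector
    return mu + sigma * rng.standard_normal(mu.shape)

fact_vec = np.random.default_rng(1).normal(size=(3, 8))    # toy factorization vector
sampling_sample = sample_from_factorization(fact_vec)
divergence = kl_gaussian(fact_vec.mean(axis=0), fact_vec.std(axis=0) + 1e-8,
                         np.zeros(8), np.ones(8))           # e.g., against a standard normal
```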
- After the sampling samples are generated, a residual neural network can also be combined to avoid gradient explosion and gradient vanishing during back-propagation.
- The upper-layer input is added back before the linearly transformed activation layer of the second layer, which can reduce the cross entropy of the abstract representation while the decoder performs gradient updates and accelerates the convergence rate.
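- A minimal residual block matching that description might look as follows; the ReLU activation and square weight shapes are assumptions made only so the sketch runs.

```python
import numpy as np

def residual_block(x, w1, b1, w2, b2):
    """Two linear layers; the upper-layer input x is added back before the second activation."""
    h = np.maximum(x @ w1 + b1, 0.0)             # first linear layer + activation
    return np.maximum(h @ w2 + b2 + x, 0.0)      # residual add, then second activation

rng = np.random.default_rng(0)
d = 8
out = residual_block(rng.normal(size=(4, d)),
                     rng.normal(size=(d, d)), np.zeros(d),
                     rng.normal(size=(d, d)), np.zeros(d))
```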
- The set of characterizing word vectors is, on the basis of the sampling samples, a set of text encodings that are similar to but not identical with the sampling samples.
- Steps 204-207 are equivalent to step 104 shown in Fig. 1, i.e., inputting the splicing vector into the preset encoder to generate the set of characterizing word vectors of the initial text.
- Steps 204-207 can be regarded as an encoding process comprising a factorization layer, a self-attention layer, a vector-quantized variational layer, and a fully connected layer; the set of characterizing word vectors of the initial text is obtained through these four layers of computation.
- The set of characterizing word vectors refers to a collection of word vector tensors that carry the same intent as the words of the initial text but express it in different high-dimensional spaces.
- The purpose of this application is to output a rich and diverse collection of texts without changing the meaning of the text, so as to paraphrase the initial text and collect a large amount of similar-text data for natural language processing tasks that require supervised learning, such as text summarization and machine translation.
- The role of the preset decoder is the opposite of that of the preset encoder: it is the reverse process, used to convert fixed-length variables into output sequences of variable length.
- The preset decoder is designed according to the downstream task; downstream tasks can be divided into generative tasks and sequence tasks. For example, machine translation is a generative task, and judging synonyms is a sequence task. Taking the set of characterizing word vectors as input, the decoder solves for and outputs the similar text.
- This application provides a method for generating similar text: first the text segmentation of the initial text is obtained; then the text word vectors of the segments are looked up with a preset word vector algorithm; the text word vectors and their relative position vectors are spliced to generate a splicing vector; the splicing vector is input into a preset encoder to generate the set of characterizing word vectors of the initial text; and finally that set is input into a preset decoder to solve for the similar text of the initial text.
- Compared with the prior art, the embodiment of the present application takes the splicing vector of the relative position vector and the text word vector as input and generates the set of characterizing word vectors of the initial text through a preset encoder. The relative position vector gives every text segment a "context" relationship, so that words in different segments of the same long sentence carry consistent position information; this improves contextual relevance and, in turn, the semantic similarity between the similar text and the initial text.
- an embodiment of the present application provides an apparatus for generating similar text.
- the apparatus includes:
- the obtaining module 31 is used to obtain the text segmentation of the initial text
- the search module 32 is configured to search for the text word vector of the text segmentation according to a preset word vector algorithm
- the first generating module 33 is configured to splice the text word vector and the relative position vector of the text word vector to generate a splicing vector
- the second generating module 34 is configured to input the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text;
- the solving module 35 is configured to input the set of characterizing word vectors into a preset decoder to solve similar texts of the initial text.
- This application provides a device for generating similar text, which first obtains the text segmentation of the initial text, then looks up the text word vectors of the segments with a preset word vector algorithm, splices the text word vectors with their relative position vectors to generate a splicing vector, inputs the splicing vector into a preset encoder to generate the set of characterizing word vectors of the initial text, and finally inputs that set into a preset decoder to solve for the similar text of the initial text.
- Compared with the prior art, the embodiment of the present application takes the splicing vector of the relative position vector and the text word vector as input and generates the set of characterizing word vectors of the initial text through a preset encoder. The relative position vector gives every text segment a "context" relationship, so that words in different segments of the same long sentence carry consistent position information; this improves contextual relevance and, in turn, the semantic similarity between the similar text and the initial text.
- an embodiment of the present application provides another device for generating similar text.
- the device includes:
- the obtaining module 41 is used to obtain the text segmentation of the initial text
- the searching module 42 is configured to search for the text word vector of the text segmentation according to a preset word vector algorithm
- the first generating module 43 is configured to splice the text word vector and the relative position vector of the text word vector to generate a splicing vector
- the second generating module 44 is configured to input the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text;
- the solving module 45 is configured to input the set of characterizing word vectors into a preset decoder to solve similar texts of the initial text.
- the acquisition module 41 includes:
- the input unit 411 is configured to input the initial text into a preset stuttering word segmentation model
- the acquiring unit 412 is configured to acquire the text segmentation output by the stuttering segmentation model.
- the second generating module 44 includes:
- the calculation unit 441 is configured to calculate the factorization vector of the splicing vector according to the word order probability of the splicing vector, where the splicing vector is stored in the blockchain;
- the splicing vector may also be stored in a node of a blockchain.
- the extraction unit 442 is configured to extract the attention feature of the factorized vector according to a preset self-attention mechanism
- a sampling unit 443, configured to randomly sample the factorized vector based on the vector average vector and the vector standard deviation vector of the factorized vector to generate sampling samples;
- the generating unit 444 is configured to generate a set of characterizing word vectors of the initial text according to the sampling sample and the attention feature.
- calculation unit 441 includes:
- the calculation sub-unit 4411 is configured to calculate the word order probability of the initial text according to the splicing vector, where the word order probability refers to the conditional probability of each arrangement method in which the text segmentation is fully arranged, and the conditional probability is The occurrence condition is that all the participles arranged before the current participle according to the arrangement method all occur;
- the determining subunit 4412 is configured to determine that the arrangement order of the text segmentation corresponding to the maximum value of the word order probability is the segmentation order;
- the generating subunit 4413 is configured to merge adjacent word segmentation vectors to generate the factorization vector of the splicing vector, where the adjacent word segmentation vectors are the vector elements of the splicing vector corresponding to text segments that are sequentially adjacent in the semantic word order.
- sampling unit 443 includes:
- the statistics subunit 4431 is used to count the vector average vector and the vector standard deviation vector of the factorized vector
- the sampling subunit 4432 is configured to randomly sample the factorized vector according to the vector average vector and the vector standard deviation vector to generate sampling samples.
- statistics subunit 4431 is used for:
- according to a first preset probability distribution formula, compute the first probability distribution function of the factorization vector, and according to a second preset probability distribution formula, compute the second probability distribution function of the factorization vector, where the dependent variables of the first probability distribution function include a first average vector and a first standard deviation vector and those of the second probability distribution function include a second average vector and a second standard deviation vector;
- compute the KL divergence of the first probability distribution function and the second probability distribution function; if the KL divergence equals 0, determine that the factorization vector obeys the first or the second probability distribution function, that the vector average vector is the first or the second average vector, and that the vector standard deviation vector is the first or the second standard deviation vector;
- if the KL divergence is not equal to 0, compute the vector average vector and the vector standard deviation vector from the factorization vector with the objective of minimizing the KL divergence.
- the generating unit 444 is configured to:
- generate the set of characterization vectors of the initial text according to a preset dimension adjustment rule, where the rule is characterized as z_h = α·e_h + (1-α)·q_h, in which z_h is the set of characterization vectors, α is the learning parameter, e_h is the attention feature, and q_h is the random sampling result.
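- In code, this dimension adjustment rule is just a convex combination; the value of the learning parameter below is a placeholder.

```python
import numpy as np

def dimension_adjust(e_h, q_h, alpha=0.5):
    """z_h = alpha * e_h + (1 - alpha) * q_h; alpha is learned in the patent, 0.5 is a placeholder."""
    return alpha * e_h + (1.0 - alpha) * q_h

z_h = dimension_adjust(np.ones((4, 8)), np.zeros((4, 8)))  # toy attention feature and sampling result
```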
- This application provides a device for generating similar text, which first obtains the text segmentation of the initial text, then looks up the text word vectors of the segments with a preset word vector algorithm, splices the text word vectors with their relative position vectors to generate a splicing vector, inputs the splicing vector into a preset encoder to generate the set of characterizing word vectors of the initial text, and finally inputs that set into a preset decoder to solve for the similar text of the initial text.
- Compared with the prior art, the embodiment of the present application takes the splicing vector of the relative position vector and the text word vector as input and generates the set of characterizing word vectors of the initial text through a preset encoder. The relative position vector gives every text segment a "context" relationship, so that words in different segments of the same long sentence carry consistent position information; this improves contextual relevance and, in turn, the semantic similarity between the similar text and the initial text.
- A computer storage medium stores at least one executable instruction, and the computer-executable instruction can cause execution of the method for generating similar text in any of the foregoing method embodiments.
- the storage medium (computer storage medium) involved in this application may be a computer-readable storage medium, and the storage medium may be nonvolatile or volatile.
- FIG. 5 shows a schematic structural diagram of a computer device according to an embodiment of the present application, and the specific embodiment of the present application does not limit the specific implementation of the computer device.
- the computer device may include: a processor (processor) 502, a communication interface (Communications Interface) 504, a memory (memory) 506, and a communication bus 508.
- the processor 502, the communication interface 504, and the memory 506 communicate with each other through the communication bus 508.
- the communication interface 504 is used to communicate with other devices, such as network elements such as clients or other servers.
- the processor 502 is configured to execute at least one executable instruction, such as a program 510, and specifically can execute relevant steps in the foregoing embodiment of the method for generating similar text.
- the program 510 may include program code, and the program code includes a computer operation instruction.
- the processor 502 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
- the one or more processors included in the computer device may be the same type of processor, such as one or more CPUs, or different types of processors, such as one or more CPUs and one or more ASICs.
- the memory 506 is used to store at least one executable instruction such as a program 510.
- the memory 506 may include a high-speed RAM memory, and may also include a non-volatile memory (non-volatile memory), for example, at least one disk memory.
- The program 510 may specifically be used to cause the processor 502 to perform the following operations: acquiring the text segmentation of the initial text; searching for the text word vectors of the text segmentation according to a preset word vector algorithm; splicing the text word vectors with their relative position vectors to generate a splicing vector; inputting the splicing vector into a preset encoder to generate the set of characterizing word vectors of the initial text; and inputting the set of characterizing word vectors into a preset decoder to solve for the similar text of the initial text.
- The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
- A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- modules or steps of this application can be implemented by a general computing device, and they can be concentrated on a single computing device or distributed in a network composed of multiple computing devices.
- They can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be executed in an order different from the one here, or they can each be made into individual integrated circuit modules, or several of these modules or steps can be made into a single integrated circuit module. Thus, this application is not limited to any specific combination of hardware and software.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A method and apparatus for generating similar text, relating to the technical field of semantic parsing. The method mainly comprises: acquiring the text segmentation of an initial text (101); searching for the text word vectors of the text segmentation according to a preset word vector algorithm (102); splicing the text word vectors with the relative position vectors of the text word vectors to generate a splicing vector (103); inputting the splicing vector into a preset encoder to generate a set of characterizing word vectors of the initial text (104); and inputting the set of characterizing word vectors into a preset decoder to solve for similar text of the initial text (105). The method is mainly applied in natural language processing and also involves blockchain technology; the splicing vector can be stored in a blockchain node.
Description
本申请要求于2020年4月27日提交中国专利局、申请号为202010341544.X,发明名称为“相似文本的生成方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及语义解析技术领域,特别是涉及一种相似文本的生成方法及装置。
随着人工智能的不断发展,人机交互系统的应用越来越广泛。在使用人机交互系统的过程中,用户输入的文本信息,或者语音转换得到的文本信息,可能并不是用户实际要表达的含义。为了避免人机交互系统对用户输入信息的错误解读,往往通过训练双语环境或者多语环境,将用户输入信息转换成多种准确的表述方法。但在双语翻译模型中会遇到语法语义偏差以及文本对齐的问题。
发明人意识到,现有技术中,采用根据第一神经网络模型计算初始文本的当前相似文本,然后根据第二神经网络模型计算初始文本和当前相似文本的当前判别概率,再判断当前判别概率是否等于预设概率值,若不等于则根据预设模型优化策略对第一神经网络模型进行优化,再根据优化后的第一神经网络模型重进计算当前相似文本,最后循环判断计算得到的当前判别概率是否等于预设概率值,若等于则相似文本作为目标相似文本。
本申请创造的发明人在研究中发现,现有技术的方案,采用神经网络方法计算相似文本,判别依赖依据主要在于第一神经网络模型和第二神经网络模型的模型参数,而模型参数是通过训练数据获得的,也就是计算得到的相似文本对训练数据的依赖度较高,相应的对初始文本依赖度较低,所以容易导致相似文本与初始文本的实际语义不完全相同。
发明内容
有鉴于此,本申请提供一种相似文本的生成方法及装置,主要目的在于解决现有技术中相似文本与初始文本的实际语义不完全相同的问题。
依据本申请一个方面,提供了一种相似文本的生成方法,包括:
获取初始文本的文本分词;
根据预置词向量算法,查找所述文本分词的文本词向量;
将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
依据本申请另一个方面,提供了一种相似文本的生成装置,包括:
获取模块,用于获取初始文本的文本分词;
查找模块,用于根据预置词向量算法,查找所述文本分词的文本词向量;
第一生成模块,用于将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
第二生成模块,用于将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
解算模块,用于将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
根据本申请的又一方面,提供了一种计算机存储介质,所述计算机存储介质中存储有至少一种可执行指令,所述可执行指令使处理器执行以下步骤:
获取初始文本的文本分词;
根据预置词向量算法,查找所述文本分词的文本词向量;
将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
根据本申请的再一方面,提供了一种计算机设备,包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;
所述存储器用于存放至少一种可执行指令,所述可执行指令使所述处理器执行以下步骤:
获取初始文本的文本分词;
根据预置词向量算法,查找所述文本分词的文本词向量;
将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
本申请实施例通过以相对位置向量与文本词向量的拼接向量为输入,通过预置编码器生成初始文本的表征词向量结合,其中,相对位置向量使得每个文本分词都具有“上下文”关系,以使得同一个长句中不同分段的词语蕴含的位置信息相同,提高上下文的关联性,进而提高相似文本与初始文本的语义相似度。
上述说明仅是本申请技术方案的概述,为了能够更清楚了解本申请的技术手段,而可依照说明书的内容予以实施,并且为了让本申请的上述和其它目的、特征和优点能够更明显易懂,以下特举本申请的具体实施方式。
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:
图1示出了本申请实施例提供的一种相似文本的生成方法流程图;
图2示出了本申请实施例提供的另一种相似文本的生成方法流程图;
图3示出了本申请实施例提供的一种相似文本的生成装置组成框图;
图4示出了本申请实施例提供的另一种相似文本的生成装置组成框图;
图5示出了本申请实施例提供的一种计算机设备的结构示意图。
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。
本申请的技术方案还可应用于人工智能、区块链和/或大数据技术领域,如可通过数据平台或其他设备实现,以提升相似文本与初始文本的语义相似度。
本申请实施例提供了一种相似文本的生成方法,如图1所示,该方法包括:
101、获取初始文本的文本分词。
当用户通过终端输入文字或语音时,通常需要文字或语音的实际语义进行问答、推荐或搜索。初始文本是指用户输入的文字,或者语音转换后的文字。分词是指将连续的字序列按照一定的规范重新组合成词序列的过程,对初始文本进行分词,可以采用基于字符串匹配的分词方法、基于理解的分词方法或基于统计的分词方法,在本申请实施例中对采用的分词方法不做限定。
102、根据预置词向量算法,查找文本分词的文本词向量。
预置词向量算法可以为基于矩阵分解的方法、基于浅层窗口的方法和word2vector算法 等等,其中word2vector算法是通过神经网络机器学习算法来训练N-gram语言模型,并在训练过程中求出word所对应的vector的方法。在训练过程中采用层次和负采样两种方式加速训练word2vector算法。预置word2vector算法是已经训练好的模型算法,通过预置word2vector算法可以直接查找文本分词的文本词向量。
103、将文本词向量和文本词向量的相对位置向量进行拼接,生成拼接向量。
每个文本词向量都可以根据其在初始文本中的相对位置或绝对位置进行标识。若采用绝对位置会导致同一个长句下不同分段的词语蕴含的位置信息相同,但实际上位置信息应该有所区别,因此在本申请中采用相对位置,以有效区分每个文本词向量。相对位置向量是矢量矩阵,矢量矩阵的第i行第j列标识第i个词到第j个词之间的相对位置。相对位置向量与文本词向量一一对应,是同维度的高维向量,根据矩阵的运算规则直接相加进行拼接。
104、将拼接向量输入预置编码器,生成初始文本的表征词向量集合。
预置编码器的作用是把一个不定长的输入序列变换成一个定长变量,常用循环神经网络实现。也就是将拼接向量转换为同义的表征词向量集合,表征词向量结合是指与初始文本词意图相同,表述不同的高维空间的词向量张量的集合。预置编码器可采用深度神经网络、递归变分、和积网络深度等方式,在本申请实施例中对预置编码器采用的具体方法不做限定。
本申请的目的是在不改变文本含义的基础上,输出丰富多样的文本集合,以完成对初始文本的文本复述,以收集大量相似文本数据,用于提取文字摘要、机器翻译等自然语言处理中需要监督学习的任务。
105、将表征词向量集合输入预置解码器,解算初始文本的相似文本。
预置解码器的作用,与预置编码器的作用相反,是预置编码器的逆过程,用于将定长变量转换为不定长的输出序列。预置解码器是根据下游任务设计的,下游任务可以分为生成式任务和序列任务两类。示例性的,机器翻译是生成式任务,判断同义词是序列任务。以表征词向量集合为输入,经预置解码器的解算,输出相似文本。
本申请提供了一种相似文本的生成方法,首先获取初始文本的文本分词,然后根据预置词向量算法查找文本分词的文本词向量,再将文本词向量和文本词向量的相对位置向量进行拼接生成拼接向量,再将拼接向量输入预置编码器,生成初始文本的表征词向量集合,最后将表征词向量集合输入预置解码器解算初始文本的相似文本。与现有技术相比,本申请实施例通过以相对位置向量与文本词向量的拼接向量为输入,通过预置编码器生成初始文本的表征词向量结合,其中,相对位置向量使得每个文本分词都具有“上下文”关系,以使得同一个长句中不同分段的词语蕴含的位置信息相同,提高上下文的关联性,进而提高相似文本与初始文本的语义相似度。
本申请实施例提供了另一种相似文本的生成方法,如图2所示,该方法包括:
201、获取初始文本的文本分词。
当用户通过终端输入文字或语音时,通常需要文字或语音的实际语义进行问答、推荐或搜索。初始文本是指用户输入的文字,或者语音转换后的文字。分词是指将连续的字序列按照一定的规范重新组合成词序列的过程,对初始文本进行分词,可以采用:将所述初始文本输入至预置的结巴分词模型中;获取所述结巴分词模型输出的文本分词。
结巴中文分词包括基于Trie树结构实现高效的词图扫描,生成句子中汉字所有可能成词情况所构成的有向无环图;采用了动态规划查找最大概率路径,找出基于词频的最大切分组合;对于未登录词,采用了基于汉字成词能力的HMM模型,使用了Viterbi算法。通过载入词典,调整词典,然后基于TF-IDF算法的关键词抽取,或基于TextRank算法的关键词抽取,对初始文本进行分词。
202、根据预置词向量算法,查找文本分词的文本词向量。
预置词向量算法可以为基于矩阵分解的方法、基于浅层窗口的方法和word2vector算法等等,其中word2vector算法是通过神经网络机器学习算法来训练N-gram语言模型,并在训练过程中求出word所对应的vector的方法。在训练过程中采用层次和负采样两种方式加速训练word2vector算法。预置word2vector算法是已经训练好的模型算法,通过预置word2vector算法可以直接查找文本分词的文本词向量。
203、将文本词向量和文本词向量的相对位置向量进行拼接,生成拼接向量。
每个文本词向量都可以根据其在初始文本中的相对位置或绝对位置进行标识。若采用绝对位置会导致同一个长句下不同分段的词语蕴含的位置信息相同,但实际上位置信息应该有所区别,因此在本申请中采用相对位置,以有效区分每个文本词向量。相对位置向量是矢量矩阵,矢量矩阵的第i行第j列标识第i个词到第j个词之间的相对位置。相对位置向量与文本词向量一一对应,是同维度的高维向量,根据矩阵的运算规则直接相加进行拼接。
204、根据拼接向量的词序概率,计算拼接向量的因式分解向量。
为了更好的理解本方案,现举例说明词序概率。假设给定长度为T的序列x,总共有T!种排列方法,也就对应T!种链式分解方法。假设拼接向量x=x1x2x3,那么总共有3!=6种分解方法,其中p(x2|x1x3)是指第一个词是x1并且第三个词是x3的条件下第二个词是x2的概率,也就是说原来词的顺序是保持的。遍历T!种分解方法,并且共享模型参数,使得提取因式分解向量过程中能够学习到上下文关系。而普通的从左到右或者从右往左的语言模型只能学习一种方向的依赖关系,比如先"猜"一个词,然后根据第一个词"猜"第二个词,再根据前两个词"猜"第三个词。而通过排列语言模型会学习各种顺序的词序概率,比如p(x)=p(x1|x3)·p(x2|x1x3)·p(x3)对应的顺序3→1→2,它是先"猜"第三个词,然后根据第三个词猜测第一个词,最后根据第一个和第三个词猜测第二个词。如果上下文依赖关系与文本顺序相同,那么顺序相同的文本具有唯一含义,且根据其唯一含义能获取其相似文本的可能性极大。据此,以词序概率计算拼接向量的因式分解向量。
计算拼接向量的因式分解向量,具体包括:根据所述拼接向量计算所述初始文本的词序概率,其中,所述词序概率是指所述文本分词进行全排列的每种排列方式的条件概率,所述条件概率的发生条件是按照所述排列方式排列在当前分词之前的所有分词全部发生;确定所述词序概率的最大值对应的所述文本分词的排列顺序为分词语义顺序;将相邻分词向量合并,生成所述拼接向量的因式分解向量,所述相邻分词向量是指与所述分词语义顺序中顺序邻接的文本分词对应的所述拼接向量中的向量元素。
假设初始文本中包括5个文本分词x1、x2、x3、x4、x5,对应的拼接向量中包括5个向量元素A1、A2、A3、A4、A5。将初始文本的文本分词进行全排列,包括5!=120种排列方式,其中词序概率最大的排序方式为x3、x1、x2、x4、x5,其计算公式为P=p(x1|x3)·p(x2|x1x3)·p(x3)·p(x4|x1x2x3)·p(x5|x1x2x3x4),分词语义顺序为x3、x1、x2、x4、x5。其中x1和x2,以及x4和x5,都是顺序邻接的分词文本,其对应的拼接向量中的向量元素A1和A2是相邻分词向量,A4和A5是相邻分词向量。将A1和A2合并为B1,将A4和A5合并为B2,拼接向量的因式分解向量为A3、B1、B2,以实现对拼接向量的降维,能减少数据量,提高训练和计算速度。其中,如果拼接向量中的各个元素为顺序编号,则相邻分词向量的查找方法为:获取分词语义顺序中任意位置的第一个元素在拼接向量中的第一元素位置标识,以及按照预置顺序与其相邻的第二个元素在拼接向量中的第二元素位置标识;再将第一元素位置标识做自增步长运算得到预测位置标识,自增步长是拼接向量的顺序编号的编号间隔;如果预测位置标识与第二元素位置标识不同,则重新获取第一元素位置;如果预测位置标识与第二元素位置标识相同,则确定第一元素和第二元素是相邻分词向量,同时将第二元素位置标识重新定义为第一元素位置标识,以第二元素为分词语义顺序中的任意位置的第一元素,重复上述步骤,直至查找到拼接向量中的全部相邻分词向量。相邻分词向量可以是包括两个元素、三个元素、四个元素等等,在本申请实施例中对相邻分词向量中包含的元素个数不做限定。
205、根据预置的自注意力机制,提取因式分解向量的注意力特征。
自注意力特征的提取过程包括:将query和每个key进行相似度计算得到权重,然后使用一个softmax函数对权重进行归一化;最后将权重和相应的键值value进行加权求和得到注意力特征其中key和value是同一个,即key=value。通过因式分解向量和预置的自注意力机制,用于对拼接向量进行意图提取,以获得明显意图相同的文本编码。
206、基于因式分解向量的向量平均矢量和向量标准差矢量,对因式分解向量进行随机采样生成采样样本。
本步骤采用矢量量化变分机制,在本步骤获取维度较低的随机采样的采样样本。现有技术中将输入转换成矢量编码,其所在的潜在空间可能不连续,或者允许简单的插值,在机器翻译的双语翻译任务中,由编码器输出的是一个明确的多维特征张量,由于翻译任务的特殊性,潜在语义特征、语法特征以及文本长度都会影响翻译的准确性以及复述性。隐藏如果编码器输出的不是一个确定的多维张量,而是服从某种分布的随机分布特征,并通过该特征随机取样以保证语言的丰富性与多样性,会使得提高翻译的准确性以及复述性。
生成采用样本的过程具体包括:统计所述因式分解向量的向量平均矢量和向量标准差矢量;根据所述向量平均矢量和所述向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本。统计因式分解向量的数据分布特征,然后进行归纳,输出两个大小相同的矢量,向量平均矢量和向量标准差矢量。然后基于向量平均矢量和向量标准差矢量,对服从该约束的数据进行随机采样,随机采样的采用样本的潜在空间是连续的,并且允许插值。
其中,统计因式分解向量的向量平均矢量和向量标准差矢量,具体包括:依据第一预置概率分布公式,统计所述因式分解向量的第一概率分布函数,并依据第二预置概率分布公式,统计所述因式分解向量的第二概率分布函数,所述第一概率分布函数的因变量包括第一平均矢量和第一标准差矢量,所述第二概率分布函数的因变量包括第二平均矢量和第二标准差矢量;计算所述第一概率分布函数和所述第二概率分布函数的KL散度;如果KL散度等于0,则确定所述因式分解向量服从所述第一概率分布函数或所述第二概率分布函数,确定所述向量平均矢量是所述第一平均矢量或所述第二平均矢量,确定所述向量标准差矢量是第一标准差矢量或第二标准差矢量;如果KL散度不等于0,则根据所述因式分解向量,以获取所述KL散度的最小值为目标,计算所述向量平均矢量和所述向量标准差矢量。
生成采样样本之后,还可以结合残差神经网络,以避免后向传播过程中的梯度爆炸与梯度消失的情况,在输入第二层线性变化的激活层前加入上层输入,能够降低抽象表征在解码器做梯度更新的过程中的交叉熵并且加快收敛速率。
207、根据采样样本和注意力特征,生成初始文本的表征词向量集合。
表征词向量集合,是在采样样本的基础上,与采样样本相似但不完全相同的文本编码集合。具体的,根据预置维度调节规则,生成初始文本的表征向量集合,其中,预置维度调节规则的特征描述为:z_h=α·e_h+(1-α)·q_h;其中,z_h为表征向量集合,α为学习参数,e_h为注意力特征,q_h为随机采样结果。
上述步骤204-207相当于图1所示步骤104将拼接向量输入预置编码器,生成初始文本的表征词向量集合,其中步骤204-207可以等同于编码过程包括因式变换层、自注意力层、矢量量化变分层和全连接层。通过四层计算等到初始文本的表征词向量集合。表征词 向量结合是指与初始文本词意图相同,表述不同的高维空间的词向量张量的集合。本申请的目的是在不改变文本含义的基础上,输出丰富多样的文本集合,以完成对初始文本的文本复述,以收集大量相似文本数据,用于提取文字摘要、机器翻译等自然语言处理中需要监督学习的任务。
208、将表征词向量集合输入预置解码器,解算初始文本的相似文本。
预置解码器的作用,与预置编码器的作用相反,是预置编码器的逆过程,用于将定长变量转换为不定长的输出序列。预置解码器是根据下游任务设计的,下游任务可以分为生成式任务和序列任务两类。示例性的,机器翻译是生成式任务,判断同义词是序列任务。以表征词向量集合为输入,经预置解码器的解算,输出相似文本。
本申请提供了一种相似文本的生成方法,首先获取初始文本的文本分词,然后根据预置词向量算法查找文本分词的文本词向量,再将文本词向量和文本词向量的相对位置向量进行拼接生成拼接向量,再将拼接向量输入预置编码器,生成初始文本的表征词向量集合,最后将表征词向量集合输入预置解码器解算初始文本的相似文本。与现有技术相比,本申请实施例通过以相对位置向量与文本词向量的拼接向量为输入,通过预置编码器生成初始文本的表征词向量结合,其中,相对位置向量使得每个文本分词都具有“上下文”关系,以使得同一个长句中不同分段的词语蕴含的位置信息相同,提高上下文的关联性,进而提高相似文本与初始文本的语义相似度。
进一步的,作为对上述图1所示方法的实现,本申请实施例提供了一种相似文本的生成装置,如图3所示,该装置包括:
获取模块31,用于获取初始文本的文本分词;
查找模块32,用于根据预置词向量算法,查找所述文本分词的文本词向量;
第一生成模块33,用于将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
第二生成模块34,用于将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
解算模块35,用于将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
本申请提供了一种相似文本的生成装置,首先获取初始文本的文本分词,然后根据预置词向量算法查找文本分词的文本词向量,再将文本词向量和文本词向量的相对位置向量进行拼接生成拼接向量,再将拼接向量输入预置编码器,生成初始文本的表征词向量集合,最后将表征词向量集合输入预置解码器解算初始文本的相似文本。与现有技术相比,本申请实施例通过以相对位置向量与文本词向量的拼接向量为输入,通过预置编码器生成初始文本的表征词向量结合,其中,相对位置向量使得每个文本分词都具有“上下文”关系,以使得同一个长句中不同分段的词语蕴含的位置信息相同,提高上下文的关联性,进而提高相似文本与初始文本的语义相似度。
进一步的,作为对上述图2所示方法的实现,本申请实施例提供了另一种相似文本的生成装置,如图4所示,该装置包括:
获取模块41,用于获取初始文本的文本分词;
查找模块42,用于根据预置词向量算法,查找所述文本分词的文本词向量;
第一生成模块43,用于将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
第二生成模块44,用于将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
解算模块45,用于将所述表征词向量集合输入预置解码器,解算所述初始文本的相似 文本。
进一步地,所述获取模块41,包括:
输入单元411,用于将所述初始文本输入至预置的结巴分词模型中;
获取单元412,用于获取所述结巴分词模型输出的文本分词。
进一步地,所述第二生成模块44,包括:
计算单元441,用于根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量,其中,所述拼接向量存储在区块链中;
需要强调的是,为进一步保证上述拼接向量的私密和安全性,上述拼接向量还可以存储于一区块链的节点中。
提取单元442,用于根据预置的自注意力机制,提取所述因式分解向量的注意力特征;
采样单元443,用于基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本;
生成单元444,用于根据所述采样样本和所述注意力特征,生成所述初始文本的表征词向量集合。
进一步地,所述计算单元441,包括:
计算子单元4411,用于根据所述拼接向量计算所述初始文本的词序概率,其中,所述词序概率是指所述文本分词进行全排列的每种排列方式的条件概率,所述条件概率的发生条件是按照所述排列方式排列在当前分词之前的所有分词全部发生;
确定子单元4412,用于确定所述词序概率的最大值对应的所述文本分词的排列顺序为分词语义顺序;
生成子单元4413,用于将相邻分词向量合并,生成所述拼接向量的因式分解向量,所述相邻分词向量是指与所述分词语义顺序中顺序邻接的文本分词对应的所述拼接向量中的向量元素。
进一步地,所述采样单元443,包括:
统计子单元4431,用于统计所述因式分解向量的向量平均矢量和向量标准差矢量;
采样子单元4432,用于根据所述向量平均矢量和所述向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本。
进一步地,所述统计子单元4431,用于:
依据第一预置概率分布公式,统计所述因式分解向量的第一概率分布函数,并依据第二预置概率分布公式,统计所述因式分解向量的第二概率分布函数,所述第一概率分布函数的因变量包括第一平均矢量和第一标准差矢量,所述第二概率分布函数的因变量包括第二平均矢量和第二标准差矢量;
计算所述第一概率分布函数和所述第二概率分布函数的KL散度;
如果KL散度等于0,则确定所述因式分解向量服从所述第一概率分布函数或所述第二概率分布函数,确定所述向量平均矢量是所述第一平均矢量或所述第二平均矢量,确定所述向量标准差矢量是第一标准差矢量或第二标准差矢量;
如果KL散度不等于0,则根据所述因式分解向量,以获取所述KL散度的最小值为目标,计算所述向量平均矢量和所述向量标准差矢量。
进一步地,所述生成单元444,用于:
根据预置维度调节规则,生成所述初始文本的表征向量集合,其中,所述预置维度调节规则的特征描述为:
z_h=α·e_h+(1-α)·q_h;
其中,z_h为所述表征向量集合,α为学习参数,e_h为注意力特征,q_h为所述随机采样结果。
本申请提供了一种相似文本的生成装置,首先获取初始文本的文本分词,然后根据预置词向量算法查找文本分词的文本词向量,再将文本词向量和文本词向量的相对位置向量进行拼接生成拼接向量,再将拼接向量输入预置编码器,生成初始文本的表征词向量集合,最后将表征词向量集合输入预置解码器解算初始文本的相似文本。与现有技术相比,本申请实施例通过以相对位置向量与文本词向量的拼接向量为输入,通过预置编码器生成初始文本的表征词向量结合,其中,相对位置向量使得每个文本分词都具有“上下文”关系,以使得同一个长句中不同分段的词语蕴含的位置信息相同,提高上下文的关联性,进而提高相似文本与初始文本的语义相似度。
根据本申请一个实施例提供了一种计算机存储介质,所述计算机存储介质存储有至少一可执行指令,该计算机可执行指令可执行上述任意方法实施例中的相似文本的生成方法。
可选的,本申请涉及的存储介质(计算机存储介质)可以是计算机可读存储介质,该存储介质可以是非易失性的,也可以是易失性的。
图5示出了根据本申请一个实施例提供的一种计算机设备的结构示意图,本申请具体实施例并不对计算机设备的具体实现做限定。
如图5所示,该计算机设备可以包括:处理器(processor)502、通信接口(Communications Interface)504、存储器(memory)506、以及通信总线508。
其中:处理器502、通信接口504、以及存储器506通过通信总线508完成相互间的通信。
通信接口504,用于与其它设备比如客户端或其它服务器等的网元通信。
处理器502,用于执行至少一种可执行指令如程序510,具体可以执行上述相似文本的生成方法实施例中的相关步骤。
具体地,程序510可以包括程序代码,该程序代码包括计算机操作指令。
处理器502可能是中央处理器CPU,或者是特定集成电路ASIC(Application Specific Integrated Circuit),或者是被配置成实施本申请实施例的一个或多个集成电路。计算机设备包括的一个或多个处理器,可以是同一类型的处理器,如一个或多个CPU;也可以是不同类型的处理器,如一个或多个CPU以及一个或多个ASIC。
存储器506,用于存放至少一种可执行指令如程序510。存储器506可能包含高速RAM存储器,也可能还包括非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。
程序510具体可以用于使得处理器502执行以下操作:
获取初始文本的文本分词;
根据预置词向量算法,查找所述文本分词的文本词向量;
将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;
将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;
将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
显然,本领域的技术人员应该明白,上述的本申请的各模块或各步骤可以用通用的计算装置来实现,它们可以集中在单个的计算装置上,或者分布在多个计算装置所组成的网络上,可选地,它们可以用计算装置可执行的程序代码来实现,从而,可以将它们存储在存储装置中由计算装置来执行,并且在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤,或者将它们分别制作成各个集成电路模块,或者将它们中的多个模块或步 骤制作成单个集成电路模块来实现。这样,本申请不限制于任何特定的硬件和软件结合。
以上所述仅为本申请的优选实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包括在本申请的保护范围之内。
Claims (20)
- 一种相似文本的生成方法,其中,包括:获取初始文本的文本分词;根据预置词向量算法,查找所述文本分词的文本词向量;将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
- 如权利要求1所述的方法,其中,所述获取初始文本的文本分词,包括:将所述初始文本输入至预置的结巴分词模型中;获取所述结巴分词模型输出的文本分词。
- 如权利要求1所述的方法,其中,所述将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合,包括:根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量,其中,所述拼接向量存储在区块链中;根据预置的自注意力机制,提取所述因式分解向量的注意力特征;基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本;根据所述采样样本和所述注意力特征,生成所述初始文本的表征词向量集合。
- 如权利要求3所述的方法,其中,所述根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量,包括:根据所述拼接向量计算所述初始文本的词序概率,其中,所述词序概率是指所述文本分词进行全排列的每种排列方式的条件概率,所述条件概率的发生条件是按照所述排列方式排列在当前分词之前的所有分词全部发生;确定所述词序概率的最大值对应的所述文本分词的排列顺序为分词语义顺序;将相邻分词向量合并,生成所述拼接向量的因式分解向量,所述相邻分词向量是指与所述分词语义顺序中顺序邻接的文本分词对应的所述拼接向量中的向量元素。
- 如权利要求3所述的方法,其中,所述基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本,包括:统计所述因式分解向量的向量平均矢量和向量标准差矢量;根据所述向量平均矢量和所述向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本。
- 如权利要求5所述的方法,其中,所述统计所述因式分解向量的向量平均矢量和向量标准差矢量,包括:依据第一预置概率分布公式,统计所述因式分解向量的第一概率分布函数,并依据第二预置概率分布公式,统计所述因式分解向量的第二概率分布函数,所述第一概率分布函数的因变量包括第一平均矢量和第一标准差矢量,所述第二概率分布函数的因变量包括第二平均矢量和第二标准差矢量;计算所述第一概率分布函数和所述第二概率分布函数的KL散度;如果KL散度等于0,则确定所述因式分解向量服从所述第一概率分布函数或所述第二概率分布函数,确定所述向量平均矢量是所述第一平均矢量或所述第二平均矢量,确定所述向量标准差矢量是第一标准差矢量或第二标准差矢量;如果KL散度不等于0,则根据所述因式分解向量,以获取所述KL散度的最小值为目标,计算所述向量平均矢量和所述向量标准差矢量。
- 如权利要求3所述的方法,其中,所述根据所述采样样本和注意力特征,生成所述 初始文本的表征词向量集合,包括:根据预置维度调节规则,生成所述初始文本的表征向量集合,其中,所述预置维度调节规则的特征描述为:z h=αe h+(1-α)q h;其中,z h为所述表征向量集合,α为学习参数,e h为注意力特征,q h为所述随机采样结果。
- 一种相似文本的生成装置,其中,包括:获取模块,用于获取初始文本的文本分词;查找模块,用于根据预置词向量算法,查找所述文本分词的文本词向量;第一生成模块,用于将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;第二生成模块,用于将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;解算模块,用于将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
- 一种计算机存储介质,其中,所述计算机存储介质中存储有至少一种可执行指令,所述可执行指令使处理器执行以下步骤:获取初始文本的文本分词;根据预置词向量算法,查找所述文本分词的文本词向量;将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
- 如权利要求9所述的计算机存储介质,其中,所述将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合时,具体执行:根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量,其中,所述拼接向量存储在区块链中;根据预置的自注意力机制,提取所述因式分解向量的注意力特征;基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本;根据所述采样样本和所述注意力特征,生成所述初始文本的表征词向量集合。
- 如权利要求10所述的计算机存储介质,其中,所述根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量时,具体执行:根据所述拼接向量计算所述初始文本的词序概率,其中,所述词序概率是指所述文本分词进行全排列的每种排列方式的条件概率,所述条件概率的发生条件是按照所述排列方式排列在当前分词之前的所有分词全部发生;确定所述词序概率的最大值对应的所述文本分词的排列顺序为分词语义顺序;将相邻分词向量合并,生成所述拼接向量的因式分解向量,所述相邻分词向量是指与所述分词语义顺序中顺序邻接的文本分词对应的所述拼接向量中的向量元素。
- 如权利要求10所述的计算机存储介质,其中,所述基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本时,具体执行:统计所述因式分解向量的向量平均矢量和向量标准差矢量;根据所述向量平均矢量和所述向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本。
- 如权利要求12所述的计算机存储介质,其中,所述统计所述因式分解向量的向量平均矢量和向量标准差矢量时,具体执行:依据第一预置概率分布公式,统计所述因式分解向量的第一概率分布函数,并依据第二预置概率分布公式,统计所述因式分解向量的第二概率分布函数,所述第一概率分布函数的因变量包括第一平均矢量和第一标准差矢量,所述第二概率分布函数的因变量包括第二平均矢量和第二标准差矢量;计算所述第一概率分布函数和所述第二概率分布函数的KL散度;如果KL散度等于0,则确定所述因式分解向量服从所述第一概率分布函数或所述第二概率分布函数,确定所述向量平均矢量是所述第一平均矢量或所述第二平均矢量,确定所述向量标准差矢量是第一标准差矢量或第二标准差矢量;如果KL散度不等于0,则根据所述因式分解向量,以获取所述KL散度的最小值为目标,计算所述向量平均矢量和所述向量标准差矢量。
- 如权利要求10所述的计算机存储介质,其中,所述根据所述采样样本和注意力特征,生成所述初始文本的表征词向量集合时,具体执行:根据预置维度调节规则,生成所述初始文本的表征向量集合,其中,所述预置维度调节规则的特征描述为:z h=αe h+(1-α)q h;其中,z h为所述表征向量集合,α为学习参数,e h为注意力特征,q h为所述随机采样结果。
- 一种计算机设备,其中,包括:处理器、存储器、通信接口和通信总线、所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;所述存储器用于存放至少一种可执行指令,所述可执行指令使所述处理器执行以下步骤:获取初始文本的文本分词;根据预置词向量算法,查找所述文本分词的文本词向量;将所述文本词向量和所述文本词向量的相对位置向量进行拼接,生成拼接向量;将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合;将所述表征词向量集合输入预置解码器,解算所述初始文本的相似文本。
- 如权利要求15所述的计算机设备,其中,所述将所述拼接向量输入预置编码器,生成所述初始文本的表征词向量集合时,具体执行:根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量,其中,所述拼接向量存储在区块链中;根据预置的自注意力机制,提取所述因式分解向量的注意力特征;基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本;根据所述采样样本和所述注意力特征,生成所述初始文本的表征词向量集合。
- 如权利要求16所述的计算机设备,其中,所述根据所述拼接向量的词序概率,计算所述拼接向量的因式分解向量时,具体执行:根据所述拼接向量计算所述初始文本的词序概率,其中,所述词序概率是指所述文本分词进行全排列的每种排列方式的条件概率,所述条件概率的发生条件是按照所述排列方式排列在当前分词之前的所有分词全部发生;确定所述词序概率的最大值对应的所述文本分词的排列顺序为分词语义顺序;将相邻分词向量合并,生成所述拼接向量的因式分解向量,所述相邻分词向量是指与 所述分词语义顺序中顺序邻接的文本分词对应的所述拼接向量中的向量元素。
- 如权利要求16所述的计算机设备,其中,所述基于所述因式分解向量的向量平均矢量和向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本时,具体执行:统计所述因式分解向量的向量平均矢量和向量标准差矢量;根据所述向量平均矢量和所述向量标准差矢量,对所述因式分解向量进行随机采样生成采样样本。
- 如权利要求18所述的计算机设备,其中,所述统计所述因式分解向量的向量平均矢量和向量标准差矢量时,具体执行:依据第一预置概率分布公式,统计所述因式分解向量的第一概率分布函数,并依据第二预置概率分布公式,统计所述因式分解向量的第二概率分布函数,所述第一概率分布函数的因变量包括第一平均矢量和第一标准差矢量,所述第二概率分布函数的因变量包括第二平均矢量和第二标准差矢量;计算所述第一概率分布函数和所述第二概率分布函数的KL散度;如果KL散度等于0,则确定所述因式分解向量服从所述第一概率分布函数或所述第二概率分布函数,确定所述向量平均矢量是所述第一平均矢量或所述第二平均矢量,确定所述向量标准差矢量是第一标准差矢量或第二标准差矢量;如果KL散度不等于0,则根据所述因式分解向量,以获取所述KL散度的最小值为目标,计算所述向量平均矢量和所述向量标准差矢量。
- 如权利要求16所述的计算机设备,其中,所述根据所述采样样本和注意力特征,生成所述初始文本的表征词向量集合时,具体执行:根据预置维度调节规则,生成所述初始文本的表征向量集合,其中,所述预置维度调节规则的特征描述为:z h=αe h+(1-α)q h;其中,z h为所述表征向量集合,α为学习参数,e h为注意力特征,q h为所述随机采样结果。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010341544.XA CN111680494B (zh) | 2020-04-27 | 2020-04-27 | 相似文本的生成方法及装置 |
CN202010341544.X | 2020-04-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021218015A1 true WO2021218015A1 (zh) | 2021-11-04 |
Family
ID=72452258
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/117946 WO2021218015A1 (zh) | 2020-04-27 | 2020-09-25 | 相似文本的生成方法及装置 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111680494B (zh) |
WO (1) | WO2021218015A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114338129A (zh) * | 2021-12-24 | 2022-04-12 | 中汽创智科技有限公司 | 一种报文异常检测方法、装置、设备及介质 |
CN114742029A (zh) * | 2022-04-20 | 2022-07-12 | 中国传媒大学 | 一种汉语文本比对方法、存储介质及设备 |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680494B (zh) * | 2020-04-27 | 2023-05-12 | 平安科技(深圳)有限公司 | 相似文本的生成方法及装置 |
CN112395385B (zh) * | 2020-11-17 | 2023-07-25 | 中国平安人寿保险股份有限公司 | 基于人工智能的文本生成方法、装置、计算机设备及介质 |
CN112580352B (zh) * | 2021-03-01 | 2021-06-04 | 腾讯科技(深圳)有限公司 | 关键词提取方法、装置和设备及计算机存储介质 |
CN113822034B (zh) * | 2021-06-07 | 2024-04-19 | 腾讯科技(深圳)有限公司 | 一种复述文本的方法、装置、计算机设备及存储介质 |
CN113779987A (zh) * | 2021-08-23 | 2021-12-10 | 科大国创云网科技有限公司 | 一种基于自注意力增强语义的事件共指消岐方法及系统 |
CN114357974B (zh) * | 2021-12-28 | 2022-09-23 | 北京海泰方圆科技股份有限公司 | 相似样本语料的生成方法、装置、电子设备及存储介质 |
CN114936548B (zh) * | 2022-03-22 | 2024-09-06 | 北京探境科技有限公司 | 一种相似命令文本的生成方法、装置、设备及存储介质 |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106802888B (zh) * | 2017-01-12 | 2020-01-24 | 北京航空航天大学 | 词向量训练方法和装置 |
JP6976155B2 (ja) * | 2017-12-18 | 2021-12-08 | ヤフー株式会社 | 類似テキスト抽出装置、自動応答システム、類似テキスト抽出方法、およびプログラム |
KR20200015418A (ko) * | 2018-08-02 | 2020-02-12 | 네오사피엔스 주식회사 | 순차적 운율 특징을 기초로 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체 |
CN109145315B (zh) * | 2018-09-05 | 2022-03-18 | 腾讯科技(深圳)有限公司 | 文本翻译方法、装置、存储介质和计算机设备 |
CN110147535A (zh) * | 2019-04-18 | 2019-08-20 | 平安科技(深圳)有限公司 | 相似文本生成方法、装置、设备及存储介质 |
CN110110045B (zh) * | 2019-04-26 | 2021-08-31 | 腾讯科技(深圳)有限公司 | 一种检索相似文本的方法、装置以及存储介质 |
CN110209801B (zh) * | 2019-05-15 | 2021-05-14 | 华南理工大学 | 一种基于自注意力网络的文本摘要自动生成方法 |
CN110399454B (zh) * | 2019-06-04 | 2022-02-25 | 深思考人工智能机器人科技(北京)有限公司 | 一种基于变压器模型和多参照系的文本编码表示方法 |
CN110619034A (zh) * | 2019-06-27 | 2019-12-27 | 中山大学 | 基于Transformer模型的文本关键词生成方法 |
- 2020-04-27 CN CN202010341544.XA patent/CN111680494B/zh active Active
- 2020-09-25 WO PCT/CN2020/117946 patent/WO2021218015A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180300295A1 (en) * | 2017-04-14 | 2018-10-18 | Digital Genius Limited | Automated tagging of text |
CN110135507A (zh) * | 2019-05-21 | 2019-08-16 | 西南石油大学 | 一种标签分布预测方法及装置 |
CN110362684A (zh) * | 2019-06-27 | 2019-10-22 | 腾讯科技(深圳)有限公司 | 一种文本分类方法、装置及计算机设备 |
CN110619127A (zh) * | 2019-08-29 | 2019-12-27 | 内蒙古工业大学 | 一种基于神经网络图灵机的蒙汉机器翻译方法 |
CN111680494A (zh) * | 2020-04-27 | 2020-09-18 | 平安科技(深圳)有限公司 | 相似文本的生成方法及装置 |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114338129A (zh) * | 2021-12-24 | 2022-04-12 | 中汽创智科技有限公司 | 一种报文异常检测方法、装置、设备及介质 |
CN114338129B (zh) * | 2021-12-24 | 2023-10-31 | 中汽创智科技有限公司 | 一种报文异常检测方法、装置、设备及介质 |
CN114742029A (zh) * | 2022-04-20 | 2022-07-12 | 中国传媒大学 | 一种汉语文本比对方法、存储介质及设备 |
CN114742029B (zh) * | 2022-04-20 | 2022-12-16 | 中国传媒大学 | 一种汉语文本比对方法、存储介质及设备 |
Also Published As
Publication number | Publication date |
---|---|
CN111680494A (zh) | 2020-09-18 |
CN111680494B (zh) | 2023-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021218015A1 (zh) | 相似文本的生成方法及装置 | |
Duan et al. | Question generation for question answering | |
Peng et al. | Incrementally learning the hierarchical softmax function for neural language models | |
WO2023065544A1 (zh) | 意图分类方法、装置、电子设备及计算机可读存储介质 | |
CN114169330B (zh) | 融合时序卷积与Transformer编码器的中文命名实体识别方法 | |
CN111814466A (zh) | 基于机器阅读理解的信息抽取方法、及其相关设备 | |
CN114218389A (zh) | 一种基于图神经网络的化工制备领域长文本分类方法 | |
CN112100332A (zh) | 词嵌入表示学习方法及装置、文本召回方法及装置 | |
CN112906397B (zh) | 一种短文本实体消歧方法 | |
CN112580346B (zh) | 事件抽取方法、装置、计算机设备和存储介质 | |
WO2021146694A1 (en) | Systems and methods for mapping a term to a vector representation in a semantic space | |
CN111368542A (zh) | 一种基于递归神经网络的文本语言关联抽取方法和系统 | |
CN115759119B (zh) | 一种金融文本情感分析方法、系统、介质和设备 | |
CN115437626A (zh) | 一种基于自然语言的ocl语句自动生成方法和装置 | |
CN116304748A (zh) | 一种文本相似度计算方法、系统、设备及介质 | |
Siebers et al. | A survey of text representation methods and their genealogy | |
Bouraoui et al. | A comprehensive review of deep learning for natural language processing | |
Shi et al. | Improving code search with multi-modal momentum contrastive learning | |
CN113536741B (zh) | 中文自然语言转数据库语言的方法及装置 | |
Tian et al. | Chinese short text multi-classification based on word and part-of-speech tagging embedding | |
Gao et al. | Citation entity recognition method using multi‐feature semantic fusion based on deep learning | |
Liu | Research on literary translation based on the improved optimization model | |
Li et al. | A Chinese NER Method Based on Chinese Characters' Multiple Information | |
CN114722818A (zh) | 一种基于对抗迁移学习的命名实体识别模型 | |
CN113822018A (zh) | 实体关系联合抽取方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20933781 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 20933781 Country of ref document: EP Kind code of ref document: A1 |