WO2023071115A1 - Sentence vector generation method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023071115A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence vector
target
model
sentence
training
Prior art date
Application number
PCT/CN2022/090157
Other languages
French (fr)
Chinese (zh)
Inventor
陈浩 (Chen Hao)
谯轶轩 (Qiao Yixuan)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023071115A1 publication Critical patent/WO2023071115A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of natural language processing in artificial intelligence, in particular to a sentence vector generation method, device, equipment and storage medium.
  • a sentence vector encodes the text information of a sentence into a fixed dense vector space.
  • sentence vectors play an important role in various NLP tasks, and are applied to NLP tasks such as classification, clustering, and sentence similarity measurement.
  • Methods for constructing sentence vectors include unsupervised learning methods or contrastive learning-based methods.
  • the inventor realized that the training process of the unsupervised learning method requires a large amount of corpus and the model is difficult to converge, which has led to this method gradually falling out of use.
  • the model is mainly trained by constructing positive and negative samples.
  • the difficulty of this method is that text data is discrete rather than continuous, so positive samples cannot be constructed simply by flipping and cropping as with image data. Only high-quality positive samples can train a good contrastive learning model, which makes it difficult for this method to be widely applied.
  • the main purpose of this application is to provide a sentence vector generation method, device, equipment and storage medium, aiming to solve the problems that the prior art constructs sentence vectors by unsupervised learning or contrastive learning-based methods, that the unsupervised learning method requires a large amount of corpus and its model is difficult to converge, and that the contrastive learning-based method requires high-quality positive samples.
  • the application proposes a method for generating sentence vectors, the method comprising:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a sentence vector generating device, the device comprising:
  • a data acquisition module configured to acquire target text data
  • the sentence vector generation module is used to input the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a computer device, including:
  • one or more processors; a memory; and
  • one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform a sentence vector generation method:
  • sentence vector generation method comprises:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, a sentence vector generation method is implemented, wherein the sentence vector generation method includes the following steps:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • according to the sentence vector generation method, device, equipment and storage medium of the present application, the method generates the sentence vector by inputting the target text data into the sentence vector generation model to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
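The inference step summarized above (target text data in, fixed-size target sentence vector out) can be sketched as follows. This is a minimal illustration, not the application's actual model: the whitespace tokenizer, the toy embedding table, and the `SentenceVectorGenerationModel` class are assumed stand-ins, and mean pooling is just one plausible way to map token vectors to a fixed-size sentence vector.

```python
def tokenize(text):
    # Whitespace tokenization stands in for a real (e.g. Chinese) segmenter.
    return text.split()

class SentenceVectorGenerationModel:
    def __init__(self, embeddings, dim):
        self.embeddings = embeddings  # token -> list[float]
        self.dim = dim

    def encode(self, text):
        # Look up a vector per token and mean-pool them into one
        # fixed-size dense sentence vector.
        vecs = [self.embeddings.get(t, [0.0] * self.dim) for t in tokenize(text)]
        if not vecs:
            return [0.0] * self.dim
        return [sum(col) / len(vecs) for col in zip(*vecs)]

embeddings = {"deep": [1.0, 0.0], "learning": [0.0, 1.0]}
model = SentenceVectorGenerationModel(embeddings, dim=2)
target_sentence_vector = model.encode("deep learning")  # -> [0.5, 0.5]
```

In the application itself the encoder would be a trained neural network (e.g. Bert-based) rather than a static lookup table, but the input/output contract is the same.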
  • Fig. 1 is a schematic flowchart of the sentence vector generation method according to an embodiment of the present application;
  • Fig. 2 is a schematic structural block diagram of a sentence vector generation device according to an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a schematic flow diagram of a method for generating sentence vectors according to an embodiment of the present application. The method includes the following steps:
  • S2 Input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the target sentence vector corresponding to the target text data is obtained by inputting the target text data into the sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
  • the target text data may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the target text data, that is, the text data for which a sentence vector needs to be generated.
  • the target text data may be one or more of title, abstract, and keywords in a book.
  • the target text data is input into a sentence vector generation model to generate a sentence vector, and the generated sentence vector is used as the target sentence vector corresponding to the target text data.
  • the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples
  • the neural network includes but is not limited to: the Bert (Bidirectional Encoder Representations from Transformers) model and the XLNET (a generalized autoregressive pretraining method) model.
  • Each of the training samples includes: corpus fragments and corpus fragment definitions.
  • a corpus fragment includes one or more words.
  • the definition of the corpus fragment is the interpretation of the corpus fragment. That is, corpus fragments and corpus fragment definitions form text pairs.
  • the neural network is trained using these text pairs to shorten the distance between the vector of a corpus fragment and the vector of its definition, thereby improving the accuracy of the sentence vectors.
  • using a plurality of training samples to train the neural network to obtain the sentence vector generation model includes: generating word vectors according to the corpus fragments in the training samples to obtain corpus fragment word vectors; using an initial model based on the neural network to perform sentence vector generation on the corpus fragment definitions in the training samples to obtain corpus fragment definition sentence vectors; and training the initial model according to the corpus fragment word vectors and the corpus fragment definition sentence vectors, with the initial model at the end of training used as the sentence vector generation model.
  • in this way, the distance between the word vector corresponding to the corpus fragment and the sentence vector corresponding to the corpus fragment definition is shortened, and the accuracy with which the sentence vector generation model generates sentence vectors is improved. No generation-based NLP training method is required, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge.
  • a learning rate of 0.0005 is used.
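As a toy illustration of how that learning rate enters a parameter update, the snippet below applies one gradient-descent step with lr = 0.0005. The scalar parameter and the quadratic loss are assumed for illustration only; they are not the application's actual model or loss.

```python
LEARNING_RATE = 0.0005  # learning rate stated in the embodiment

def gradient_step(param, grad, lr=LEARNING_RATE):
    # One plain gradient-descent update: move against the gradient,
    # scaled by the learning rate.
    return param - lr * grad

p = 1.0
# For an illustrative quadratic loss L(p) = p^2, the gradient is 2p.
p = gradient_step(p, 2 * p)  # 1.0 - 0.0005 * 2.0 = 0.999
```

A smaller learning rate like this trades slower convergence for more stable updates, which suits fine-tuning large pretrained encoders.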
  • before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method also includes:
  • S213 Generate a word vector according to each of the corpus fragments in the training sample set to obtain a first word vector
  • S214 Using an initial model, perform sentence vector generation on each of the corpus fragment definitions in the training sample set to obtain a first sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
  • S215 Calculate the loss value according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value, update the parameters of the initial model according to the first loss value, and use the updated initial model for the next calculation of the first sentence vectors;
  • S217 Use the initial model that achieves the first training objective as the sentence vector generation model.
  • the first word vector is determined according to the corpus fragment, and the first sentence vector is determined according to the corpus fragment definition;
  • the loss value is calculated according to each of the first word vectors and each of the first sentence vectors, thereby shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its definition. This improves the accuracy with which the sentence vector generation model generates sentence vectors and requires no generation-based NLP training method, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge. Moreover, a training sample set is used for batch training each time, preventing abnormal training samples from unduly affecting the parameters of the initial model, which helps improve training accuracy.
  • the multiple training samples may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the preset batch number is set to 64. It can be understood that the preset batch number can also be set to other values, which is not limited here.
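Drawing training sample sets of the preset batch number can be sketched as follows; the function and variable names are illustrative, not from the application.

```python
PRESET_BATCH_NUMBER = 64  # value used in the embodiment above

def iter_batches(samples, batch_size=PRESET_BATCH_NUMBER):
    # Yield successive training sample sets of at most batch_size samples.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Toy (corpus fragment, corpus fragment definition) pairs.
samples = [("fragment %d" % i, "definition %d" % i) for i in range(130)]
batches = list(iter_batches(samples))
# 130 samples -> training sample sets of 64, 64, and 2 samples
```

Each yielded set corresponds to one "training sample set" in steps S213–S215: one loss value is computed per set and the model parameters are updated once per set.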
  • a word vector is generated according to each of the corpus fragments in the training sample set, and each generated word vector is used as a first word vector (that is, a corpus fragment word vector). That is to say, each corpus segment corresponds to a first word vector.
  • the initial model is used to generate a sentence vector for each of the corpus fragment definitions in the training sample set, and the generated sentence vector is used as the first sentence vector (that is, the corpus fragment defines a sentence vector). That is to say, each corpus fragment definition corresponds to a first sentence vector.
  • the initial model is a model obtained based on the Bert model or the XLNET model. It can be understood that the initial model may also use other models, which are not limited here.
  • the calculation of the loss value is performed according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value, so as to realize the calculation of a first loss value for each batch.
  • the number of first word vectors is equal to the preset batch number, and the number of first sentence vectors is likewise equal to the preset batch number.
  • the updated initial model is used for the next calculation of the first sentence vector, thereby implementing an iterative update of the initial model.
  • step S212 to step S216 are repeatedly executed until the first training target is reached.
  • the first training target includes: the first loss value reaches a first convergence condition or the number of iterations of the initial model reaches a second convergence condition.
  • the first convergence condition means that the size of the first loss value calculated twice adjacently satisfies the Lipschitz condition (Lipschitz continuous condition).
  • the number of iterations refers to the number of calculations of the first loss value, that is, the number of iterations is increased by 1 after being calculated once.
  • the second convergence condition is a specific numerical value.
  • the initial model that achieves the first training objective is a model that meets expected requirements, so the initial model that achieves the first training objective is used as the sentence vector generation model.
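The first training target above can be sketched as a stopping check. The tolerance on adjacent loss values (standing in for the Lipschitz-style first convergence condition) and the iteration cap (the second convergence condition) are assumed values here, since the application does not fix them.

```python
MAX_ITERATIONS = 1000   # assumed value for the second convergence condition
LOSS_TOLERANCE = 1e-4   # assumed threshold on adjacent first loss values

def reached_first_training_target(prev_loss, curr_loss, iterations):
    # First convergence condition: two adjacent loss values are close enough.
    if prev_loss is not None and abs(prev_loss - curr_loss) < LOSS_TOLERANCE:
        return True
    # Second convergence condition: iteration count reached a fixed value.
    return iterations >= MAX_ITERATIONS
```

The training loop of steps S212–S216 would call this after each first loss value is computed, and the model at the moment the check passes becomes the sentence vector generation model.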
  • the above step of obtaining a plurality of training samples includes:
  • dictionary data includes: text segment and text segment definition
  • the text segment includes: any one of single Chinese characters, words, and idioms
  • the text segment definition is an explanation for the text segment
  • S2113 Generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the target The text segment corresponding to the text segment is defined as the corpus segment definition of the training sample;
  • the training samples are determined according to the dictionary data, which has the advantages of high accuracy, high stability, and easy acquisition.
  • This provides a basis for further improving the accuracy of the sentence vector generation model to generate sentence vectors.
  • the dictionary data may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the dictionary data is data obtained from the Xinhua Dictionary. It can be understood that the dictionary data may also be data obtained from other dictionaries, such as English dictionaries and other Chinese dictionaries, which are not limited here.
  • any text segment is obtained from the dictionary data, and the obtained text segment is used as a target text segment.
  • step S2112 to step S2114 are repeatedly executed until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
  • a sample generation end signal is acquired.
  • the sample generation end signal is generated by the program file implementing the present application according to preset conditions.
  • the preset condition is a preset sample size, which is not specifically limited in this example.
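The sample-generation loop described above (each text segment becomes a corpus fragment, its definition becomes the corpus fragment definition, and generation stops when the dictionary is exhausted or an end signal fires) can be sketched as follows. The dictionary entries and the `max_samples` cutoff are illustrative assumptions, not data from the Xinhua Dictionary.

```python
# Toy dictionary data: text segment -> text segment definition.
dictionary_data = {
    "山": "a large natural elevation of the ground",
    "学习": "to acquire knowledge or skill",
    "画蛇添足": "to ruin something by adding what is superfluous",
}

def generate_training_samples(dictionary, max_samples=None):
    samples = []
    for text_segment, definition in dictionary.items():
        if max_samples is not None and len(samples) >= max_samples:
            break  # stands in for the sample generation end signal
        samples.append({"corpus_fragment": text_segment,
                        "corpus_fragment_definition": definition})
    return samples

training_samples = generate_training_samples(dictionary_data)
```

This mirrors steps S2112–S2114: one training sample per dictionary entry, with the target text segment as the corpus fragment and its definition as the corpus fragment definition.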
  • the above-mentioned step of generating word vectors according to each of the corpus fragments in the training sample set to obtain the first word vector includes:
  • S2131 Perform word segmentation processing on each of the corpus fragments in the training sample set to obtain a phrase set of corpus fragments;
  • S2133 Perform average calculation on each set of phrase word vectors to obtain the first word vector.
  • this embodiment performs word segmentation on the corpus fragments, generates a word vector for each phrase, and averages the word vectors to obtain the word vector of each corpus fragment, which provides a basis for shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its definition.
  • word segmentation is performed on each of the corpus fragments in the training sample set, and all phrases obtained by word segmentation are used as a corpus fragment phrase set. That is to say, there is a one-to-one correspondence between the phrase set of corpus fragments and the corpus fragments in the training sample set.
  • a preset word vector model is used to generate word vectors for each phrase in the corpus fragment phrase set, and each generated word vector is used as a phrase word vector set. That is to say, the phrase word vector set is in one-to-one correspondence with the corpus fragments in the training sample set.
  • the preset word vector model uses the pre-trained Chinese Glove (Global Vectors for Word Representation) word vector model.
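Steps S2131–S2133 (segment a corpus fragment, look up a word vector per phrase, and average per dimension) can be sketched as below. The two-entry vector table is a toy stand-in for a pretrained Chinese Glove model, and the vectors are made up for illustration.

```python
# Toy stand-in for a pretrained Chinese Glove word vector table.
glove_vectors = {
    "机器": [0.2, 0.4],
    "学习": [0.6, 0.0],
}

def first_word_vector(phrases, table, dim=2):
    # Look up each phrase's word vector; unknown phrases get a zero vector.
    vecs = [table.get(p, [0.0] * dim) for p in phrases]
    # Per-dimension average of the phrase word vector set.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# A corpus fragment already segmented into phrases.
v = first_word_vector(["机器", "学习"], glove_vectors)  # ~[0.4, 0.2]
```

Averaging is order-insensitive, which is acceptable here because the corpus fragments are short dictionary headwords rather than full sentences.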
  • the above-mentioned step of using the initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain the first sentence vectors includes:
  • S2142 Input each defined phrase set into the initial model to generate a sentence vector to obtain the first sentence vector.
  • This embodiment implements word segmentation processing and sentence vector generation for the definitions of the corpus fragments, thereby providing a basis for shortening the distance between the word vectors corresponding to the corpus fragments and the sentence vectors corresponding to the corpus fragment definitions.
  • word segmentation is performed on each definition of the corpus segment in the training sample set, and all phrases obtained by word segmentation are used as a set of defined phrases. That is to say, there is a one-to-one correspondence between the defined phrase set and the definitions of the corpus fragments in the training sample set.
  • the step of calculating the loss value according to each of the first word vectors and each of the first sentence vectors to obtain the first loss value includes:
  • S2152 Input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to calculate a loss value to obtain a pending loss value, wherein the preset loss function adopts a relative entropy loss function;
  • S2154 Perform average calculation on each of the loss values to be processed to obtain the first loss value.
  • the relative entropy loss function is used as the preset loss function, which is beneficial to shorten the distance between the vectors.
  • the first loss value is obtained by calculating the average of the loss values to be processed, so that each batch of training updates the parameters of the initial model only once.
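Steps S2151–S2154 can be sketched as follows under stated assumptions: each first word vector and its corresponding first sentence vector are turned into probability distributions with softmax before the relative entropy (KL divergence) loss is applied, and the per-pair pending loss values are averaged into the first loss value. The softmax step is an assumption needed to make relative entropy well defined on raw vectors; the application does not specify it.

```python
import math

def softmax(v):
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def relative_entropy(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); zero when p == q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def first_loss_value(word_vectors, sentence_vectors):
    # One pending loss value per (first word vector, first sentence vector)
    # pair, then one average per batch.
    pending = [relative_entropy(softmax(w), softmax(s))
               for w, s in zip(word_vectors, sentence_vectors)]
    return sum(pending) / len(pending)

loss = first_loss_value([[1.0, 0.0]], [[1.0, 0.0]])  # identical pair -> 0.0
```

Minimizing this loss pulls the definition's sentence vector toward the fragment's word vector, which is exactly the "shortening the distance" objective described above.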
  • any one of the first word vectors is acquired, and the acquired first word vector is used as a target word vector.
  • for S2152, the target word vector and the first sentence vector corresponding to the target word vector are input into the preset loss function for loss value calculation, and the calculated loss value is used as a loss value to be processed.
  • step S2151 to step S2153 are repeatedly executed until the acquisition of the first word vector is completed.
  • before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method also includes:
  • S223 Generate a word vector according to the corpus fragment in the target training sample to obtain a second word vector
  • S224 Using an initial model, generate a sentence vector for the definition of the corpus segment in the target training sample to obtain a second sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
  • S225 Calculate the loss value according to the second word vector and the second sentence vector to obtain a second loss value, update the parameters of the initial model according to the second loss value, and use the updated initial model For calculating the second sentence vector next time;
  • S226 Repeat the step of acquiring one training sample as a target training sample until a second training target is reached;
  • the second word vector is determined according to the corpus fragment
  • the second sentence vector is determined according to the definition of the corpus fragment
  • the loss value is calculated according to the second word vector and the second sentence vector, thereby shortening the distance between the word vector corresponding to the corpus fragment and the sentence vector corresponding to the corpus fragment definition. This improves the accuracy with which the sentence vector generation model generates sentence vectors and requires no generation-based NLP training method, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge.
  • the multiple training samples may be input by the user, obtained from a database, or obtained from a third-party application system.
  • one training sample is acquired from each of the training samples, and the acquired training sample is used as a target training sample.
  • word segmentation is performed on the corpus fragment in the target training sample, word vector generation is performed for each phrase obtained by segmentation, the average of the generated word vectors is calculated, and the result is used as the second word vector (that is, the corpus fragment word vector).
  • the corpus fragment definition in the target training sample is segmented, each phrase obtained by segmentation is input into the initial model for sentence vector generation, and the generated sentence vector is used as the second sentence vector (that is, the corpus fragment definition sentence vector).
  • a loss value is calculated according to the second word vector and the second sentence vector, and the calculated loss value is used as a second loss value.
  • the second training objective includes: the second loss value reaches a third convergence condition or the number of iterations of the initial model reaches a fourth convergence condition.
  • the third convergence condition means that the size of the second loss value calculated twice adjacently satisfies the Lipschitz condition (Lipschitz continuous condition).
  • the number of iterations refers to the number of calculations of the second loss value, that is, the number of iterations is increased by 1 after being calculated once.
  • the fourth convergence condition is a specific numerical value.
  • the initial model that achieves the second training objective is a model that meets expected requirements, so the initial model that achieves the second training objective is used as the sentence vector generation model.
  • the present application also proposes a sentence vector generation device, and the device includes:
  • Data acquisition module 100 for acquiring target text data
  • the sentence vector generation module 200 is used to input the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each training sample includes a corpus fragment and a corpus fragment definition.
  • the target sentence vector corresponding to the target text data is obtained by inputting the target text data into the sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
  • the above device further includes: a first model training module
  • the first model training module is configured to: obtain a plurality of training samples; obtain a preset batch number of training samples as a training sample set; generate word vectors according to each of the corpus fragments in the training sample set to obtain the first word vectors; use an initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain the first sentence vectors, wherein the initial model is a model obtained based on the Bert model or the XLNET model; calculate the loss value according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value; update the parameters of the initial model according to the first loss value, and use the updated initial model for the next calculation of the first sentence vectors; repeat the step of obtaining a preset batch number of training samples as a training sample set until the first training target is reached; and use the initial model that reaches the first training target as the sentence vector generation model.
  • the above-mentioned first model training module includes: a training sample generation submodule;
  • the training sample generation submodule is used to: obtain dictionary data, where the dictionary data includes text segments and text segment definitions, a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the text segment; obtain any text segment from the dictionary data as a target text segment; generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus fragment of the training sample and the text segment definition corresponding to the target text segment is used as the corpus fragment definition of the training sample; and repeat the step of obtaining any text segment from the dictionary data as the target text segment until the acquisition of text segments in the dictionary data is completed or a sample generation end signal is obtained.
  • the above-mentioned first model training module further includes: a first word vector determination submodule;
  • the first word vector determination submodule is used to: perform word segmentation processing on each of the corpus fragments in the training sample set to obtain corpus fragment phrase sets; use a preset word vector model to generate a word vector for each phrase in each corpus fragment phrase set to obtain phrase word vector sets; and perform an average calculation on each of the phrase word vector sets to obtain the first word vectors.
  • the above-mentioned first model training module also includes: a first sentence vector determination submodule;
  • the first sentence vector determination submodule is used to perform word segmentation processing on each of the corpus fragment definitions in the training sample set to obtain defined phrase sets, and to input each of the defined phrase sets into the initial model for sentence vector generation to obtain the first sentence vectors.
  • the above-mentioned first model training module further includes: a first loss value determination submodule;
  • the first loss value determination submodule is used to: obtain any one of the first word vectors as a target word vector; input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a loss value to be processed, wherein the preset loss function adopts a relative entropy loss function; repeat the step of obtaining any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed; and perform an average calculation on each of the loss values to be processed to obtain the first loss value.
  • the above device further includes: a second model training module;
  • the second model training module is used to obtain a plurality of training samples; obtain one of the training samples as a target training sample; perform word vector generation on the corpus fragment in the target training sample to obtain a second word vector; use an initial model to perform sentence vector generation on the corpus fragment definition in the target training sample to obtain a second sentence vector, where the initial model is a model based on a Bert model or an XLNET model; calculate a loss value according to the second word vector and the second sentence vector to obtain a second loss value; update the parameters of the initial model according to the second loss value, the updated initial model being used for the next calculation of the second sentence vector; repeat the step of obtaining one of the training samples as the target training sample until a second training target is reached; and take the initial model that has reached the second training target as the sentence vector generation model.
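The relative entropy loss used by the first loss value determination submodule can be sketched as follows. This is a minimal illustration, assuming each word vector and sentence vector is first normalized into a probability distribution with a softmax (a detail the text does not specify); each per-pair KL divergence is one loss value to be processed, and their average is the first loss value.

```python
import numpy as np

def softmax(x):
    # Turn a raw vector into a probability distribution (an assumption:
    # the patent does not say how vectors are normalized before the
    # relative entropy loss is applied).
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # Relative entropy D_KL(p || q) = sum_i p_i * log(p_i / q_i).
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vectors, sentence_vectors):
    # One loss value to be processed per (word vector, sentence vector)
    # pair; the average over the batch is the first loss value.
    pending = [
        kl_divergence(softmax(w), softmax(s))
        for w, s in zip(word_vectors, sentence_vectors)
    ]
    return sum(pending) / len(pending)
```

Minimizing this loss pulls each corpus fragment's word vector and its definition's sentence vector toward the same distribution, which is the stated goal of the training.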
  • an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3;
  • the computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities;
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and databases.
  • the internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage medium.
  • the database of the computer device is used to store data involved in the sentence vector generation method.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the sentence vector generation method includes: obtaining target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including a corpus fragment and a corpus fragment definition;
  • the target sentence vector corresponding to the target text data is thus obtained by inputting the target text data into the sentence vector generation model; because the neural network is trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the storage medium is a volatile storage medium or a non-volatile storage medium.
  • the sentence vector generation method implemented when the computer program is executed comprises the steps of: obtaining target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including a corpus fragment and a corpus fragment definition;
  • because the neural network is trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and discloses a sentence vector generation method and apparatus, a device, and a storage medium. The method comprises: acquiring target text data; and inputting the target text data into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with multiple training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition. Because the sentence vector generation model is obtained by training the neural network on corpus fragments and corpus fragment definitions, the training difficulty is reduced, and constructing sentence vectors by means of unsupervised learning methods or contrastive learning-based methods is avoided.

Description

Sentence vector generation method, apparatus, device and storage medium

This application claims priority to the Chinese patent application No. 202111250467.8, filed with the China Patent Office on October 26, 2021 and entitled "Sentence vector generation method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of natural language processing in artificial intelligence, and in particular to a sentence vector generation method, apparatus, device and storage medium.
Background

In current natural language processing (NLP) application scenarios, a sentence vector encodes the textual information of a sentence into a fixed dense vector space. Sentence vectors play an important role in many NLP tasks, such as classification, clustering, and sentence similarity measurement.

Existing methods for constructing sentence vectors include unsupervised learning methods and contrastive learning-based methods. The inventors realized that unsupervised learning methods require a large amount of corpus for training and the resulting models are difficult to converge, so such methods are gradually being abandoned. Contrastive learning-based methods train a model mainly by constructing positive and negative samples. The difficulty with these methods is that text is discrete, non-continuous data, so positive samples cannot be constructed simply by flipping or cropping as with image data; only high-quality positive samples can train a good contrastive learning model, which prevents these methods from being widely applied.
Technical Problem

The main purpose of the present application is to provide a sentence vector generation method, apparatus, device and storage medium, aiming to solve the technical problems that the prior art constructs sentence vectors by unsupervised learning methods, which require a large amount of corpus and produce models that are difficult to converge, or by contrastive learning-based methods, which require high-quality positive samples.
Technical Solution

In a first aspect, the present application proposes a sentence vector generation method, the method comprising:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a second aspect, the present application further proposes a sentence vector generation apparatus, the apparatus comprising:

a data acquisition module, configured to obtain target text data;

a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a third aspect, the present application further proposes a computer device, comprising:

one or more processors;

a memory;

one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform a sentence vector generation method;

wherein the sentence vector generation method comprises:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a fourth aspect, the present application further proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, a sentence vector generation method is implemented, wherein the sentence vector generation method comprises the following steps:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
Beneficial Effects

According to the sentence vector generation method, apparatus, device and storage medium of the present application, the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
Description of Drawings

FIG. 1 is a schematic flow chart of a sentence vector generation method according to an embodiment of the present application;

FIG. 2 is a schematic structural block diagram of a sentence vector generation apparatus according to an embodiment of the present application;

FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.

The realization of the purpose, functional features and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Best Mode for Carrying Out the Invention

To solve the above problems, the present application provides a sentence vector generation method, which relates to the technical field of natural language processing in artificial intelligence. Referring to FIG. 1, a schematic flow chart of a sentence vector generation method according to an embodiment of the present application, the method includes the following steps:
S1: obtaining target text data;

S2: inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data. Since the sentence vector generation model is obtained by training a neural network on training samples each comprising a corpus fragment and a corpus fragment definition, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
For S1, the target text data may be input by a user, obtained from a database, or obtained from a third-party application system.

The target text data is the text data for which a sentence vector needs to be generated.

When the sentence vectors generated by the present application are applied to a book recommendation scenario, the target text data may be one or more of the title, abstract and keywords of a book.
For S2, the target text data is input into the sentence vector generation model for sentence vector generation, and the generated sentence vector is taken as the target sentence vector corresponding to the target text data.

The sentence vector generation model is a model obtained by training a neural network with a plurality of training samples; the neural network includes, but is not limited to, the Bert (Bidirectional Encoder Representations from Transformers) model and the XLNET (generalized autoregressive pretraining) model.

Each of the training samples includes a corpus fragment and a corpus fragment definition. A corpus fragment includes one or more characters, and a corpus fragment definition is an explanation of the corpus fragment. That is, a corpus fragment and its definition form a text pair. Training the neural network on such text pairs pulls the corpus fragment and its definition closer together, which helps improve the accuracy of the sentence vectors.
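As an illustration of step S2 in the book recommendation scenario, the sketch below encodes two book titles and compares their sentence vectors with cosine similarity. The `SentenceVectorModel` class and its `encode` method are hypothetical stand-ins for the trained model (the patent does not fix an API); the deterministic pseudo-embedding only keeps the sketch self-contained, where a real model would run the trained Bert/XLNET-based encoder.

```python
import hashlib
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two sentence vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SentenceVectorModel:
    """Hypothetical stand-in for the trained sentence vector
    generation model; a real implementation would run the trained
    neural network here."""

    def __init__(self, dim=128):
        self.dim = dim

    def encode(self, text):
        # Deterministic pseudo-embedding so the sketch is runnable.
        seed = int(hashlib.md5(text.encode("utf-8")).hexdigest()[:8], 16)
        rng = np.random.default_rng(seed)
        return rng.normal(size=self.dim)

model = SentenceVectorModel()
title_vector = model.encode("机器学习导论")   # S1: target text data (a book title)
query_vector = model.encode("深度学习入门")   # another book title
score = cosine_similarity(title_vector, query_vector)
```

In a recommendation pipeline, books whose target sentence vectors score highest against a query vector would be recommended first.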
Training the neural network with a plurality of training samples to obtain the sentence vector generation model includes: performing word vector generation on the corpus fragment in each training sample to obtain a corpus fragment word vector; using an initial model based on the neural network to perform sentence vector generation on the corpus fragment definition in each training sample to obtain a corpus fragment definition sentence vector; and training the initial model according to the corpus fragment word vectors and the corpus fragment definition sentence vectors, taking the initial model at the end of training as the sentence vector generation model. This shortens the distance between the word vector of a corpus fragment and the sentence vector of its definition and improves the accuracy of the sentence vectors generated by the model, without requiring a generative NLP training method; training is therefore less difficult, consumes fewer resources, and converges easily.

Optionally, a learning rate of 0.0005 is used when training the neural network with the plurality of training samples.
In one embodiment, before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:

S211: obtaining a plurality of the training samples;

S212: obtaining a preset batch number of the training samples as a training sample set;

S213: performing word vector generation on each corpus fragment in the training sample set to obtain a first word vector;

S214: using an initial model to perform sentence vector generation on each corpus fragment definition in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;

S215: calculating a loss value according to each first word vector and each first sentence vector to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;

S216: repeating the step of obtaining a preset batch number of the training samples as a training sample set until a first training target is reached;

S217: taking the initial model that has reached the first training target as the sentence vector generation model.
In this embodiment, the first word vectors are determined from the corpus fragments, the first sentence vectors are determined from the corpus fragment definitions, and the loss value is calculated from the first word vectors and the first sentence vectors, thereby shortening the distance between the word vector of a corpus fragment and the sentence vector of its definition and improving the accuracy of the sentence vectors generated by the model, without requiring a generative NLP training method; training is therefore less difficult, consumes fewer resources, and converges easily. Moreover, training in batches on a training sample set each time prevents abnormal training samples from unduly affecting the parameters of the initial model, which helps improve training accuracy.
For S211, the plurality of training samples may be input by a user, obtained from a database, or obtained from a third-party application system.

For S212, a preset batch number of the training samples are obtained from the training samples, and the obtained training samples are taken as the training sample set.

Optionally, the preset batch number is set to 64. It can be understood that the preset batch number may also be set to other values, which is not limited here.
For S213, word vector generation is performed on each corpus fragment in the training sample set, and each generated word vector is taken as a first word vector (that is, a corpus fragment word vector). In other words, each corpus fragment corresponds to one first word vector.

For S214, the initial model is used to perform sentence vector generation on each corpus fragment definition in the training sample set, and each generated sentence vector is taken as a first sentence vector (that is, a corpus fragment definition sentence vector). In other words, each corpus fragment definition corresponds to one first sentence vector.

The initial model is a model based on the Bert model or the XLNET model; it can be understood that other models may also be used for the initial model, which is not limited here.
For S215, a loss value is calculated according to each first word vector and each first sentence vector to obtain the first loss value, so that one first loss value is calculated per batch.

The number of first word vectors equals the preset batch number, and the number of first sentence vectors likewise equals the preset batch number.

The method steps for updating the parameters of the initial model according to the first loss value are not described in detail here.

The updated initial model is used for the next calculation of the first sentence vectors, thereby iteratively updating the initial model.
For S216, steps S212 to S216 are repeated until the first training target is reached.

The first training target includes: the first loss value satisfying a first convergence condition, or the number of iterations of the initial model reaching a second convergence condition.

The first convergence condition means that the first loss values of two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).

The number of iterations refers to the number of times the first loss value has been calculated; that is, each calculation increases the number of iterations by 1.

The second convergence condition is a specific numerical value.

For S217, the initial model that has reached the first training target is a model that meets the expected requirements, so it is taken as the sentence vector generation model.
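The batch training loop of steps S212 to S217 can be sketched as follows. The loss and gradient here are toy stand-ins for the actual relative entropy loss over the Bert/XLNET-based initial model, and the stopping rule models the two stated conditions: the change in loss between adjacent calculations falling below a tolerance, or the iteration count reaching a fixed limit.

```python
import random

def train(samples, batch_size=64, lr=0.0005, max_iters=1000, tol=1e-4):
    """Sketch of steps S212-S217 (the model, loss and gradient are
    toy stand-ins; only the control flow mirrors the patent)."""
    params = 1.0                      # placeholder for model parameters
    prev_loss = None
    for it in range(1, max_iters + 1):
        # S212: obtain a preset batch number of training samples.
        batch = random.sample(samples, min(batch_size, len(samples)))
        # S213-S215: word vectors, sentence vectors, first loss value.
        # A toy quadratic loss stands in for the relative entropy loss.
        loss = sum((params - s) ** 2 for s in batch) / len(batch)
        grad = sum(2 * (params - s) for s in batch) / len(batch)
        params -= lr * grad           # update the initial model parameters
        # S216: first convergence condition (loss change below tolerance)
        # or second convergence condition (max_iters) ends training.
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return params, it                 # S217: the trained model
```

The default batch size of 64 and learning rate of 0.0005 match the optional values given in the text; the tolerance and iteration limit are assumed for illustration.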
In one embodiment, the above step of obtaining a plurality of the training samples includes:

S2111: obtaining dictionary data, the dictionary data including text segments and text segment definitions, a text segment being any one of a single Chinese character, a word, or an idiom, and a text segment definition being an explanation of the text segment;

S2112: obtaining any text segment from the dictionary data as a target text segment;

S2113: generating a training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is taken as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment is taken as the corpus fragment definition of the training sample;

S2114: repeating the step of obtaining any text segment from the dictionary data as a target text segment until the acquisition of the text segments in the dictionary data is completed or a sample generation end signal is obtained.
In this embodiment, the training samples are determined from dictionary data, which is accurate, stable and easy to obtain; this provides a basis for accurately shortening the distance between the word vector of a corpus fragment and the sentence vector of its definition, further improving the accuracy of the sentence vectors generated by the model.
For S2111, the dictionary data may be input by a user, obtained from a database, or obtained from a third-party application system.

Optionally, the dictionary data is derived from the Xinhua Dictionary. It can be understood that the dictionary data may also be derived from other dictionaries, such as English dictionaries or other Chinese dictionaries, which is not limited here.

For S2112, any text segment is obtained from the dictionary data, and the obtained text segment is taken as the target text segment.

For S2113, the training sample is generated according to the target text segment and its corresponding text segment definition, so that the target text segment and its explanation form a text pair, and the text pair is taken as the training sample. No labeled sentence vectors need to be determined, which simplifies the generation of training samples.

For S2114, steps S2112 to S2114 are repeated until the acquisition of the text segments in the dictionary data is completed or a sample generation end signal is obtained. When the acquisition of the text segments in the dictionary data is completed, training samples have been generated for all the data in the dictionary data. When a sample generation end signal is obtained, a sufficient number of training samples have been generated.

The sample generation end signal is generated, according to preset conditions, by the program implementing the present application. For example, the preset condition may be a preset number of samples, which is not specifically limited in this example.
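The sample generation procedure of steps S2111 to S2114 can be sketched as follows; `dictionary_data` is an assumed mapping from text segment to definition, and the sample generation end signal is modeled as a preset sample count.

```python
def build_training_samples(dictionary_data, max_samples=None):
    """Sketch of S2111-S2114: turn dictionary entries into
    (corpus fragment, corpus fragment definition) text pairs.
    `dictionary_data` maps a text segment (single Chinese character,
    word, or idiom) to its definition (an assumed representation)."""
    samples = []
    for text_segment, definition in dictionary_data.items():   # S2112
        samples.append(
            {"corpus_fragment": text_segment,                  # S2113
             "corpus_fragment_definition": definition}
        )
        # S2114: stop when all entries are consumed or a sample
        # generation end condition (here, a preset count) is met.
        if max_samples is not None and len(samples) >= max_samples:
            break
    return samples
```

No sentence vector labels appear anywhere in the pair, which is the point made above: the text pair itself is the supervision signal.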
In one embodiment, the above step of performing word vector generation on each corpus fragment in the training sample set to obtain the first word vector includes:

S2131: performing word segmentation on each corpus fragment in the training sample set to obtain a corpus fragment phrase set;

S2132: using a preset word vector model to perform word vector generation on each phrase in each corpus fragment phrase set to obtain a phrase word vector set;

S2133: performing average calculation on each phrase word vector set to obtain the first word vector.
In this embodiment, each corpus fragment is segmented, a word vector is generated for each phrase, and the word vectors are averaged to obtain the word vector of the corpus fragment. This provides the basis for narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition.
For S2131, word segmentation is performed on each corpus fragment in the training sample set, and all phrases obtained by segmentation form a corpus fragment phrase set. That is, the corpus fragment phrase sets correspond one-to-one with the corpus fragments in the training sample set.
For S2132, the preset word vector model generates a word vector for each phrase in the corpus fragment phrase set, and the generated word vectors form a phrase word vector set. That is, the phrase word vector sets correspond one-to-one with the corpus fragments in the training sample set.
Optionally, the preset word vector model is a pretrained Chinese GloVe (Global Vectors for Word Representation) word vector model.
For S2133, the phrase word vector set is averaged, and the result is used as the first word vector.
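The three steps S2131 to S2133 can be sketched as below. This is a minimal illustration under stated assumptions: the tiny in-memory vector table stands in for a pretrained Chinese GloVe model, and the stub tokenizer stands in for a real Chinese word segmenter.

```python
# Sketch of S2131-S2133: segment a corpus fragment, look up a word vector
# for each phrase, and average the vectors into the fragment's word vector.
import numpy as np

# Placeholder for a pretrained Chinese GloVe table: phrase -> vector.
word_vectors = {
    "机器": np.array([0.2, 0.4, 0.6]),
    "学习": np.array([0.4, 0.2, 0.0]),
}

def segment(text):
    # Stub tokenizer; a real system would use a Chinese word segmenter.
    return ["机器", "学习"]

def fragment_word_vector(fragment):
    phrases = segment(fragment)                   # S2131: phrase set
    vectors = [word_vectors[p] for p in phrases]  # S2132: phrase word vector set
    return np.mean(vectors, axis=0)               # S2133: element-wise average

first_word_vector = fragment_word_vector("机器学习")
```

The average in S2133 is element-wise, so the first word vector has the same dimensionality as the individual phrase vectors.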
In one embodiment, the above step of using the initial model to generate a sentence vector for each corpus fragment definition in the training sample set, obtaining the first sentence vector, includes:
S2141: performing word segmentation on each corpus fragment definition in the training sample set to obtain a definition phrase set;
S2142: inputting each definition phrase set into the initial model for sentence vector generation, obtaining the first sentence vector.
This embodiment performs word segmentation and sentence vector generation on each corpus fragment definition, providing the basis for narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition.
For S2141, word segmentation is performed on each corpus fragment definition in the training sample set, and all phrases obtained by segmentation form a definition phrase set. That is, the definition phrase sets correspond one-to-one with the corpus fragment definitions in the training sample set.
For S2142, the definition phrase set is input into the initial model for sentence vector generation, and the generated sentence vector is used as the first sentence vector. That is, the first sentence vectors correspond one-to-one with the corpus fragment definitions in the training sample set.
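The two steps S2141 and S2142 can be sketched as below. The `TinyEncoder` is a deterministic toy stand-in for the BERT/XLNet-based initial model, and character-level splitting stands in for a real word segmenter; both are assumptions made for illustration only.

```python
# Sketch of S2141-S2142: segment a corpus fragment definition and feed the
# phrase list to the initial model to obtain the first sentence vector.
import numpy as np

class TinyEncoder:
    """Toy stand-in for the initial model: embeds each phrase
    deterministically, then mean-pools into one sentence vector."""
    def __init__(self, dim=4):
        self.dim = dim

    def _embed(self, phrase):
        # Deterministic pseudo-embedding keyed on the phrase's characters.
        rng = np.random.default_rng(sum(ord(c) for c in phrase))
        return rng.standard_normal(self.dim)

    def sentence_vector(self, phrases):
        return np.mean([self._embed(p) for p in phrases], axis=0)

def segment(definition):
    # Stub: character-level split instead of a real word segmenter.
    return list(definition)

model = TinyEncoder()
first_sentence_vector = model.sentence_vector(segment("对文本段的解释说明"))
```

Because the encoder is deterministic, the same definition always maps to the same first sentence vector, mirroring the one-to-one correspondence noted above.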
In one embodiment, the step of calculating a loss value from each first word vector and each first sentence vector to obtain the first loss value includes:
S2151: acquiring any one of the first word vectors as a target word vector;
S2152: inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to compute a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
S2153: repeating the step of acquiring any one of the first word vectors as the target word vector until all of the first word vectors have been acquired;
S2154: averaging the pending loss values to obtain the first loss value.
This embodiment uses the relative entropy loss function as the preset loss function, which helps draw the vectors closer together. The first loss value is obtained by averaging the pending loss values, so the parameters of the initial model are updated only once per training batch.
For S2151, any one of the first word vectors is acquired and used as the target word vector.
For S2152, the target word vector and the first sentence vector corresponding to the target word vector are input into the preset loss function, and the computed loss is used as a pending loss value.
For S2153, steps S2151 to S2153 are repeated until all of the first word vectors have been acquired.
For S2154, the pending loss values are averaged, and the result is used as the first loss value.
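Steps S2151 to S2154 can be sketched as below. One detail is an assumption: the application does not specify how the vectors are turned into probability distributions for the relative entropy (KL divergence) loss, so a softmax normalisation is used here purely for illustration.

```python
# Sketch of S2151-S2154: compute a relative-entropy (KL) loss for each
# (target word vector, matching first sentence vector) pair, then average
# the pending loss values into one first loss value for the batch.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def kl_divergence(p, q):
    # Relative entropy D(p || q) between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vectors, sentence_vectors):
    pending = [kl_divergence(softmax(w), softmax(s))   # S2151-S2153
               for w, s in zip(word_vectors, sentence_vectors)]
    return sum(pending) / len(pending)                 # S2154: batch average

w = [np.array([0.1, 0.2, 0.3]), np.array([0.3, 0.2, 0.1])]
s = [np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 0.0])]
loss = first_loss(w, s)
```

Averaging before the update is what makes this a once-per-batch parameter update: a single scalar loss drives one backward pass, rather than one pass per sample.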
In one embodiment, before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:
S221: acquiring a plurality of the training samples;
S222: acquiring one of the training samples as a target training sample;
S223: generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
S224: using an initial model to generate a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;
S225: calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next computation of the second sentence vector;
S226: repeating the step of acquiring one of the training samples as a target training sample until a second training objective is reached;
S227: using the initial model that reaches the second training objective as the sentence vector generation model.
This embodiment determines the second word vector from the corpus fragment and the second sentence vector from the corpus fragment definition, then calculates a loss value from the two, thereby narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition. This improves the accuracy of the sentence vectors generated by the sentence vector generation model. Moreover, no generative NLP training scheme is required, which reduces training difficulty, consumes fewer resources, and makes the model easier to converge.
For S221, the plurality of training samples may be obtained from user input, from a database, or from a third-party application system.
For S222, one of the training samples is acquired and used as the target training sample.
For S223, the corpus fragment in the target training sample is segmented, a word vector is generated for each phrase obtained by segmentation, and the generated word vectors are averaged; the result is used as the second word vector (that is, the corpus fragment word vector).
For S224, the corpus fragment definition in the target training sample is segmented, the phrases obtained by segmentation are input into the initial model for sentence vector generation, and the generated sentence vector is used as the second sentence vector (that is, the corpus fragment definition sentence vector).
For S225, a loss value is calculated from the second word vector and the second sentence vector, and the result is used as the second loss value.
The step of updating the parameters of the initial model according to the second loss value is not described again here.
For S226, steps S222 to S226 are repeated until the second training objective is reached.
The second training objective includes: the second loss value satisfying a third convergence condition, or the number of iterations of the initial model reaching a fourth convergence condition.
The third convergence condition means that the second loss values of two consecutive computations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations refers to the number of times the second loss value has been computed; that is, each computation increases the iteration count by 1.
The fourth convergence condition is a specific numerical value.
For S227, the initial model that reaches the second training objective is a model that meets the expected requirements, so the initial model that reaches the second training objective is used as the sentence vector generation model.
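The per-sample training loop of S222 to S227 can be sketched as below. The `StubModel`, the squared-error loss, and the simple convergence checks are placeholders for illustration only; the application's actual scheme uses a BERT/XLNet-based encoder, a relative entropy loss, and the Lipschitz-based convergence condition described above.

```python
# Sketch of S222-S227: for each target training sample, compute the second
# word vector and the second sentence vector, take a loss between them, and
# update the model once per sample until a training objective is reached.
import numpy as np

class StubModel:
    """Toy stand-in for the initial model: holds one vector and nudges it
    toward the target word vector on each update (stand-in for backprop)."""
    def __init__(self, dim=3):
        self.vec = np.zeros(dim)

    def encode(self, definition):
        return self.vec

    def step(self, target, lr=0.5):
        self.vec += lr * (target - self.vec)

def train(samples, model, word_vec_fn, max_iters=100, tol=1e-6):
    prev = None
    for it in range(max_iters):                          # iteration-count objective
        sample = samples[it % len(samples)]              # S222
        w = word_vec_fn(sample["corpus_fragment"])       # S223: second word vector
        s = model.encode(sample["fragment_definition"])  # S224: second sentence vector
        loss = float(np.sum((w - s) ** 2))               # S225: placeholder loss
        if prev is not None and abs(loss - prev) < tol:  # loss-convergence objective
            break
        model.step(w)                                    # one update per sample
        prev = loss
    return model                                         # S227: trained model

samples = [{"corpus_fragment": "词",
            "fragment_definition": "语言里最小的可以独立运用的单位"}]
word_vec_fn = lambda fragment: np.array([1.0, 2.0, 3.0])
trained = train(samples, StubModel(), word_vec_fn)
```

Note the contrast with the batched variant above: here the parameters change after every sample, whereas S2151 to S2154 average the pending losses and update once per batch.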
Referring to FIG. 2, the present application further provides a sentence vector generation apparatus, the apparatus including:
a data acquisition module 100, configured to acquire target text data;
a sentence vector generation module 200, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
In one embodiment, the above apparatus further includes: a first model training module;
the first model training module is configured to: acquire a plurality of the training samples; acquire a preset batch number of the training samples as a training sample set; generate a word vector from each corpus fragment in the training sample set to obtain a first word vector; use an initial model to generate a sentence vector for each corpus fragment definition in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model; calculate a loss value from each first word vector and each first sentence vector to obtain a first loss value; update the parameters of the initial model according to the first loss value, and use the updated initial model for the next computation of the first sentence vector; repeat the step of acquiring a preset batch number of the training samples as a training sample set until a first training objective is reached; and use the initial model that reaches the first training objective as the sentence vector generation model.
In one embodiment, the above first model training module includes: a training sample generation submodule;
the training sample generation submodule is configured to: acquire dictionary data, the dictionary data including text segments and text segment definitions, a text segment being any one of a single Chinese character, a word, or an idiom, and the text segment definition being an explanation of the text segment; acquire any text segment from the dictionary data as a target text segment; generate the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus fragment of the training sample and the text segment definition corresponding to the target text segment is used as the corpus fragment definition of the training sample; and repeat the step of acquiring any text segment from the dictionary data as a target text segment until the acquisition of the text segments in the dictionary data is complete or a sample generation end signal is received.
In one embodiment, the above first model training module further includes: a first word vector determination submodule;
the first word vector determination submodule is configured to: perform word segmentation on each corpus fragment in the training sample set to obtain a corpus fragment phrase set; use a preset word vector model to generate a word vector for each phrase in each corpus fragment phrase set, obtaining a phrase word vector set; and average each phrase word vector set to obtain the first word vector.
In one embodiment, the above first model training module further includes: a first sentence vector determination submodule;
the first sentence vector determination submodule is configured to: perform word segmentation on each corpus fragment definition in the training sample set to obtain a definition phrase set; and input each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
In one embodiment, the above first model training module further includes: a first loss value determination submodule;
the first loss value determination submodule is configured to: acquire any one of the first word vectors as a target word vector; input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to compute a pending loss value, wherein the preset loss function is a relative entropy loss function; repeat the step of acquiring any one of the first word vectors as a target word vector until all of the first word vectors have been acquired; and average the pending loss values to obtain the first loss value.
In one embodiment, the above apparatus further includes: a second model training module;
the second model training module is configured to: acquire a plurality of the training samples; acquire one of the training samples as a target training sample; generate a word vector from the corpus fragment in the target training sample to obtain a second word vector; use an initial model to generate a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model; calculate a loss value from the second word vector and the second sentence vector to obtain a second loss value; update the parameters of the initial model according to the second loss value, and use the updated initial model for the next computation of the second sentence vector; repeat the step of acquiring one of the training samples as a target training sample until a second training objective is reached; and use the initial model that reaches the second training objective as the sentence vector generation model.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data for the sentence vector generation method. The network interface of the computer device is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a sentence vector generation method. The sentence vector generation method includes: acquiring target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, the storage medium being a volatile storage medium or a non-volatile storage medium. When executed by a processor, the computer program implements a sentence vector generation method including the steps of: acquiring target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In the sentence vector generation method executed above, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Claims (20)

  1. A sentence vector generation method, wherein the method includes:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including: a corpus fragment and a corpus fragment definition.
  2. The sentence vector generation method according to claim 1, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating a word vector from each of the corpus fragments in the training sample set to obtain a first word vector;
    using an initial model to generate a sentence vector for each of the corpus fragment definitions in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value; updating parameters of the initial model according to the first loss value, and using the updated initial model for the next computation of the first sentence vector;
    repeating the step of acquiring a preset batch number of the training samples as a training sample set until a first training objective is reached;
    using the initial model that reaches the first training objective as the sentence vector generation model.
  3. 根据权利要求2所述的句子向量生成方法,其中,所述获取多个所述训练样本的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of obtaining a plurality of training samples includes:
    获取词典数据,所述词典数据包括:文本段和文本段定义,文本段包括:单汉字、词语、成语中的任一种,所述文本段定义是对所述文本段的解释说明;Obtaining dictionary data, the dictionary data includes: a text segment and a text segment definition, the text segment includes: any one of a single Chinese character, a word, an idiom, and the text segment definition is an explanation to the text segment;
    从所述词典数据中获取任一个文本段作为目标文本段;Acquiring any text segment from the dictionary data as a target text segment;
    根据所述目标文本段和所述目标文本段对应的所述文本段定义生成所述训练样本,其中,将所述目标文本段作为所述训练样本的所述语料片段,将所述目标文本段对应的所述文本段定义作为所述训练样本的所述语料片段定义;Generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the target text segment is The corresponding text segment is defined as the corpus segment definition of the training sample;
    重复执行所述从所述词典数据中获取任一个文本段作为目标文本段的步骤,直至完成所述词典数据中的所述文本段的获取或者获取到样本生成结束信号。Repeating the step of acquiring any text segment from the dictionary data as the target text segment until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
  4. 根据权利要求2所述的句子向量生成方法,其中,所述根据所述训练样本集中的每个所述语料片段进行词向量生成,得到第一词向量的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of generating word vectors according to each of the corpus fragments in the training sample set to obtain the first word vectors includes:
    对所述训练样本集中的每个所述语料片段进行分词处理,得到语料片段短语集;Perform word segmentation processing on each of the corpus fragments in the training sample set to obtain a phrase set of corpus fragments;
    采用预设词向量模型,对每个所述语料片段短语集中的各个短语进行词向量生成,得到短语词向量集;Using a preset word vector model to generate a word vector for each phrase in each of the corpus fragment phrase sets, to obtain a phrase word vector set;
    对每个所述短语词向量集进行平均值计算,得到所述第一词向量。performing average calculation on each set of phrase word vectors to obtain the first word vector.
  5. 根据权利要求2所述的句子向量生成方法,其中,所述采用初始模型,对所述训练样本集中的每个所述语料片段定义进行句子向量生成,得到第一句子向量的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of using the initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain a first sentence vector includes:
    对所述训练样本集中的每个所述语料片段定义进行分词处理,得到定义短语集;Perform word segmentation processing on each of the corpus fragment definitions in the training sample set to obtain a defined phrase set;
    将每个所述定义短语集输入所述初始模型进行句子向量生成,得到所述第一句子向量。Inputting each defined phrase set into the initial model to generate sentence vectors to obtain the first sentence vectors.
  6. The sentence vector generation method according to claim 2, wherein the step of calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain the first loss value comprises:
    acquiring any one of the first word vectors as a target word vector;
    inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
    repeating the step of acquiring any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed;
    averaging the pending loss values to obtain the first loss value.
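The loss of claim 6 — a relative entropy (KL divergence) per (word vector, sentence vector) pair, averaged over pairs — can be sketched as below. One assumption is made that the claim leaves open: the raw vectors are first normalized into probability distributions with a softmax so that KL divergence is well-defined.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def kl_loss(word_vec, sent_vec):
    """Relative entropy KL(p || q) between the softmax-normalized vectors."""
    p, q = softmax(word_vec), softmax(sent_vec)
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vecs, sent_vecs):
    # one pending loss per corresponding pair, then the average
    losses = [kl_loss(w, s) for w, s in zip(word_vecs, sent_vecs)]
    return sum(losses) / len(losses)

w = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]
s = [np.array([1.0, 2.0]), np.array([2.0, 0.0])]
loss = first_loss(w, s)
```

When a word vector and its sentence vector agree, the pending loss is zero; training drives the averaged loss toward zero, pulling sentence vectors toward the word vectors of their fragments.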
  7. The sentence vector generation method according to claim 1, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring one of the training samples as a target training sample;
    generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
    generating, with an initial model, a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next calculation of the second sentence vector;
    repeating the step of acquiring one of the training samples as the target training sample until a second training target is reached;
    using the initial model that has reached the second training target as the sentence vector generation model.
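The per-sample training loop of claim 7 can be sketched with its control flow only; the encoders, loss function, and update rule below are hypothetical placeholders, since the claim fixes the loop structure rather than those components.

```python
def train(samples, encode_fragment, model, loss_fn, update, max_steps=100):
    """One gradient-style pass, one sample at a time (claim 7)."""
    for step, sample in enumerate(samples, start=1):
        word_vec = encode_fragment(sample["fragment"])   # second word vector
        sent_vec = model(sample["definition"])           # second sentence vector
        loss = loss_fn(word_vec, sent_vec)               # second loss value
        model = update(model, loss)                      # updated model is used next time
        if step >= max_steps:                            # stand-in second training target
            break
    return model

# toy demo: a scalar "model" whose scale shrinks on each update
class ScaleModel:
    def __init__(self, s): self.s = s
    def __call__(self, text): return self.s * len(text)

trained = train(
    samples=[{"fragment": "学习", "definition": "获得知识"}],
    encode_fragment=len,
    model=ScaleModel(1.0),
    loss_fn=lambda w, s: abs(w - s),
    update=lambda m, loss: ScaleModel(m.s * 0.9),
)
```

In the claimed method the training target might be a loss threshold or an iteration budget; here a step cap stands in for it.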
  8. A sentence vector generation apparatus, wherein the apparatus comprises:
    a data acquisition module, configured to acquire target text data;
    a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  9. A computer device, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform a sentence vector generation method,
    wherein the sentence vector generation method comprises:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  10. The computer device according to claim 9, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating word vectors from each of the corpus fragments in the training sample set to obtain first word vectors;
    generating, with an initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain first sentence vectors, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;
    repeating the step of acquiring a preset batch number of the training samples as the training sample set until a first training target is reached;
    using the initial model that has reached the first training target as the sentence vector generation model.
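Unlike the one-sample-at-a-time loop of claim 7, claims 10 and 17 train on a preset batch of samples per iteration. The control flow can be sketched as follows, again with hypothetical placeholder encoders, loss, and update rule:

```python
import random

def train_batched(samples, batch_size, encode_fragments, model, loss_fn,
                  update, target_loss=0.01, max_iters=50):
    """Batch-based training loop of claims 10/17 (placeholder components)."""
    loss = float("inf")
    for _ in range(max_iters):
        batch = random.sample(samples, batch_size)           # training sample set
        word_vecs = encode_fragments([s["fragment"] for s in batch])
        sent_vecs = model([s["definition"] for s in batch])
        loss = loss_fn(word_vecs, sent_vecs)                 # first loss value
        model = update(model, loss)                          # used for the next batch
        if loss <= target_loss:                              # first training target
            break
    return model, loss

# toy demo: a scalar model whose scale is nudged toward 1 on each update
class ScaleEncoder:
    def __init__(self, s): self.s = s
    def __call__(self, texts): return [self.s * len(t) for t in texts]

samples = [{"fragment": "学习", "definition": "学习"}] * 4
model, final_loss = train_batched(
    samples, batch_size=2,
    encode_fragments=lambda texts: [float(len(t)) for t in texts],
    model=ScaleEncoder(2.0),
    loss_fn=lambda ws, ss: sum(abs(w - s) for w, s in zip(ws, ss)) / len(ws),
    update=lambda m, loss: ScaleEncoder((m.s + 1.0) / 2),
)
```

Each iteration draws a fresh batch, so the "first training target" here is a loss threshold checked once per batch; an iteration cap guards against non-convergence.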
  11. The computer device according to claim 10, wherein the step of acquiring a plurality of the training samples comprises:
    acquiring dictionary data, the dictionary data comprising text segments and text segment definitions, wherein a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the corresponding text segment;
    acquiring any text segment from the dictionary data as a target text segment;
    generating the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment serves as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment serves as the corpus fragment definition of the training sample;
    repeating the step of acquiring any text segment from the dictionary data as the target text segment, until the acquisition of the text segments in the dictionary data is completed or a sample-generation end signal is received.
  12. The computer device according to claim 10, wherein the step of generating word vectors from each of the corpus fragments in the training sample set to obtain the first word vector comprises:
    performing word segmentation on each of the corpus fragments in the training sample set to obtain a corpus fragment phrase set;
    generating, with a preset word vector model, a word vector for each phrase in each corpus fragment phrase set to obtain a phrase word vector set;
    averaging each phrase word vector set to obtain the first word vector.
  13. The computer device according to claim 10, wherein the step of generating, with the initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain the first sentence vector comprises:
    performing word segmentation on each of the corpus fragment definitions in the training sample set to obtain a definition phrase set;
    inputting each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
  14. The computer device according to claim 10, wherein the step of calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain the first loss value comprises:
    acquiring any one of the first word vectors as a target word vector;
    inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
    repeating the step of acquiring any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed;
    averaging the pending loss values to obtain the first loss value.
  15. The computer device according to claim 10, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring one of the training samples as a target training sample;
    generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
    generating, with an initial model, a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next calculation of the second sentence vector;
    repeating the step of acquiring one of the training samples as the target training sample until a second training target is reached;
    using the initial model that has reached the second training target as the sentence vector generation model.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements a sentence vector generation method, the sentence vector generation method comprising the following steps:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  17. The computer-readable storage medium according to claim 16, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating word vectors from each of the corpus fragments in the training sample set to obtain first word vectors;
    generating, with an initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain first sentence vectors, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;
    repeating the step of acquiring a preset batch number of the training samples as the training sample set until a first training target is reached;
    using the initial model that has reached the first training target as the sentence vector generation model.
  18. The computer-readable storage medium according to claim 17, wherein the step of acquiring a plurality of the training samples comprises:
    acquiring dictionary data, the dictionary data comprising text segments and text segment definitions, wherein a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the corresponding text segment;
    acquiring any text segment from the dictionary data as a target text segment;
    generating the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment serves as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment serves as the corpus fragment definition of the training sample;
    repeating the step of acquiring any text segment from the dictionary data as the target text segment, until the acquisition of the text segments in the dictionary data is completed or a sample-generation end signal is received.
  19. The computer-readable storage medium according to claim 17, wherein the step of generating word vectors from each of the corpus fragments in the training sample set to obtain the first word vector comprises:
    performing word segmentation on each of the corpus fragments in the training sample set to obtain a corpus fragment phrase set;
    generating, with a preset word vector model, a word vector for each phrase in each corpus fragment phrase set to obtain a phrase word vector set;
    averaging each phrase word vector set to obtain the first word vector.
  20. The computer-readable storage medium according to claim 17, wherein the step of generating, with the initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain the first sentence vector comprises:
    performing word segmentation on each of the corpus fragment definitions in the training sample set to obtain a definition phrase set;
    inputting each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
PCT/CN2022/090157 2021-10-26 2022-04-29 Sentence vector generation method and apparatus, device, and storage medium WO2023071115A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111250467.8 2021-10-26
CN202111250467.8A CN113935315A (en) 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023071115A1 true WO2023071115A1 (en) 2023-05-04

Family

ID=79284360

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090157 WO2023071115A1 (en) 2021-10-26 2022-04-29 Sentence vector generation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113935315A (en)
WO (1) WO2023071115A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium
CN114519395B (en) * 2022-02-22 2024-05-14 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960804A (en) * 2019-03-21 2019-07-02 江西风向标教育科技有限公司 A kind of topic text sentence vector generation method and device
WO2020106267A1 (en) * 2018-11-19 2020-05-28 Genesys Telecommunications Laboratories, Inc. Method and system for sentiment analysis
CN111222329A (en) * 2019-12-10 2020-06-02 上海八斗智能技术有限公司 Sentence vector training method and model, and sentence vector prediction method and system
CN111709223A (en) * 2020-06-02 2020-09-25 上海硬通网络科技有限公司 Method and device for generating sentence vector based on bert and electronic equipment
CN112016296A (en) * 2020-09-07 2020-12-01 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN113935315A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2022104967A1 (en) Pre-training language model-based summarization generation method
WO2022057776A1 (en) Model compression method and apparatus
CN110069790B (en) Machine translation system and method for contrasting original text through translated text retranslation
CN111859987B (en) Text processing method, training method and device for target task model
WO2023071115A1 (en) Sentence vector generation method and apparatus, device, and storage medium
CN109492202B (en) Chinese error correction method based on pinyin coding and decoding model
CN110931137B (en) Machine-assisted dialog systems, methods, and apparatus
WO2021044908A1 (en) Translation device, translation method, and program
CN111666775B (en) Text processing method, device, equipment and storage medium
WO2023159758A1 (en) Data enhancement method and apparatus, electronic device, and storage medium
US20230023789A1 (en) Method for identifying noise samples, electronic device, and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
US20220058349A1 (en) Data processing method, device, and storage medium
US20210174003A1 (en) Sentence encoding and decoding method, storage medium, and device
CN112016300A (en) Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN111985218A (en) Automatic judicial literature proofreading method based on generation of confrontation network
US20200364543A1 (en) Computationally efficient expressive output layers for neural networks
CN111951785B (en) Voice recognition method and device and terminal equipment
EP4109443A2 (en) Method for correcting text, method for generating text correction model, device and medium
US20230205994A1 (en) Performing machine learning tasks using instruction-tuned neural networks
CN114330375A (en) Term translation method and system based on fixed paradigm
Chen et al. Reinforced zero-shot cross-lingual neural headline generation
WO2021237928A1 (en) Training method and apparatus for text similarity recognition model, and related device
Jiang et al. English-Vietnamese machine translation model based on sequence to sequence algorithm
CN111858899A (en) Statement processing method, device, system and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885029

Country of ref document: EP

Kind code of ref document: A1