WO2023071115A1 - Sentence vector generation method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023071115A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence vector
target
model
sentence
training
Prior art date
Application number
PCT/CN2022/090157
Other languages
French (fr)
Chinese (zh)
Inventor
陈浩 (Chen Hao)
谯轶轩 (Qiao Yixuan)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2023071115A1 publication Critical patent/WO2023071115A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to the technical field of natural language processing in artificial intelligence, in particular to a sentence vector generation method, device, equipment and storage medium.
  • a sentence vector encodes the text information of a sentence into a fixed dense vector space.
  • sentence vectors play an important role in various NLP tasks, and are applied to NLP tasks such as classification, clustering, and sentence similarity measurement.
  • Methods for constructing sentence vectors include unsupervised learning methods or contrastive learning-based methods.
  • the inventor realized that the training process of the unsupervised learning method requires a large amount of corpus and the model is difficult to converge, which has led to this method gradually falling out of use.
  • the model is mainly trained by constructing positive and negative samples.
  • the difficulty of this method is that text data is discrete rather than continuous, so positive samples cannot be constructed simply by flipping and cropping as with image data. Only high-quality positive samples can train a good contrastive learning model, which makes it difficult for this method to be widely applied.
  • the main purpose of this application is to provide a sentence vector generation method, device, equipment and storage medium, aiming to solve the problems that the prior art constructs sentence vectors by unsupervised learning or contrastive learning-based methods, that the unsupervised learning method requires a large amount of corpus and its model is difficult to converge, and that the contrastive learning-based method requires high-quality positive samples.
  • the application proposes a method for generating sentence vectors, the method comprising:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a sentence vector generating device, the device comprising:
  • a data acquisition module configured to acquire target text data
  • the sentence vector generation module is used to input the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a computer device, including:
  • one or more processors; a memory; and
  • one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform a sentence vector generation method:
  • sentence vector generation method comprises:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the present application also proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, a sentence vector generation method is implemented, wherein the sentence vector generation method includes the following steps:
  • the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • according to the sentence vector generation method, device, equipment and storage medium of the present application, the method generates the sentence vector by inputting the target text data into the sentence vector generation model to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
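The inference step summarized above (target text data in, fixed-size target sentence vector out) can be sketched as follows. This is a minimal illustration, not the application's actual model: the whitespace tokenizer, the toy embedding table, and the `SentenceVectorGenerationModel` class are assumed stand-ins, and mean pooling is just one plausible way to map token vectors to a fixed-size sentence vector.

```python
def tokenize(text):
    # Whitespace tokenization stands in for a real (e.g. Chinese) segmenter.
    return text.split()

class SentenceVectorGenerationModel:
    def __init__(self, embeddings, dim):
        self.embeddings = embeddings  # token -> list[float]
        self.dim = dim

    def encode(self, text):
        # Look up a vector per token and mean-pool them into one
        # fixed-size dense sentence vector.
        vecs = [self.embeddings.get(t, [0.0] * self.dim) for t in tokenize(text)]
        if not vecs:
            return [0.0] * self.dim
        return [sum(col) / len(vecs) for col in zip(*vecs)]

embeddings = {"deep": [1.0, 0.0], "learning": [0.0, 1.0]}
model = SentenceVectorGenerationModel(embeddings, dim=2)
target_sentence_vector = model.encode("deep learning")  # -> [0.5, 0.5]
```

In the application itself the encoder would be a trained neural network (e.g. Bert-based) rather than a static lookup table, but the input/output contract is the same.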
  • Fig. 1 is a schematic flowchart of the sentence vector generation method according to an embodiment of the present application;
  • Fig. 2 is a schematic structural block diagram of a sentence vector generation device according to an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a schematic flow diagram of a method for generating sentence vectors according to an embodiment of the present application. The method includes the following steps:
  • S2 Input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition.
  • the target sentence vector corresponding to the target text data is obtained by inputting the target text data into the sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
  • the target text data may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the target text data, that is, the text data for which a sentence vector needs to be generated.
  • the target text data may be one or more of title, abstract, and keywords in a book.
  • the target text data is input into a sentence vector generation model to generate a sentence vector, and the generated sentence vector is used as the target sentence vector corresponding to the target text data.
  • the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples
  • the neural network includes but is not limited to: the Bert (Bidirectional Encoder Representations from Transformers) model and the XLNET (a generalized autoregressive pretraining method) model.
  • Each of the training samples includes: corpus fragments and corpus fragment definitions.
  • a corpus fragment includes one or more words.
  • the definition of the corpus fragment is the interpretation of the corpus fragment. That is, corpus fragments and corpus fragment definitions form text pairs.
  • the neural network is trained using these text pairs to shorten the distance between the vector of a corpus fragment and the vector of its definition, thereby improving the accuracy of the sentence vectors.
  • using a plurality of training samples to train the neural network to obtain the sentence vector generation model includes: generating word vectors according to the corpus fragments in the training samples to obtain corpus fragment word vectors; using an initial model based on the neural network to perform sentence vector generation on the corpus fragment definitions in the training samples to obtain corpus fragment definition sentence vectors; and training the initial model according to the corpus fragment word vectors and the corpus fragment definition sentence vectors, with the initial model at the end of training used as the sentence vector generation model.
  • in this way, the distance between the word vector corresponding to the corpus fragment and the sentence vector corresponding to the corpus fragment definition is shortened, and the accuracy with which the sentence vector generation model generates sentence vectors is improved. No generation-based NLP training method is required, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge.
  • a learning rate of 0.0005 is used.
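As a toy illustration of how that learning rate enters a parameter update, the snippet below applies one gradient-descent step with lr = 0.0005. The scalar parameter and the quadratic loss are assumed for illustration only; they are not the application's actual model or loss.

```python
LEARNING_RATE = 0.0005  # learning rate stated in the embodiment

def gradient_step(param, grad, lr=LEARNING_RATE):
    # One plain gradient-descent update: move against the gradient,
    # scaled by the learning rate.
    return param - lr * grad

p = 1.0
# For an illustrative quadratic loss L(p) = p^2, the gradient is 2p.
p = gradient_step(p, 2 * p)  # 1.0 - 0.0005 * 2.0 = 0.999
```

A smaller learning rate like this trades slower convergence for more stable updates, which suits fine-tuning large pretrained encoders.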
  • before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method also includes:
  • S213 Generate a word vector according to each of the corpus fragments in the training sample set to obtain a first word vector
  • S214 Using an initial model, perform sentence vector generation on each of the corpus fragment definitions in the training sample set to obtain a first sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
  • S215 Calculate the loss value according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value, update the parameters of the initial model according to the first loss value, and use the updated initial model for the next calculation of the first sentence vectors;
  • S217 Use the initial model that achieves the first training objective as the sentence vector generation model.
  • the first word vector is determined according to the corpus fragment, and the first sentence vector is determined according to the corpus fragment definition;
  • the loss value is calculated according to each of the first word vectors and each of the first sentence vectors, thereby shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its definition. This improves the accuracy with which the sentence vector generation model generates sentence vectors and requires no generation-based NLP training method, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge. Moreover, a training sample set is used for batch training each time, preventing abnormal training samples from unduly affecting the parameters of the initial model, which helps improve training accuracy.
  • the multiple training samples may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the preset batch number is set to 64. It can be understood that the preset batch number can also be set to other values, which is not limited here.
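Drawing training sample sets of the preset batch number can be sketched as follows; the function and variable names are illustrative, not from the application.

```python
PRESET_BATCH_NUMBER = 64  # value used in the embodiment above

def iter_batches(samples, batch_size=PRESET_BATCH_NUMBER):
    # Yield successive training sample sets of at most batch_size samples.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Toy (corpus fragment, corpus fragment definition) pairs.
samples = [("fragment %d" % i, "definition %d" % i) for i in range(130)]
batches = list(iter_batches(samples))
# 130 samples -> training sample sets of 64, 64, and 2 samples
```

Each yielded set corresponds to one "training sample set" in steps S213–S215: one loss value is computed per set and the model parameters are updated once per set.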
  • a word vector is generated according to each of the corpus fragments in the training sample set, and each generated word vector is used as a first word vector (that is, a corpus fragment word vector). That is to say, each corpus segment corresponds to a first word vector.
  • the initial model is used to generate a sentence vector for each of the corpus fragment definitions in the training sample set, and the generated sentence vector is used as the first sentence vector (that is, the corpus fragment defines a sentence vector). That is to say, each corpus fragment definition corresponds to a first sentence vector.
  • the initial model is a model obtained based on the Bert model or the XLNET model. It can be understood that the initial model may also use other models, which are not limited here.
  • the calculation of the loss value is performed according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value, so as to realize the calculation of a first loss value for each batch.
  • the number of first word vectors is equal to the preset batch number, and the number of first sentence vectors is likewise equal to the preset batch number.
  • the updated initial model is used for the next calculation of the first sentence vector, thereby implementing an iterative update of the initial model.
  • step S212 to step S216 are repeatedly executed until the first training target is reached.
  • the first training target includes: the first loss value reaches a first convergence condition or the number of iterations of the initial model reaches a second convergence condition.
  • the first convergence condition means that the size of the first loss value calculated twice adjacently satisfies the Lipschitz condition (Lipschitz continuous condition).
  • the number of iterations refers to the number of calculations of the first loss value, that is, the number of iterations is increased by 1 after being calculated once.
  • the second convergence condition is a specific numerical value.
  • the initial model that achieves the first training objective is a model that meets expected requirements, so the initial model that achieves the first training objective is used as the sentence vector generation model.
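The first training target above can be sketched as a stopping check. The tolerance on adjacent loss values (standing in for the Lipschitz-style first convergence condition) and the iteration cap (the second convergence condition) are assumed values here, since the application does not fix them.

```python
MAX_ITERATIONS = 1000   # assumed value for the second convergence condition
LOSS_TOLERANCE = 1e-4   # assumed threshold on adjacent first loss values

def reached_first_training_target(prev_loss, curr_loss, iterations):
    # First convergence condition: two adjacent loss values are close enough.
    if prev_loss is not None and abs(prev_loss - curr_loss) < LOSS_TOLERANCE:
        return True
    # Second convergence condition: iteration count reached a fixed value.
    return iterations >= MAX_ITERATIONS
```

The training loop of steps S212–S216 would call this after each first loss value is computed, and the model at the moment the check passes becomes the sentence vector generation model.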
  • the above step of obtaining a plurality of training samples includes:
  • dictionary data includes: text segment and text segment definition
  • the text segment includes: any one of single Chinese characters, words, and idioms
  • the text segment definition is an explanation for the text segment
  • S2113 Generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the target The text segment corresponding to the text segment is defined as the corpus segment definition of the training sample;
  • the training samples are determined according to the dictionary data, which has the advantages of high accuracy, high stability, and easy acquisition.
  • This provides a basis for further improving the accuracy of the sentence vector generation model to generate sentence vectors.
  • the dictionary data may be input by the user, obtained from a database, or obtained from a third-party application system.
  • the dictionary data is data obtained from the Xinhua Dictionary. It can be understood that the dictionary data may also be data obtained from other dictionaries, such as English dictionaries and other Chinese dictionaries, which are not limited here.
  • any text segment is obtained from the dictionary data, and the obtained text segment is used as a target text segment.
  • step S2112 to step S2114 are repeatedly executed until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
  • a sample generation end signal is acquired.
  • the sample generation end signal is generated by the program file implementing the present application according to preset conditions.
  • the preset condition is a preset sample size, which is not specifically limited in this example.
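The sample-generation loop described above (each text segment becomes a corpus fragment, its definition becomes the corpus fragment definition, and generation stops when the dictionary is exhausted or an end signal fires) can be sketched as follows. The dictionary entries and the `max_samples` cutoff are illustrative assumptions, not data from the Xinhua Dictionary.

```python
# Toy dictionary data: text segment -> text segment definition.
dictionary_data = {
    "山": "a large natural elevation of the ground",
    "学习": "to acquire knowledge or skill",
    "画蛇添足": "to ruin something by adding what is superfluous",
}

def generate_training_samples(dictionary, max_samples=None):
    samples = []
    for text_segment, definition in dictionary.items():
        if max_samples is not None and len(samples) >= max_samples:
            break  # stands in for the sample generation end signal
        samples.append({"corpus_fragment": text_segment,
                        "corpus_fragment_definition": definition})
    return samples

training_samples = generate_training_samples(dictionary_data)
```

This mirrors steps S2112–S2114: one training sample per dictionary entry, with the target text segment as the corpus fragment and its definition as the corpus fragment definition.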
  • the above-mentioned step of generating word vectors according to each of the corpus fragments in the training sample set to obtain the first word vector includes:
  • S2131 Perform word segmentation processing on each of the corpus fragments in the training sample set to obtain a phrase set of corpus fragments;
  • S2133 Perform average calculation on each set of phrase word vectors to obtain the first word vector.
  • this embodiment performs word segmentation on the corpus fragments, generates a word vector for each phrase, and averages the word vectors to obtain the word vector of each corpus fragment, which provides a basis for shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its definition.
  • word segmentation is performed on each of the corpus fragments in the training sample set, and all phrases obtained by word segmentation are used as a corpus fragment phrase set. That is to say, there is a one-to-one correspondence between the phrase set of corpus fragments and the corpus fragments in the training sample set.
  • a preset word vector model is used to generate word vectors for each phrase in the corpus fragment phrase set, and each generated word vector is used as a phrase word vector set. That is to say, the phrase word vector set is in one-to-one correspondence with the corpus fragments in the training sample set.
  • the preset word vector model uses the pre-trained Chinese Glove (Global Vectors for Word Representation) word vector model.
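Steps S2131–S2133 (segment a corpus fragment, look up a word vector per phrase, and average per dimension) can be sketched as below. The two-entry vector table is a toy stand-in for a pretrained Chinese Glove model, and the vectors are made up for illustration.

```python
# Toy stand-in for a pretrained Chinese Glove word vector table.
glove_vectors = {
    "机器": [0.2, 0.4],
    "学习": [0.6, 0.0],
}

def first_word_vector(phrases, table, dim=2):
    # Look up each phrase's word vector; unknown phrases get a zero vector.
    vecs = [table.get(p, [0.0] * dim) for p in phrases]
    # Per-dimension average of the phrase word vector set.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# A corpus fragment already segmented into phrases.
v = first_word_vector(["机器", "学习"], glove_vectors)  # ~[0.4, 0.2]
```

Averaging is order-insensitive, which is acceptable here because the corpus fragments are short dictionary headwords rather than full sentences.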
  • the above-mentioned step of using the initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain the first sentence vectors includes:
  • S2142 Input each defined phrase set into the initial model to generate a sentence vector to obtain the first sentence vector.
  • This embodiment implements word segmentation processing and sentence vector generation for the definitions of the corpus fragments, thereby providing a basis for shortening the distance between the word vectors corresponding to the corpus fragments and the sentence vectors corresponding to the corpus fragment definitions.
  • word segmentation is performed on each definition of the corpus segment in the training sample set, and all phrases obtained by word segmentation are used as a set of defined phrases. That is to say, there is a one-to-one correspondence between the defined phrase set and the definitions of the corpus fragments in the training sample set.
  • the step of calculating the loss value according to each of the first word vectors and each of the first sentence vectors to obtain the first loss value includes:
  • S2152 Input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to calculate a loss value to obtain a pending loss value, wherein the preset loss function adopts a relative entropy loss function;
  • S2154 Perform average calculation on each of the loss values to be processed to obtain the first loss value.
  • the relative entropy loss function is used as the preset loss function, which is beneficial to shorten the distance between the vectors.
  • the first loss value is obtained by calculating the average of the loss values to be processed, so that each batch of training updates the parameters of the initial model only once.
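Steps S2151–S2154 can be sketched as follows under stated assumptions: each first word vector and its corresponding first sentence vector are turned into probability distributions with softmax before the relative entropy (KL divergence) loss is applied, and the per-pair pending loss values are averaged into the first loss value. The softmax step is an assumption needed to make relative entropy well defined on raw vectors; the application does not specify it.

```python
import math

def softmax(v):
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def relative_entropy(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); zero when p == q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def first_loss_value(word_vectors, sentence_vectors):
    # One pending loss value per (first word vector, first sentence vector)
    # pair, then one average per batch.
    pending = [relative_entropy(softmax(w), softmax(s))
               for w, s in zip(word_vectors, sentence_vectors)]
    return sum(pending) / len(pending)

loss = first_loss_value([[1.0, 0.0]], [[1.0, 0.0]])  # identical pair -> 0.0
```

Minimizing this loss pulls the definition's sentence vector toward the fragment's word vector, which is exactly the "shortening the distance" objective described above.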
  • any one of the first word vectors is acquired, and the acquired first word vector is used as a target word vector.
  • for S2152, the target word vector and the first sentence vector corresponding to the target word vector are input into the preset loss function for loss value calculation, and the calculated loss value is used as a loss value to be processed.
  • step S2151 to step S2153 are repeatedly executed until the acquisition of the first word vector is completed.
  • before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method also includes:
  • S223 Generate a word vector according to the corpus fragment in the target training sample to obtain a second word vector
  • S224 Using an initial model, generate a sentence vector for the definition of the corpus segment in the target training sample to obtain a second sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
  • S225 Calculate the loss value according to the second word vector and the second sentence vector to obtain a second loss value, update the parameters of the initial model according to the second loss value, and use the updated initial model For calculating the second sentence vector next time;
  • S226 Repeat the step of acquiring one training sample as a target training sample until a second training target is reached;
  • the second word vector is determined according to the corpus fragment
  • the second sentence vector is determined according to the definition of the corpus fragment
  • the loss value is calculated according to the second word vector and the second sentence vector, thereby shortening the distance between the word vector corresponding to the corpus fragment and the sentence vector corresponding to the corpus fragment definition. This improves the accuracy with which the sentence vector generation model generates sentence vectors and requires no generation-based NLP training method, which reduces the training difficulty, occupies fewer resources, and makes the model easy to converge.
  • the multiple training samples may be input by the user, obtained from a database, or obtained from a third-party application system.
  • one training sample is acquired from each of the training samples, and the acquired training sample is used as a target training sample.
  • word segmentation is performed on the corpus fragment in the target training sample, word vector generation is performed for each phrase obtained by segmentation, the average of the generated word vectors is calculated, and the result is used as the second word vector (that is, the corpus fragment word vector).
  • the corpus fragment definition in the target training sample is segmented, each phrase obtained by segmentation is input into the initial model for sentence vector generation, and the generated sentence vector is used as the second sentence vector (that is, the corpus fragment definition sentence vector).
  • a loss value is calculated according to the second word vector and the second sentence vector, and the calculated loss value is used as a second loss value.
  • the second training objective includes: the second loss value reaches a third convergence condition or the number of iterations of the initial model reaches a fourth convergence condition.
  • the third convergence condition means that the size of the second loss value calculated twice adjacently satisfies the Lipschitz condition (Lipschitz continuous condition).
  • the number of iterations refers to the number of calculations of the second loss value, that is, the number of iterations is increased by 1 after being calculated once.
  • the fourth convergence condition is a specific numerical value.
  • the initial model that achieves the second training objective is a model that meets expected requirements, so the initial model that achieves the second training objective is used as the sentence vector generation model.
  • the present application also proposes a sentence vector generation device, and the device includes:
  • Data acquisition module 100 for acquiring target text data
  • the sentence vector generation module 200 is used to input the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each training sample includes a corpus fragment and a corpus fragment definition.
  • the target sentence vector corresponding to the target text data is obtained by inputting the target text data into the sentence vector generation model for sentence vector generation, wherein the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each of the training samples includes a corpus fragment and a corpus fragment definition. Training the neural network based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
  • the above device further includes: a first model training module
  • the first model training module is configured to: obtain a plurality of training samples; obtain a preset batch number of training samples as a training sample set; generate word vectors according to each of the corpus fragments in the training sample set to obtain the first word vectors; use an initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain the first sentence vectors, wherein the initial model is a model obtained based on the Bert model or the XLNET model; calculate the loss value according to each of the first word vectors and each of the first sentence vectors to obtain a first loss value; update the parameters of the initial model according to the first loss value, and use the updated initial model for the next calculation of the first sentence vectors; repeat the step of obtaining a preset batch number of training samples as a training sample set until the first training target is reached; and use the initial model that reaches the first training target as the sentence vector generation model.
  • the above-mentioned first model training module includes: a training sample generation submodule;
  • the training sample generation submodule is used to: obtain dictionary data, where the dictionary data includes text segments and text segment definitions, a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the text segment; obtain any text segment from the dictionary data as a target text segment; generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus fragment of the training sample and the text segment definition corresponding to the target text segment is used as the corpus fragment definition of the training sample; and repeat the step of obtaining any text segment from the dictionary data as the target text segment until the acquisition of text segments in the dictionary data is completed or a sample generation end signal is obtained.
  • the above-mentioned first model training module further includes: a first word vector determination submodule;
  • the first word vector determination submodule is used to: perform word segmentation processing on each of the corpus fragments in the training sample set to obtain corpus fragment phrase sets; use a preset word vector model to generate a word vector for each phrase in each corpus fragment phrase set to obtain phrase word vector sets; and perform an average calculation on each of the phrase word vector sets to obtain the first word vectors.
  • the above-mentioned first model training module also includes: a first sentence vector determination submodule;
  • the first sentence vector determination submodule is used to perform word segmentation processing on each of the corpus fragment definitions in the training sample set to obtain defined phrase sets, and to input each of the defined phrase sets into the initial model for sentence vector generation to obtain the first sentence vectors.
  • the above-mentioned first model training module further includes: a first loss value determination submodule;
  • the first loss value determination submodule is used to: obtain any one of the first word vectors as a target word vector; input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a loss value to be processed, wherein the preset loss function adopts a relative entropy loss function; repeat the step of obtaining any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed; and perform an average calculation on each of the loss values to be processed to obtain the first loss value.
  • the above device further includes: a second model training module;
  • the second model training module is used to obtain a plurality of training samples; obtain one of the training samples as a target training sample; perform word vector generation on the corpus fragment in the target training sample to obtain a second word vector; use an initial model to perform sentence vector generation on the corpus fragment definition in the target training sample to obtain a second sentence vector, where the initial model is a model based on a Bert model or an XLNET model; calculate a loss value according to the second word vector and the second sentence vector to obtain a second loss value; update the parameters of the initial model according to the second loss value, the updated initial model being used for the next calculation of the second sentence vector; repeat the step of obtaining one of the training samples as the target training sample until a second training target is reached; and take the initial model that has reached the second training target as the sentence vector generation model.
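The relative entropy loss used by the first loss value determination submodule can be sketched as follows. This is a minimal illustration, assuming each word vector and sentence vector is first normalized into a probability distribution with a softmax (a detail the text does not specify); each per-pair KL divergence is one loss value to be processed, and their average is the first loss value.

```python
import numpy as np

def softmax(x):
    # Turn a raw vector into a probability distribution (an assumption:
    # the patent does not say how vectors are normalized before the
    # relative entropy loss is applied).
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # Relative entropy D_KL(p || q) = sum_i p_i * log(p_i / q_i).
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vectors, sentence_vectors):
    # One loss value to be processed per (word vector, sentence vector)
    # pair; the average over the batch is the first loss value.
    pending = [
        kl_divergence(softmax(w), softmax(s))
        for w, s in zip(word_vectors, sentence_vectors)
    ]
    return sum(pending) / len(pending)
```

Minimizing this loss pulls each corpus fragment's word vector and its definition's sentence vector toward the same distribution, which is the stated goal of the training.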
  • an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in FIG. 3;
  • the computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities;
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer programs and databases.
  • the internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage medium.
  • the database of the computer device is used to store data involved in the sentence vector generation method.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
  • the sentence vector generation method includes: obtaining target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including a corpus fragment and a corpus fragment definition;
  • the target sentence vector corresponding to the target text data is thus obtained by inputting the target text data into the sentence vector generation model; because the neural network is trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the storage medium is a volatile storage medium or a non-volatile storage medium.
  • the sentence vector generation method implemented when the computer program is executed comprises the steps of: obtaining target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including a corpus fragment and a corpus fragment definition;
  • because the neural network is trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and discloses a sentence vector generation method and apparatus, a device, and a storage medium. The method comprises: acquiring target text data; and inputting the target text data into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with multiple training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition. Because the sentence vector generation model is obtained by training the neural network on corpus fragments and corpus fragment definitions, the training difficulty is reduced, and constructing sentence vectors by means of unsupervised learning methods or contrastive learning-based methods is avoided.

Description

Sentence vector generation method, apparatus, device and storage medium

This application claims priority to the Chinese patent application No. 202111250467.8, filed with the China Patent Office on October 26, 2021 and entitled "Sentence vector generation method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of natural language processing in artificial intelligence, and in particular to a sentence vector generation method, apparatus, device and storage medium.
Background

In current natural language processing (NLP) application scenarios, a sentence vector encodes the textual information of a sentence into a fixed dense vector space. Sentence vectors play an important role in many NLP tasks, such as classification, clustering, and sentence similarity measurement.

Existing methods for constructing sentence vectors include unsupervised learning methods and contrastive learning-based methods. The inventors realized that unsupervised learning methods require a large amount of corpus for training and the resulting models are difficult to converge, so such methods are gradually being abandoned. Contrastive learning-based methods train a model mainly by constructing positive and negative samples. The difficulty with these methods is that text is discrete, non-continuous data, so positive samples cannot be constructed simply by flipping or cropping as with image data; only high-quality positive samples can train a good contrastive learning model, which prevents these methods from being widely applied.
Technical Problem

The main purpose of the present application is to provide a sentence vector generation method, apparatus, device and storage medium, aiming to solve the technical problems that the prior art constructs sentence vectors by unsupervised learning methods, which require a large amount of corpus and produce models that are difficult to converge, or by contrastive learning-based methods, which require high-quality positive samples.
Technical Solution

In a first aspect, the present application proposes a sentence vector generation method, the method comprising:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a second aspect, the present application further proposes a sentence vector generation apparatus, the apparatus comprising:

a data acquisition module, configured to obtain target text data;

a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a third aspect, the present application further proposes a computer device, comprising:

one or more processors;

a memory;

one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform a sentence vector generation method;

wherein the sentence vector generation method comprises:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In a fourth aspect, the present application further proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, a sentence vector generation method is implemented, wherein the sentence vector generation method comprises the following steps:

obtaining target text data;

inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
Beneficial Effects

According to the sentence vector generation method, apparatus, device and storage medium of the present application, the target text data is input into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces the training difficulty and avoids constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods.
Description of Drawings

FIG. 1 is a schematic flow chart of a sentence vector generation method according to an embodiment of the present application;

FIG. 2 is a schematic structural block diagram of a sentence vector generation apparatus according to an embodiment of the present application;

FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.

The realization of the purpose, functional features and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Best Mode for Carrying Out the Invention

To solve the above problems, the present application provides a sentence vector generation method, which relates to the technical field of natural language processing in artificial intelligence. Referring to FIG. 1, a schematic flow chart of a sentence vector generation method according to an embodiment of the present application, the method includes the following steps:
S1: obtaining target text data;

S2: inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples comprising a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data. Since the sentence vector generation model is obtained by training a neural network on training samples each comprising a corpus fragment and a corpus fragment definition, the training difficulty is reduced and constructing sentence vectors by unsupervised learning methods or contrastive learning-based methods is avoided.
For S1, the target text data may be input by a user, obtained from a database, or obtained from a third-party application system.

The target text data is the text data for which a sentence vector needs to be generated.

When the sentence vectors generated by the present application are applied to a book recommendation scenario, the target text data may be one or more of the title, abstract and keywords of a book.
For S2, the target text data is input into the sentence vector generation model for sentence vector generation, and the generated sentence vector is taken as the target sentence vector corresponding to the target text data.

The sentence vector generation model is a model obtained by training a neural network with a plurality of training samples; the neural network includes, but is not limited to, the Bert (Bidirectional Encoder Representations from Transformers) model and the XLNET (generalized autoregressive pretraining) model.

Each of the training samples includes a corpus fragment and a corpus fragment definition. A corpus fragment includes one or more characters, and a corpus fragment definition is an explanation of the corpus fragment. That is, a corpus fragment and its definition form a text pair. Training the neural network on such text pairs pulls the corpus fragment and its definition closer together, which helps improve the accuracy of the sentence vectors.
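As an illustration of step S2 in the book recommendation scenario, the sketch below encodes two book titles and compares their sentence vectors with cosine similarity. The `SentenceVectorModel` class and its `encode` method are hypothetical stand-ins for the trained model (the patent does not fix an API); the deterministic pseudo-embedding only keeps the sketch self-contained, where a real model would run the trained Bert/XLNET-based encoder.

```python
import hashlib
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two sentence vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SentenceVectorModel:
    """Hypothetical stand-in for the trained sentence vector
    generation model; a real implementation would run the trained
    neural network here."""

    def __init__(self, dim=128):
        self.dim = dim

    def encode(self, text):
        # Deterministic pseudo-embedding so the sketch is runnable.
        seed = int(hashlib.md5(text.encode("utf-8")).hexdigest()[:8], 16)
        rng = np.random.default_rng(seed)
        return rng.normal(size=self.dim)

model = SentenceVectorModel()
title_vector = model.encode("机器学习导论")   # S1: target text data (a book title)
query_vector = model.encode("深度学习入门")   # another book title
score = cosine_similarity(title_vector, query_vector)
```

In a recommendation pipeline, books whose target sentence vectors score highest against a query vector would be recommended first.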
Training the neural network with a plurality of training samples to obtain the sentence vector generation model includes: performing word vector generation on the corpus fragment in each training sample to obtain a corpus fragment word vector; using an initial model based on the neural network to perform sentence vector generation on the corpus fragment definition in each training sample to obtain a corpus fragment definition sentence vector; and training the initial model according to the corpus fragment word vectors and the corpus fragment definition sentence vectors, taking the initial model at the end of training as the sentence vector generation model. This shortens the distance between the word vector of a corpus fragment and the sentence vector of its definition and improves the accuracy of the sentence vectors generated by the model, without requiring a generative NLP training method; training is therefore less difficult, consumes fewer resources, and converges easily.

Optionally, a learning rate of 0.0005 is used when training the neural network with the plurality of training samples.
In one embodiment, before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:

S211: obtaining a plurality of the training samples;

S212: obtaining a preset batch number of the training samples as a training sample set;

S213: performing word vector generation on each corpus fragment in the training sample set to obtain a first word vector;

S214: using an initial model to perform sentence vector generation on each corpus fragment definition in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;

S215: calculating a loss value according to each first word vector and each first sentence vector to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;

S216: repeating the step of obtaining a preset batch number of the training samples as a training sample set until a first training target is reached;

S217: taking the initial model that has reached the first training target as the sentence vector generation model.
In this embodiment, the first word vectors are determined from the corpus fragments, the first sentence vectors are determined from the corpus fragment definitions, and the loss value is calculated from the first word vectors and the first sentence vectors, thereby shortening the distance between the word vector of a corpus fragment and the sentence vector of its definition and improving the accuracy of the sentence vectors generated by the model, without requiring a generative NLP training method; training is therefore less difficult, consumes fewer resources, and converges easily. Moreover, training in batches on a training sample set each time prevents abnormal training samples from unduly affecting the parameters of the initial model, which helps improve training accuracy.
For S211, the plurality of training samples may be input by a user, obtained from a database, or obtained from a third-party application system.

For S212, a preset batch number of the training samples are obtained from the training samples, and the obtained training samples are taken as the training sample set.

Optionally, the preset batch number is set to 64. It can be understood that the preset batch number may also be set to other values, which is not limited here.
For S213, word vector generation is performed on each corpus fragment in the training sample set, and each generated word vector is taken as a first word vector (that is, a corpus fragment word vector). In other words, each corpus fragment corresponds to one first word vector.

For S214, the initial model is used to perform sentence vector generation on each corpus fragment definition in the training sample set, and each generated sentence vector is taken as a first sentence vector (that is, a corpus fragment definition sentence vector). In other words, each corpus fragment definition corresponds to one first sentence vector.

The initial model is a model based on the Bert model or the XLNET model; it can be understood that other models may also be used for the initial model, which is not limited here.
For S215, a loss value is calculated according to each first word vector and each first sentence vector to obtain the first loss value, so that one first loss value is calculated per batch.

The number of first word vectors equals the preset batch number, and the number of first sentence vectors likewise equals the preset batch number.

The method steps for updating the parameters of the initial model according to the first loss value are not described in detail here.

The updated initial model is used for the next calculation of the first sentence vectors, thereby iteratively updating the initial model.
For S216, steps S212 to S216 are repeated until the first training target is reached.

The first training target includes: the first loss value satisfying a first convergence condition, or the number of iterations of the initial model reaching a second convergence condition.

The first convergence condition means that the first loss values of two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).

The number of iterations refers to the number of times the first loss value has been calculated; that is, each calculation increases the number of iterations by 1.

The second convergence condition is a specific numerical value.

For S217, the initial model that has reached the first training target is a model that meets the expected requirements, so it is taken as the sentence vector generation model.
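The batch training loop of steps S212 to S217 can be sketched as follows. The loss and gradient here are toy stand-ins for the actual relative entropy loss over the Bert/XLNET-based initial model, and the stopping rule models the two stated conditions: the change in loss between adjacent calculations falling below a tolerance, or the iteration count reaching a fixed limit.

```python
import random

def train(samples, batch_size=64, lr=0.0005, max_iters=1000, tol=1e-4):
    """Sketch of steps S212-S217 (the model, loss and gradient are
    toy stand-ins; only the control flow mirrors the patent)."""
    params = 1.0                      # placeholder for model parameters
    prev_loss = None
    for it in range(1, max_iters + 1):
        # S212: obtain a preset batch number of training samples.
        batch = random.sample(samples, min(batch_size, len(samples)))
        # S213-S215: word vectors, sentence vectors, first loss value.
        # A toy quadratic loss stands in for the relative entropy loss.
        loss = sum((params - s) ** 2 for s in batch) / len(batch)
        grad = sum(2 * (params - s) for s in batch) / len(batch)
        params -= lr * grad           # update the initial model parameters
        # S216: first convergence condition (loss change below tolerance)
        # or second convergence condition (max_iters) ends training.
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return params, it                 # S217: the trained model
```

The default batch size of 64 and learning rate of 0.0005 match the optional values given in the text; the tolerance and iteration limit are assumed for illustration.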
In one embodiment, the above step of obtaining a plurality of the training samples includes:

S2111: obtaining dictionary data, the dictionary data including text segments and text segment definitions, a text segment being any one of a single Chinese character, a word, or an idiom, and a text segment definition being an explanation of the text segment;

S2112: obtaining any text segment from the dictionary data as a target text segment;

S2113: generating a training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is taken as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment is taken as the corpus fragment definition of the training sample;

S2114: repeating the step of obtaining any text segment from the dictionary data as a target text segment until the acquisition of the text segments in the dictionary data is completed or a sample generation end signal is obtained.
In this embodiment, the training samples are determined from dictionary data, which is accurate, stable and easy to obtain; this provides a basis for accurately shortening the distance between the word vector of a corpus fragment and the sentence vector of its definition, further improving the accuracy of the sentence vectors generated by the model.
For S2111, the dictionary data may be input by a user, obtained from a database, or obtained from a third-party application system.

Optionally, the dictionary data is derived from the Xinhua Dictionary. It can be understood that the dictionary data may also be derived from other dictionaries, such as English dictionaries or other Chinese dictionaries, which is not limited here.

For S2112, any text segment is obtained from the dictionary data, and the obtained text segment is taken as the target text segment.

For S2113, the training sample is generated according to the target text segment and its corresponding text segment definition, so that the target text segment and its explanation form a text pair, and the text pair is taken as the training sample. No labeled sentence vectors need to be determined, which simplifies the generation of training samples.

For S2114, steps S2112 to S2114 are repeated until the acquisition of the text segments in the dictionary data is completed or a sample generation end signal is obtained. When the acquisition of the text segments in the dictionary data is completed, training samples have been generated for all the data in the dictionary data. When a sample generation end signal is obtained, a sufficient number of training samples have been generated.

The sample generation end signal is generated, according to preset conditions, by the program implementing the present application. For example, the preset condition may be a preset number of samples, which is not specifically limited in this example.
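The sample generation procedure of steps S2111 to S2114 can be sketched as follows; `dictionary_data` is an assumed mapping from text segment to definition, and the sample generation end signal is modeled as a preset sample count.

```python
def build_training_samples(dictionary_data, max_samples=None):
    """Sketch of S2111-S2114: turn dictionary entries into
    (corpus fragment, corpus fragment definition) text pairs.
    `dictionary_data` maps a text segment (single Chinese character,
    word, or idiom) to its definition (an assumed representation)."""
    samples = []
    for text_segment, definition in dictionary_data.items():   # S2112
        samples.append(
            {"corpus_fragment": text_segment,                  # S2113
             "corpus_fragment_definition": definition}
        )
        # S2114: stop when all entries are consumed or a sample
        # generation end condition (here, a preset count) is met.
        if max_samples is not None and len(samples) >= max_samples:
            break
    return samples
```

No sentence vector labels appear anywhere in the pair, which is the point made above: the text pair itself is the supervision signal.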
In one embodiment, the above step of performing word vector generation on each corpus fragment in the training sample set to obtain the first word vector includes:

S2131: performing word segmentation on each corpus fragment in the training sample set to obtain a corpus fragment phrase set;

S2132: using a preset word vector model to perform word vector generation on each phrase in each corpus fragment phrase set to obtain a phrase word vector set;

S2133: performing average calculation on each phrase word vector set to obtain the first word vector.
In this embodiment, each corpus fragment is segmented, a word vector is generated for each phrase, and the word vectors are averaged to obtain the word vector of the corpus fragment. This provides the basis for narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition.
For S2131, word segmentation is performed on each corpus fragment in the training sample set, and all phrases obtained by segmentation form a corpus fragment phrase set. That is, the corpus fragment phrase sets correspond one-to-one with the corpus fragments in the training sample set.
For S2132, the preset word vector model generates a word vector for each phrase in the corpus fragment phrase set, and the generated word vectors form a phrase word vector set. That is, the phrase word vector sets correspond one-to-one with the corpus fragments in the training sample set.
Optionally, the preset word vector model is a pretrained Chinese GloVe (Global Vectors for Word Representation) word vector model.
For S2133, the phrase word vector set is averaged, and the result is used as the first word vector.
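The three steps S2131 to S2133 can be sketched as below. This is a minimal illustration under stated assumptions: the tiny in-memory vector table stands in for a pretrained Chinese GloVe model, and the stub tokenizer stands in for a real Chinese word segmenter.

```python
# Sketch of S2131-S2133: segment a corpus fragment, look up a word vector
# for each phrase, and average the vectors into the fragment's word vector.
import numpy as np

# Placeholder for a pretrained Chinese GloVe table: phrase -> vector.
word_vectors = {
    "机器": np.array([0.2, 0.4, 0.6]),
    "学习": np.array([0.4, 0.2, 0.0]),
}

def segment(text):
    # Stub tokenizer; a real system would use a Chinese word segmenter.
    return ["机器", "学习"]

def fragment_word_vector(fragment):
    phrases = segment(fragment)                   # S2131: phrase set
    vectors = [word_vectors[p] for p in phrases]  # S2132: phrase word vector set
    return np.mean(vectors, axis=0)               # S2133: element-wise average

first_word_vector = fragment_word_vector("机器学习")
```

The average in S2133 is element-wise, so the first word vector has the same dimensionality as the individual phrase vectors.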
In one embodiment, the above step of using the initial model to generate a sentence vector for each corpus fragment definition in the training sample set, obtaining the first sentence vector, includes:
S2141: performing word segmentation on each corpus fragment definition in the training sample set to obtain a definition phrase set;
S2142: inputting each definition phrase set into the initial model for sentence vector generation, obtaining the first sentence vector.
This embodiment performs word segmentation and sentence vector generation on each corpus fragment definition, providing the basis for narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition.
For S2141, word segmentation is performed on each corpus fragment definition in the training sample set, and all phrases obtained by segmentation form a definition phrase set. That is, the definition phrase sets correspond one-to-one with the corpus fragment definitions in the training sample set.
For S2142, the definition phrase set is input into the initial model for sentence vector generation, and the generated sentence vector is used as the first sentence vector. That is, the first sentence vectors correspond one-to-one with the corpus fragment definitions in the training sample set.
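The two steps S2141 and S2142 can be sketched as below. The `TinyEncoder` is a deterministic toy stand-in for the BERT/XLNet-based initial model, and character-level splitting stands in for a real word segmenter; both are assumptions made for illustration only.

```python
# Sketch of S2141-S2142: segment a corpus fragment definition and feed the
# phrase list to the initial model to obtain the first sentence vector.
import numpy as np

class TinyEncoder:
    """Toy stand-in for the initial model: embeds each phrase
    deterministically, then mean-pools into one sentence vector."""
    def __init__(self, dim=4):
        self.dim = dim

    def _embed(self, phrase):
        # Deterministic pseudo-embedding keyed on the phrase's characters.
        rng = np.random.default_rng(sum(ord(c) for c in phrase))
        return rng.standard_normal(self.dim)

    def sentence_vector(self, phrases):
        return np.mean([self._embed(p) for p in phrases], axis=0)

def segment(definition):
    # Stub: character-level split instead of a real word segmenter.
    return list(definition)

model = TinyEncoder()
first_sentence_vector = model.sentence_vector(segment("对文本段的解释说明"))
```

Because the encoder is deterministic, the same definition always maps to the same first sentence vector, mirroring the one-to-one correspondence noted above.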
In one embodiment, the step of calculating a loss value from each first word vector and each first sentence vector to obtain the first loss value includes:
S2151: acquiring any one of the first word vectors as a target word vector;
S2152: inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to compute a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
S2153: repeating the step of acquiring any one of the first word vectors as the target word vector until all of the first word vectors have been acquired;
S2154: averaging the pending loss values to obtain the first loss value.
This embodiment uses the relative entropy loss function as the preset loss function, which helps draw the vectors closer together. The first loss value is obtained by averaging the pending loss values, so the parameters of the initial model are updated only once per training batch.
For S2151, any one of the first word vectors is acquired and used as the target word vector.
For S2152, the target word vector and the first sentence vector corresponding to the target word vector are input into the preset loss function, and the computed loss is used as a pending loss value.
For S2153, steps S2151 to S2153 are repeated until all of the first word vectors have been acquired.
For S2154, the pending loss values are averaged, and the result is used as the first loss value.
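Steps S2151 to S2154 can be sketched as below. One detail is an assumption: the application does not specify how the vectors are turned into probability distributions for the relative entropy (KL divergence) loss, so a softmax normalisation is used here purely for illustration.

```python
# Sketch of S2151-S2154: compute a relative-entropy (KL) loss for each
# (target word vector, matching first sentence vector) pair, then average
# the pending loss values into one first loss value for the batch.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def kl_divergence(p, q):
    # Relative entropy D(p || q) between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vectors, sentence_vectors):
    pending = [kl_divergence(softmax(w), softmax(s))   # S2151-S2153
               for w, s in zip(word_vectors, sentence_vectors)]
    return sum(pending) / len(pending)                 # S2154: batch average

w = [np.array([0.1, 0.2, 0.3]), np.array([0.3, 0.2, 0.1])]
s = [np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 0.0])]
loss = first_loss(w, s)
```

Averaging before the update is what makes this a once-per-batch parameter update: a single scalar loss drives one backward pass, rather than one pass per sample.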
In one embodiment, before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:
S221: acquiring a plurality of the training samples;
S222: acquiring one of the training samples as a target training sample;
S223: generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
S224: using an initial model to generate a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;
S225: calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next computation of the second sentence vector;
S226: repeating the step of acquiring one of the training samples as a target training sample until a second training objective is reached;
S227: using the initial model that reaches the second training objective as the sentence vector generation model.
This embodiment determines the second word vector from the corpus fragment and the second sentence vector from the corpus fragment definition, then calculates a loss value from the two, thereby narrowing the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to the corpus fragment definition. This improves the accuracy of the sentence vectors generated by the sentence vector generation model. Moreover, no generative NLP training scheme is required, which reduces training difficulty, consumes fewer resources, and makes the model easier to converge.
For S221, the plurality of training samples may be obtained from user input, from a database, or from a third-party application system.
For S222, one of the training samples is acquired and used as the target training sample.
For S223, the corpus fragment in the target training sample is segmented, a word vector is generated for each phrase obtained by segmentation, and the generated word vectors are averaged; the result is used as the second word vector (that is, the corpus fragment word vector).
For S224, the corpus fragment definition in the target training sample is segmented, the phrases obtained by segmentation are input into the initial model for sentence vector generation, and the generated sentence vector is used as the second sentence vector (that is, the corpus fragment definition sentence vector).
For S225, a loss value is calculated from the second word vector and the second sentence vector, and the result is used as the second loss value.
The step of updating the parameters of the initial model according to the second loss value is not described again here.
For S226, steps S222 to S226 are repeated until the second training objective is reached.
The second training objective includes: the second loss value satisfying a third convergence condition, or the number of iterations of the initial model reaching a fourth convergence condition.
The third convergence condition means that the second loss values of two consecutive computations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations refers to the number of times the second loss value has been computed; that is, each computation increases the iteration count by 1.
The fourth convergence condition is a specific numerical value.
For S227, the initial model that reaches the second training objective is a model that meets the expected requirements, so the initial model that reaches the second training objective is used as the sentence vector generation model.
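The per-sample training loop of S222 to S227 can be sketched as below. The `StubModel`, the squared-error loss, and the simple convergence checks are placeholders for illustration only; the application's actual scheme uses a BERT/XLNet-based encoder, a relative entropy loss, and the Lipschitz-based convergence condition described above.

```python
# Sketch of S222-S227: for each target training sample, compute the second
# word vector and the second sentence vector, take a loss between them, and
# update the model once per sample until a training objective is reached.
import numpy as np

class StubModel:
    """Toy stand-in for the initial model: holds one vector and nudges it
    toward the target word vector on each update (stand-in for backprop)."""
    def __init__(self, dim=3):
        self.vec = np.zeros(dim)

    def encode(self, definition):
        return self.vec

    def step(self, target, lr=0.5):
        self.vec += lr * (target - self.vec)

def train(samples, model, word_vec_fn, max_iters=100, tol=1e-6):
    prev = None
    for it in range(max_iters):                          # iteration-count objective
        sample = samples[it % len(samples)]              # S222
        w = word_vec_fn(sample["corpus_fragment"])       # S223: second word vector
        s = model.encode(sample["fragment_definition"])  # S224: second sentence vector
        loss = float(np.sum((w - s) ** 2))               # S225: placeholder loss
        if prev is not None and abs(loss - prev) < tol:  # loss-convergence objective
            break
        model.step(w)                                    # one update per sample
        prev = loss
    return model                                         # S227: trained model

samples = [{"corpus_fragment": "词",
            "fragment_definition": "语言里最小的可以独立运用的单位"}]
word_vec_fn = lambda fragment: np.array([1.0, 2.0, 3.0])
trained = train(samples, StubModel(), word_vec_fn)
```

Note the contrast with the batched variant above: here the parameters change after every sample, whereas S2151 to S2154 average the pending losses and update once per batch.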
Referring to FIG. 2, the present application further provides a sentence vector generation apparatus, the apparatus including:
a data acquisition module 100, configured to acquire target text data;
a sentence vector generation module 200, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
In one embodiment, the above apparatus further includes: a first model training module;
the first model training module is configured to: acquire a plurality of the training samples; acquire a preset batch number of the training samples as a training sample set; generate a word vector from each corpus fragment in the training sample set to obtain a first word vector; use an initial model to generate a sentence vector for each corpus fragment definition in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model; calculate a loss value from each first word vector and each first sentence vector to obtain a first loss value; update the parameters of the initial model according to the first loss value, and use the updated initial model for the next computation of the first sentence vector; repeat the step of acquiring a preset batch number of the training samples as a training sample set until a first training objective is reached; and use the initial model that reaches the first training objective as the sentence vector generation model.
In one embodiment, the above first model training module includes: a training sample generation submodule;
the training sample generation submodule is configured to: acquire dictionary data, the dictionary data including text segments and text segment definitions, a text segment being any one of a single Chinese character, a word, or an idiom, and the text segment definition being an explanation of the text segment; acquire any text segment from the dictionary data as a target text segment; generate the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus fragment of the training sample and the text segment definition corresponding to the target text segment is used as the corpus fragment definition of the training sample; and repeat the step of acquiring any text segment from the dictionary data as a target text segment until the acquisition of the text segments in the dictionary data is complete or a sample generation end signal is received.
In one embodiment, the above first model training module further includes: a first word vector determination submodule;
the first word vector determination submodule is configured to: perform word segmentation on each corpus fragment in the training sample set to obtain a corpus fragment phrase set; use a preset word vector model to generate a word vector for each phrase in each corpus fragment phrase set, obtaining a phrase word vector set; and average each phrase word vector set to obtain the first word vector.
In one embodiment, the above first model training module further includes: a first sentence vector determination submodule;
the first sentence vector determination submodule is configured to: perform word segmentation on each corpus fragment definition in the training sample set to obtain a definition phrase set; and input each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
In one embodiment, the above first model training module further includes: a first loss value determination submodule;
the first loss value determination submodule is configured to: acquire any one of the first word vectors as a target word vector; input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to compute a pending loss value, wherein the preset loss function is a relative entropy loss function; repeat the step of acquiring any one of the first word vectors as a target word vector until all of the first word vectors have been acquired; and average the pending loss values to obtain the first loss value.
In one embodiment, the above apparatus further includes: a second model training module;
the second model training module is configured to: acquire a plurality of the training samples; acquire one of the training samples as a target training sample; generate a word vector from the corpus fragment in the target training sample to obtain a second word vector; use an initial model to generate a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model; calculate a loss value from the second word vector and the second sentence vector to obtain a second loss value; update the parameters of the initial model according to the second loss value, and use the updated initial model for the next computation of the second sentence vector; repeat the step of acquiring one of the training samples as a target training sample until a second training objective is reached; and use the initial model that reaches the second training objective as the sentence vector generation model.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data for the sentence vector generation method. The network interface of the computer device is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a sentence vector generation method. The sentence vector generation method includes: acquiring target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, the storage medium being a volatile storage medium or a non-volatile storage medium. When executed by a processor, the computer program implements a sentence vector generation method including the steps of: acquiring target text data; and inputting the target text data into a sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each training sample including: a corpus fragment and a corpus fragment definition.
In the sentence vector generation method executed above, the target text data is input into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, each including a corpus fragment and a corpus fragment definition. The neural network is thus trained on corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with unsupervised learning methods or methods based on contrastive learning.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Claims (20)

  1. A sentence vector generation method, wherein the method includes:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, each of the training samples including: a corpus fragment and a corpus fragment definition.
  2. The sentence vector generation method according to claim 1, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further includes:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating a word vector from each of the corpus fragments in the training sample set to obtain a first word vector;
    using an initial model to generate a sentence vector for each of the corpus fragment definitions in the training sample set to obtain a first sentence vector, wherein the initial model is a model based on a Bert model or an XLNET model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value; updating parameters of the initial model according to the first loss value, and using the updated initial model for the next computation of the first sentence vector;
    repeating the step of acquiring a preset batch number of the training samples as a training sample set until a first training objective is reached;
    using the initial model that reaches the first training objective as the sentence vector generation model.
  3. 根据权利要求2所述的句子向量生成方法,其中,所述获取多个所述训练样本的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of obtaining a plurality of training samples includes:
    获取词典数据,所述词典数据包括:文本段和文本段定义,文本段包括:单汉字、词语、成语中的任一种,所述文本段定义是对所述文本段的解释说明;Obtaining dictionary data, the dictionary data includes: a text segment and a text segment definition, the text segment includes: any one of a single Chinese character, a word, an idiom, and the text segment definition is an explanation to the text segment;
    从所述词典数据中获取任一个文本段作为目标文本段;Acquiring any text segment from the dictionary data as a target text segment;
    根据所述目标文本段和所述目标文本段对应的所述文本段定义生成所述训练样本,其中,将所述目标文本段作为所述训练样本的所述语料片段,将所述目标文本段对应的所述文本段定义作为所述训练样本的所述语料片段定义;Generate the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the target text segment is The corresponding text segment is defined as the corpus segment definition of the training sample;
    重复执行所述从所述词典数据中获取任一个文本段作为目标文本段的步骤,直至完成所述词典数据中的所述文本段的获取或者获取到样本生成结束信号。Repeating the step of acquiring any text segment from the dictionary data as the target text segment until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
  4. 根据权利要求2所述的句子向量生成方法,其中,所述根据所述训练样本集中的每个所述语料片段进行词向量生成,得到第一词向量的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of generating word vectors according to each of the corpus fragments in the training sample set to obtain the first word vectors includes:
    对所述训练样本集中的每个所述语料片段进行分词处理,得到语料片段短语集;Perform word segmentation processing on each of the corpus fragments in the training sample set to obtain a phrase set of corpus fragments;
    采用预设词向量模型,对每个所述语料片段短语集中的各个短语进行词向量生成,得到短语词向量集;Using a preset word vector model to generate a word vector for each phrase in each of the corpus fragment phrase sets, to obtain a phrase word vector set;
    对每个所述短语词向量集进行平均值计算,得到所述第一词向量。performing average calculation on each set of phrase word vectors to obtain the first word vector.
  5. 根据权利要求2所述的句子向量生成方法,其中,所述采用初始模型,对所述训练样本集中的每个所述语料片段定义进行句子向量生成,得到第一句子向量的步骤,包括:The method for generating sentence vectors according to claim 2, wherein the step of using the initial model to generate sentence vectors for each of the corpus fragment definitions in the training sample set to obtain a first sentence vector includes:
    对所述训练样本集中的每个所述语料片段定义进行分词处理,得到定义短语集;Perform word segmentation processing on each of the corpus fragment definitions in the training sample set to obtain a defined phrase set;
    将每个所述定义短语集输入所述初始模型进行句子向量生成,得到所述第一句子向量。Inputting each defined phrase set into the initial model to generate sentence vectors to obtain the first sentence vectors.
  6. The sentence vector generation method according to claim 2, wherein the step of calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain the first loss value comprises:
    acquiring any one of the first word vectors as a target word vector;
    inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
    repeating the step of acquiring any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed;
    averaging the pending loss values to obtain the first loss value.
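The loss of claim 6 — a relative entropy (KL divergence) per (word vector, sentence vector) pair, averaged over pairs — can be sketched as below. One assumption is made that the claim leaves open: the raw vectors are first normalized into probability distributions with a softmax so that KL divergence is well-defined.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def kl_loss(word_vec, sent_vec):
    """Relative entropy KL(p || q) between the softmax-normalized vectors."""
    p, q = softmax(word_vec), softmax(sent_vec)
    return float(np.sum(p * np.log(p / q)))

def first_loss(word_vecs, sent_vecs):
    # one pending loss per corresponding pair, then the average
    losses = [kl_loss(w, s) for w, s in zip(word_vecs, sent_vecs)]
    return sum(losses) / len(losses)

w = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]
s = [np.array([1.0, 2.0]), np.array([2.0, 0.0])]
loss = first_loss(w, s)
```

When a word vector and its sentence vector agree, the pending loss is zero; training drives the averaged loss toward zero, pulling sentence vectors toward the word vectors of their fragments.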
  7. The sentence vector generation method according to claim 1, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring one of the training samples as a target training sample;
    generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
    generating, with an initial model, a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next calculation of the second sentence vector;
    repeating the step of acquiring one of the training samples as the target training sample until a second training target is reached;
    using the initial model that has reached the second training target as the sentence vector generation model.
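The per-sample training loop of claim 7 can be sketched with its control flow only; the encoders, loss function, and update rule below are hypothetical placeholders, since the claim fixes the loop structure rather than those components.

```python
def train(samples, encode_fragment, model, loss_fn, update, max_steps=100):
    """One gradient-style pass, one sample at a time (claim 7)."""
    for step, sample in enumerate(samples, start=1):
        word_vec = encode_fragment(sample["fragment"])   # second word vector
        sent_vec = model(sample["definition"])           # second sentence vector
        loss = loss_fn(word_vec, sent_vec)               # second loss value
        model = update(model, loss)                      # updated model is used next time
        if step >= max_steps:                            # stand-in second training target
            break
    return model

# toy demo: a scalar "model" whose scale shrinks on each update
class ScaleModel:
    def __init__(self, s): self.s = s
    def __call__(self, text): return self.s * len(text)

trained = train(
    samples=[{"fragment": "学习", "definition": "获得知识"}],
    encode_fragment=len,
    model=ScaleModel(1.0),
    loss_fn=lambda w, s: abs(w - s),
    update=lambda m, loss: ScaleModel(m.s * 0.9),
)
```

In the claimed method the training target might be a loss threshold or an iteration budget; here a step cap stands in for it.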
  8. A sentence vector generation apparatus, wherein the apparatus comprises:
    a data acquisition module, configured to acquire target text data;
    a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  9. A computer device, comprising:
    one or more processors;
    a memory; and
    one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform a sentence vector generation method,
    wherein the sentence vector generation method comprises:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  10. The computer device according to claim 9, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating word vectors from each of the corpus fragments in the training sample set to obtain first word vectors;
    generating, with an initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain first sentence vectors, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;
    repeating the step of acquiring a preset batch number of the training samples as the training sample set until a first training target is reached;
    using the initial model that has reached the first training target as the sentence vector generation model.
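Unlike the one-sample-at-a-time loop of claim 7, claims 10 and 17 train on a preset batch of samples per iteration. The control flow can be sketched as follows, again with hypothetical placeholder encoders, loss, and update rule:

```python
import random

def train_batched(samples, batch_size, encode_fragments, model, loss_fn,
                  update, target_loss=0.01, max_iters=50):
    """Batch-based training loop of claims 10/17 (placeholder components)."""
    loss = float("inf")
    for _ in range(max_iters):
        batch = random.sample(samples, batch_size)           # training sample set
        word_vecs = encode_fragments([s["fragment"] for s in batch])
        sent_vecs = model([s["definition"] for s in batch])
        loss = loss_fn(word_vecs, sent_vecs)                 # first loss value
        model = update(model, loss)                          # used for the next batch
        if loss <= target_loss:                              # first training target
            break
    return model, loss

# toy demo: a scalar model whose scale is nudged toward 1 on each update
class ScaleEncoder:
    def __init__(self, s): self.s = s
    def __call__(self, texts): return [self.s * len(t) for t in texts]

samples = [{"fragment": "学习", "definition": "学习"}] * 4
model, final_loss = train_batched(
    samples, batch_size=2,
    encode_fragments=lambda texts: [float(len(t)) for t in texts],
    model=ScaleEncoder(2.0),
    loss_fn=lambda ws, ss: sum(abs(w - s) for w, s in zip(ws, ss)) / len(ws),
    update=lambda m, loss: ScaleEncoder((m.s + 1.0) / 2),
)
```

Each iteration draws a fresh batch, so the "first training target" here is a loss threshold checked once per batch; an iteration cap guards against non-convergence.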
  11. The computer device according to claim 10, wherein the step of acquiring a plurality of the training samples comprises:
    acquiring dictionary data, the dictionary data comprising text segments and text segment definitions, wherein a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the corresponding text segment;
    acquiring any text segment from the dictionary data as a target text segment;
    generating the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment serves as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment serves as the corpus fragment definition of the training sample;
    repeating the step of acquiring any text segment from the dictionary data as the target text segment, until the acquisition of the text segments in the dictionary data is completed or a sample-generation end signal is received.
  12. The computer device according to claim 10, wherein the step of generating word vectors from each of the corpus fragments in the training sample set to obtain the first word vector comprises:
    performing word segmentation on each of the corpus fragments in the training sample set to obtain a corpus fragment phrase set;
    generating, with a preset word vector model, a word vector for each phrase in each corpus fragment phrase set to obtain a phrase word vector set;
    averaging each phrase word vector set to obtain the first word vector.
  13. The computer device according to claim 10, wherein the step of generating, with the initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain the first sentence vector comprises:
    performing word segmentation on each of the corpus fragment definitions in the training sample set to obtain a definition phrase set;
    inputting each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
  14. The computer device according to claim 10, wherein the step of calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain the first loss value comprises:
    acquiring any one of the first word vectors as a target word vector;
    inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function for loss value calculation to obtain a pending loss value, wherein the preset loss function is a relative entropy (KL divergence) loss function;
    repeating the step of acquiring any one of the first word vectors as the target word vector until the acquisition of the first word vectors is completed;
    averaging the pending loss values to obtain the first loss value.
  15. The computer device according to claim 10, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring one of the training samples as a target training sample;
    generating a word vector from the corpus fragment in the target training sample to obtain a second word vector;
    generating, with an initial model, a sentence vector for the corpus fragment definition in the target training sample to obtain a second sentence vector, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from the second word vector and the second sentence vector to obtain a second loss value, updating the parameters of the initial model according to the second loss value, and using the updated initial model for the next calculation of the second sentence vector;
    repeating the step of acquiring one of the training samples as the target training sample until a second training target is reached;
    using the initial model that has reached the second training target as the sentence vector generation model.
  16. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements a sentence vector generation method, the sentence vector generation method comprising the following steps:
    acquiring target text data;
    inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each of the training samples comprises a corpus fragment and a corpus fragment definition.
  17. The computer-readable storage medium according to claim 16, wherein before the step of inputting the target text data into the sentence vector generation model for sentence vector generation to obtain the target sentence vector corresponding to the target text data, the method further comprises:
    acquiring a plurality of the training samples;
    acquiring a preset batch number of the training samples as a training sample set;
    generating word vectors from each of the corpus fragments in the training sample set to obtain first word vectors;
    generating, with an initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain first sentence vectors, wherein the initial model is a model based on the BERT model or the XLNet model;
    calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value, updating the parameters of the initial model according to the first loss value, and using the updated initial model for the next calculation of the first sentence vectors;
    repeating the step of acquiring a preset batch number of the training samples as the training sample set until a first training target is reached;
    using the initial model that has reached the first training target as the sentence vector generation model.
  18. The computer-readable storage medium according to claim 17, wherein the step of acquiring a plurality of the training samples comprises:
    acquiring dictionary data, the dictionary data comprising text segments and text segment definitions, wherein a text segment is any one of a single Chinese character, a word, or an idiom, and a text segment definition is an explanation of the corresponding text segment;
    acquiring any text segment from the dictionary data as a target text segment;
    generating the training sample from the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment serves as the corpus fragment of the training sample, and the text segment definition corresponding to the target text segment serves as the corpus fragment definition of the training sample;
    repeating the step of acquiring any text segment from the dictionary data as the target text segment, until the acquisition of the text segments in the dictionary data is completed or a sample-generation end signal is received.
  19. The computer-readable storage medium according to claim 17, wherein the step of generating word vectors from each of the corpus fragments in the training sample set to obtain the first word vector comprises:
    performing word segmentation on each of the corpus fragments in the training sample set to obtain a corpus fragment phrase set;
    generating, with a preset word vector model, a word vector for each phrase in each corpus fragment phrase set to obtain a phrase word vector set;
    averaging each phrase word vector set to obtain the first word vector.
  20. The computer-readable storage medium according to claim 17, wherein the step of generating, with the initial model, a sentence vector for each of the corpus fragment definitions in the training sample set to obtain the first sentence vector comprises:
    performing word segmentation on each of the corpus fragment definitions in the training sample set to obtain a definition phrase set;
    inputting each definition phrase set into the initial model for sentence vector generation to obtain the first sentence vector.
PCT/CN2022/090157 2021-10-26 2022-04-29 Sentence vector generation method and apparatus, device, and storage medium WO2023071115A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111250467.8 2021-10-26
CN202111250467.8A CN113935315A (en) 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023071115A1 true WO2023071115A1 (en) 2023-05-04

Family

ID=79284360

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090157 WO2023071115A1 (en) 2021-10-26 2022-04-29 Sentence vector generation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113935315A (en)
WO (1) WO2023071115A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium
CN114519395B (en) * 2022-02-22 2024-05-14 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960804A (en) * 2019-03-21 2019-07-02 江西风向标教育科技有限公司 A kind of topic text sentence vector generation method and device
WO2020106267A1 (en) * 2018-11-19 2020-05-28 Genesys Telecommunications Laboratories, Inc. Method and system for sentiment analysis
CN111222329A (en) * 2019-12-10 2020-06-02 上海八斗智能技术有限公司 Sentence vector training method and model, and sentence vector prediction method and system
CN111709223A (en) * 2020-06-02 2020-09-25 上海硬通网络科技有限公司 Method and device for generating sentence vector based on bert and electronic equipment
CN112016296A (en) * 2020-09-07 2020-12-01 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN113935315A (en) 2022-01-14

Similar Documents

Publication Publication Date Title
WO2022104967A1 (en) Pre-training language model-based summarization generation method
WO2022057776A1 (en) Model compression method and apparatus
CN110069790B (en) Machine translation system and method for contrasting original text through translated text retranslation
CN111859987B (en) Text processing method, training method and device for target task model
WO2023071115A1 (en) Sentence vector generation method and apparatus, device, and storage medium
CN109492202B (en) Chinese error correction method based on pinyin coding and decoding model
CN110931137B (en) Machine-assisted dialog systems, methods, and apparatus
WO2021044908A1 (en) Translation device, translation method, and program
CN111666775B (en) Text processing method, device, equipment and storage medium
WO2023159758A1 (en) Data enhancement method and apparatus, electronic device, and storage medium
US20230023789A1 (en) Method for identifying noise samples, electronic device, and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
US20220058349A1 (en) Data processing method, device, and storage medium
US20210174003A1 (en) Sentence encoding and decoding method, storage medium, and device
CN112016300A (en) Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium
CN111985218A (en) Automatic judicial literature proofreading method based on generation of confrontation network
US20200364543A1 (en) Computationally efficient expressive output layers for neural networks
CN111951785B (en) Voice recognition method and device and terminal equipment
EP4109443A2 (en) Method for correcting text, method for generating text correction model, device and medium
US20230205994A1 (en) Performing machine learning tasks using instruction-tuned neural networks
CN114330375A (en) Term translation method and system based on fixed paradigm
Chen et al. Reinforced zero-shot cross-lingual neural headline generation
WO2021237928A1 (en) Training method and apparatus for text similarity recognition model, and related device
Jiang et al. English-Vietnamese machine translation model based on sequence to sequence algorithm
CN111858899A (en) Statement processing method, device, system and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885029

Country of ref document: EP

Kind code of ref document: A1