CN113935315A - Sentence vector generation method, device, equipment and storage medium - Google Patents

Sentence vector generation method, device, equipment and storage medium

Info

Publication number
CN113935315A
Authority
CN
China
Prior art keywords
sentence vector
training
target
model
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111250467.8A
Other languages
Chinese (zh)
Inventor
陈浩
谯轶轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202111250467.8A
Publication of CN113935315A
Priority to PCT/CN2022/090157 (WO2023071115A1)
Legal status: Pending (Current)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The application relates to the technical field of artificial intelligence, and discloses a sentence vector generation method, apparatus, device and storage medium, wherein the method comprises the following steps: acquiring target text data; inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, and each training sample comprises: a corpus fragment and a corpus fragment definition. In this way, the neural network is trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.

Description

Sentence vector generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a sentence vector generation method, apparatus, device, and storage medium.
Background
In various application scenarios of current Natural Language Processing (NLP), a sentence vector encodes the text information of a sentence into a fixed dense vector space. Sentence vectors play an important role in many NLP tasks, such as classification, clustering, and sentence similarity measurement.
Existing sentence vector construction methods include unsupervised learning methods and contrastive-learning-based methods. The unsupervised learning method requires a large amount of corpora during training and its model is difficult to converge, so it is gradually being abandoned. The contrastive-learning-based method mainly trains the model by constructing positive and negative samples; its difficulty is that text data is discontinuous, discrete data, so positive samples cannot be constructed simply by flipping and cropping as with image data, and only high-quality positive samples can train a good contrastive learning model. The method is therefore difficult to apply widely.
Disclosure of Invention
The main purpose of the present application is to provide a sentence vector generation method, apparatus, device and storage medium, so as to solve the technical problems in the prior art that sentence vectors are constructed with an unsupervised learning method or a contrastive-learning-based method: the unsupervised learning method requires a large amount of corpora and its model is difficult to converge, while the contrastive-learning-based method requires high-quality positive samples.
In order to achieve the above object, the present application provides a sentence vector generation method, including:
acquiring target text data;
inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network by adopting a plurality of training samples, and each training sample comprises: corpus fragment and corpus fragment definitions.
Further, before the step of inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, the method further includes:
obtaining a plurality of the training samples;
acquiring training samples of a preset batch number as a training sample set;
generating a word vector according to each corpus fragment in the training sample set to obtain a first word vector;
sentence vector generation is carried out on each corpus fragment definition in the training sample set by adopting an initial model to obtain a first sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
calculating loss values according to the first word vectors and the first sentence vectors to obtain first loss values, updating parameters of the initial model according to the first loss values, and using the updated initial model for calculating the first sentence vectors next time;
repeatedly executing the step of obtaining the training samples with the preset batch number as a training sample set until a first training target is reached;
using the initial model reaching the first training goal as the sentence vector generation model.
Further, the step of obtaining a plurality of training samples includes:
obtaining dictionary data, the dictionary data comprising: a text segment and a text segment definition, the text segment comprising: any one of a single Chinese character, a word and an idiom, and the text segment definition being an explanatory description of the text segment;
acquiring any text segment from the dictionary data as a target text segment;
generating the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the text segment definition corresponding to the target text segment is used as the corpus segment definition of the training sample;
and repeatedly executing the step of acquiring any text segment from the dictionary data as a target text segment until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
Further, the step of generating a word vector according to each corpus fragment in the training sample set to obtain a first word vector includes:
performing word segmentation processing on each corpus fragment in the training sample set to obtain a corpus fragment phrase set;
generating word vectors for all phrases in each corpus fragment phrase set by adopting a preset word vector model to obtain a phrase word vector set;
and carrying out average value calculation on each phrase word vector set to obtain the first word vector.
Further, the step of generating a sentence vector for each corpus fragment definition in the training sample set by using the initial model to obtain a first sentence vector includes:
performing word segmentation processing on each corpus fragment definition in the training sample set to obtain a definition phrase set;
and inputting each definition phrase set into the initial model to generate a sentence vector, so as to obtain the first sentence vector.
Further, the step of calculating a loss value according to each first word vector and each first sentence vector to obtain a first loss value includes:
acquiring any one of the first word vectors as a target word vector;
inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to perform loss value calculation, so as to obtain a to-be-processed loss value, wherein the preset loss function adopts a relative entropy loss function;
repeatedly executing the step of obtaining any one first word vector as a target word vector until the acquisition of the first word vectors is completed;
and calculating the average value of the loss values to be processed to obtain the first loss value.
Further, before the step of inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, the method further includes:
obtaining a plurality of the training samples;
acquiring one training sample as a target training sample;
generating word vectors according to the corpus segments in the target training sample to obtain second word vectors;
sentence vector generation is carried out on the corpus segment definition in the target training sample by adopting an initial model to obtain a second sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
calculating a loss value according to the second word vector and the second sentence vector to obtain a second loss value, updating parameters of the initial model according to the second loss value, and using the updated initial model for calculating the second sentence vector next time;
repeating the step of obtaining one of the training samples as a target training sample until a second training target is reached;
using the initial model reaching the second training goal as the sentence vector generation model.
The present application further proposes a sentence vector generation apparatus, the apparatus comprising:
the data acquisition module is used for acquiring target text data;
a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each training sample includes: corpus fragment and corpus fragment definitions.
The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
The method includes inputting target text data into a sentence vector generation model to generate a sentence vector, and obtaining a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each training sample comprises: a corpus fragment and a corpus fragment definition. In this way, the neural network is trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.
Drawings
Fig. 1 is a flowchart illustrating a sentence vector generation method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating a sentence vector generation apparatus according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a sentence vector generation method, where the method includes:
s1: acquiring target text data;
s2: inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network by adopting a plurality of training samples, and each training sample comprises: corpus fragment and corpus fragment definitions.
In this embodiment, the target text data is input into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each training sample includes: a corpus fragment and a corpus fragment definition. The neural network is thus trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.
For S1, the target text data input by the user may be obtained, the target text data may be obtained from a database, or the target text data may be obtained from a third-party application system.
The target text data is the text data for which a sentence vector needs to be generated.
When the sentence vector generated by the application is applied to the book recommendation scene, the target text data can be one or more of a title, an abstract and keywords in a book.
For S2, the target text data is input into a sentence vector generation model to generate a sentence vector, and the generated sentence vector is used as the target sentence vector corresponding to the target text data.
Wherein the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and the neural network includes but is not limited to: the BERT (Bidirectional Encoder Representations from Transformers) model and the XLNet (generalized autoregressive pretraining) model.
Each of the training samples comprises: a corpus fragment and a corpus fragment definition. The corpus fragment includes one or more words. The corpus fragment definition is the explanation of the corpus fragment. That is, the corpus fragment and the corpus fragment definition constitute a text pair. Training the neural network with such text pairs shortens the distance between the corpus fragment and its corpus fragment definition, thereby improving the accuracy of sentence vectors.
The method for obtaining the sentence vector generation model by training a neural network with a plurality of training samples comprises: generating word vectors from the corpus fragments in the training samples to obtain corpus fragment word vectors; generating sentence vectors for the corpus fragment definitions in the training samples with an initial model obtained based on the neural network, to obtain corpus fragment definition sentence vectors; and training the initial model according to the corpus fragment word vectors and the corpus fragment definition sentence vectors, and taking the trained initial model as the sentence vector generation model. In this way, the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its definition is shortened, which improves the accuracy of the sentence vectors generated by the model; a generation-based NLP training mode is not needed, so training difficulty is reduced, fewer resources are occupied, and the model converges easily.
Optionally, when the neural network is trained by using a plurality of training samples, the learning rate is 0.0005.
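As an illustration of S1 and S2, a minimal inference sketch is given below. It is not the reference implementation of the present application; the model path, the use of the Hugging Face transformers API, and mean pooling over token embeddings are assumptions made for the example.

```python
# Hedged sketch of S1/S2: load an already-trained sentence vector generation
# model and encode target text data into a target sentence vector.
import torch
from transformers import BertModel, BertTokenizer

MODEL_PATH = "path/to/sentence-vector-model"  # hypothetical path to the trained model
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertModel.from_pretrained(MODEL_PATH)
model.eval()

def generate_sentence_vector(target_text: str) -> torch.Tensor:
    inputs = tokenizer(target_text, return_tensors="pt", truncation=True)  # S1: target text data
    with torch.no_grad():
        outputs = model(**inputs)
    # S2: mean-pool the token embeddings as the target sentence vector; the
    # pooling strategy is not fixed by the application, this is one common choice.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = generate_sentence_vector("书籍推荐场景下的标题文本")
print(vector.shape)  # e.g. torch.Size([768]) for a base-size encoder
```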
In an embodiment, before the step of inputting the target text data into a sentence vector generation model to generate a sentence vector to obtain a target sentence vector corresponding to the target text data, the method further includes:
s211: obtaining a plurality of the training samples;
s212: acquiring training samples of a preset batch number as a training sample set;
s213: generating a word vector according to each corpus fragment in the training sample set to obtain a first word vector;
s214: sentence vector generation is carried out on each corpus fragment definition in the training sample set by adopting an initial model to obtain a first sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
s215: calculating loss values according to the first word vectors and the first sentence vectors to obtain first loss values, updating parameters of the initial model according to the first loss values, and using the updated initial model for calculating the first sentence vectors next time;
s216: repeatedly executing the step of obtaining the training samples with the preset batch number as a training sample set until a first training target is reached;
s217: using the initial model reaching the first training goal as the sentence vector generation model.
This embodiment determines the first word vectors from the corpus fragments and the first sentence vectors from the corpus fragment definitions, and calculates the loss value from each first word vector and each first sentence vector, thereby shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its corpus fragment definition and improving the accuracy of the sentence vectors generated by the sentence vector generation model; a generation-based NLP training mode is not needed, so training difficulty is reduced, fewer resources are occupied, and the model converges easily. Moreover, training in batches with the training sample set prevents abnormal training samples from excessively influencing the parameters of the initial model, which improves training accuracy.
For S211, a plurality of training samples input by the user may be obtained, or a plurality of training samples may be obtained from a database, or a plurality of training samples may be obtained from a third-party application system.
For step S212, a preset batch number of training samples are obtained from the plurality of training samples, and the obtained training samples are taken together as a training sample set.
Optionally, the preset batch number is set to 64. It is understood that the preset batch number may also be set to other values, and is not limited herein.
For S213, generating a word vector according to each corpus fragment in the training sample set, and using each generated word vector as a first word vector (i.e., a corpus fragment word vector). That is, each of the corpus segments corresponds to a first word vector.
For S214, using the initial model, generating a sentence vector for each corpus fragment definition in the training sample set, and using the generated sentence vector as a first sentence vector (i.e. the corpus fragment definition sentence vector). That is, each of the corpus fragment definitions corresponds to a first sentence vector.
The initial model is a model obtained based on a Bert model or an XLNET model, and it can be understood that the initial model may also adopt other models, which is not limited herein.
For S215, a loss value is calculated according to each first word vector and each first sentence vector to obtain a first loss value, thereby realizing that a first loss value is calculated for each batch.
The number of first word vectors is the same as the preset batch number, and the number of first sentence vectors is likewise the same as the preset batch number.
The method steps for updating the parameters of the initial model according to the first loss value are not described herein again.
And using the updated initial model for calculating the first sentence vector next time, thereby realizing the iterative update of the initial model.
For S216, steps S212 to S216 are repeatedly performed until the first training target is reached.
The first training objective includes: the first loss value reaches a first convergence condition or the number of iterations of the initial model reaches a second convergence condition.
The first convergence condition means that the first loss values calculated in two adjacent iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations refers to the number of times the first loss value has been calculated; that is, each calculation of the first loss value increases the iteration count by 1.
The second convergence condition is a specific numerical value.
For S217, the initial model that achieves the first training goal is a model that meets expected requirements, and thus the initial model that achieves the first training goal is taken as the sentence vector generation model.
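A hedged sketch of the batch training procedure S211 to S217 follows. The Adam optimizer, the softmax projection used before the relative entropy loss, and the assumption that the precomputed word vectors share the encoder's hidden size (in practice a linear projection may be needed) are illustrative choices, not requirements of the present application.

```python
# Sketch of S211–S217: batch training so that sentence vectors of corpus
# fragment definitions move toward word vectors of corpus fragments.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

BATCH_SIZE = 64       # the "preset batch number" suggested above
LEARNING_RATE = 5e-4  # the 0.0005 learning rate mentioned above

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed basis of the initial model
model = BertModel.from_pretrained("bert-base-chinese")
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

def encode_definitions(definitions):
    """S214: first sentence vectors for a batch of corpus fragment definitions."""
    inputs = tokenizer(definitions, return_tensors="pt", padding=True, truncation=True)
    return model(**inputs).last_hidden_state.mean(dim=1)

def train(samples, first_word_vectors, epochs=1):
    """samples: list of (corpus fragment, definition) pairs; first_word_vectors:
    tensor of precomputed word-vector averages (S213), one row per sample."""
    model.train()
    for _ in range(epochs):  # stand-in for "until the first training target is reached"
        for start in range(0, len(samples), BATCH_SIZE):  # S212: one batch as a training sample set
            batch = samples[start:start + BATCH_SIZE]
            word_vecs = first_word_vectors[start:start + BATCH_SIZE]
            sent_vecs = encode_definitions([definition for _, definition in batch])
            # S215: relative entropy (KL) loss averaged over the batch (first loss value).
            loss = F.kl_div(F.log_softmax(sent_vecs, dim=-1),
                            F.softmax(word_vecs, dim=-1),
                            reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # update the initial model's parameters
```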
In an embodiment, the step of obtaining a plurality of training samples includes:
s2111: obtaining dictionary data, the dictionary data comprising: a text segment and a text segment definition, the text segment comprising: any one of a single Chinese character, a word and an idiom, and the text segment definition being an explanatory description of the text segment;
s2112: acquiring any text segment from the dictionary data as a target text segment;
s2113: generating the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the text segment definition corresponding to the target text segment is used as the corpus segment definition of the training sample;
s2114: and repeatedly executing the step of acquiring any text segment from the dictionary data as a target text segment until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
The embodiment realizes that the training sample is determined according to the dictionary data, has the advantages of high accuracy, high stability and easy acquisition, provides a basis for improving the accuracy of the distance between the word vector corresponding to the corpus fragment and the sentence vector corresponding to the corpus fragment definition, and further improves the accuracy of the sentence vector generated by the sentence vector generation model.
For S2111, dictionary data input by the user may be acquired, the dictionary data may be acquired from a database, or the dictionary data may be acquired from a third-party application system.
Optionally, the dictionary data is obtained from the Xinhua Dictionary. It is understood that the dictionary data may also be obtained from other dictionaries, such as an English dictionary or other Chinese dictionaries, which is not limited herein.
For S2112, any text segment is acquired from the dictionary data, and the acquired text segment is taken as a target text segment.
For S2113, the training sample is generated from the target text segment and the text segment definition corresponding to the target text segment, so that the target text segment and its corresponding interpretation are used as a text pair, and the text pair is used as the training sample. No calibration data for the sentence vectors needs to be determined, which simplifies the generation of training samples.
For S2114, steps S2112 to S2114 are repeatedly executed until the acquisition of the text segments in the dictionary data is completed or a sample generation end signal is acquired. When the acquisition of the text segments in the dictionary data is completed, training samples have been generated for all data in the dictionary data. When a sample generation end signal is acquired, a sufficient number of training samples have been generated.
The sample generation end signal is generated, according to preset conditions, by the program implementing the present application. For example, the preset condition may be a preset number of samples, which is not limited here.
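For illustration, a minimal sketch of S2111 to S2114 is given below; modeling the dictionary data as a plain mapping from text segment to definition and using a sample cap as a stand-in for the sample generation end signal are assumptions of the example.

```python
# Sketch of S2111–S2114: build (corpus fragment, corpus fragment definition)
# text pairs from dictionary data.
def build_training_samples(dictionary_data, max_samples=None):
    samples = []
    for text_segment, definition in dictionary_data.items():
        # S2112/S2113: the target text segment becomes the corpus fragment;
        # its explanatory description becomes the corpus fragment definition.
        samples.append((text_segment, definition))
        if max_samples is not None and len(samples) >= max_samples:
            break  # stand-in for acquiring a sample generation end signal
    return samples  # S2114 ends when the dictionary is exhausted

dictionary_data = {"学习": "从阅读、听讲、研究、实践中获得知识或技能。"}  # toy dictionary entry
print(build_training_samples(dictionary_data))
```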
In an embodiment, the step of generating a word vector according to each corpus fragment in the training sample set to obtain a first word vector includes:
s2131: performing word segmentation processing on each corpus fragment in the training sample set to obtain a corpus fragment phrase set;
s2132: generating word vectors for all phrases in each corpus fragment phrase set by adopting a preset word vector model to obtain a phrase word vector set;
s2133: and carrying out average value calculation on each phrase word vector set to obtain the first word vector.
This embodiment performs word segmentation on the corpus fragments, generates a word vector for each phrase, and averages the word vectors to obtain the corpus fragment word vectors, thereby providing a basis for shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its corpus fragment definition.
For S2131, performing word segmentation on each corpus fragment in the training sample set, and taking all phrases obtained by word segmentation as a corpus fragment phrase set. That is, the corpus fragment phrase sets correspond to the corpus fragments in the training sample set one to one.
For step S2132, a preset word vector model is adopted to generate a word vector for each phrase in each corpus fragment phrase set, and the generated word vectors form a phrase word vector set. That is, the corpus fragments in the training sample set correspond one-to-one to the phrase word vector sets.
Optionally, the preset word vector model adopts a pre-trained Chinese GloVe (Global Vectors for Word Representation) word vector model.
For S2133, average value calculation is performed on each phrase word vector set, and the calculated data is taken as the first word vector.
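A hedged sketch of S2131 to S2133 follows. The jieba segmenter, the gensim loader, and the word2vec-format GloVe file name are assumptions; the application only requires some word segmentation tool and a pre-trained word vector model.

```python
# Sketch of S2131–S2133: segment a corpus fragment, generate a word vector per
# phrase with a pre-trained Chinese GloVe model, then average.
import jieba
import numpy as np
from gensim.models import KeyedVectors

glove = KeyedVectors.load_word2vec_format("zh_glove.txt")  # hypothetical vector file

def first_word_vector(corpus_fragment: str) -> np.ndarray:
    phrases = jieba.lcut(corpus_fragment)                # S2131: corpus fragment phrase set
    vectors = [glove[p] for p in phrases if p in glove]  # S2132: phrase word vector set
    if not vectors:  # out-of-vocabulary fallback, an assumption of this sketch
        return np.zeros(glove.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)                      # S2133: average value calculation
```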
In an embodiment, the step of generating a sentence vector for each corpus fragment definition in the training sample set by using the initial model to obtain a first sentence vector includes:
s2141: performing word segmentation processing on each corpus fragment definition in the training sample set to obtain a definition phrase set;
s2142: and inputting each definition phrase set into the initial model to generate a sentence vector, so as to obtain the first sentence vector.
This embodiment performs word segmentation and sentence vector generation on the corpus fragment definitions, thereby providing a basis for shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its corpus fragment definition.
For S2141, word segmentation is performed on each corpus fragment definition in the training sample set, and all phrases obtained by word segmentation are taken as a definition phrase set. That is, the definition phrase sets correspond one-to-one to the corpus fragment definitions in the training sample set.
For S2142, each definition phrase set is input into the initial model to generate a sentence vector, and the generated sentence vector is taken as the first sentence vector. That is, the first sentence vectors correspond one-to-one to the corpus fragment definitions in the training sample set.
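A minimal sketch of S2141 and S2142 is given below, assuming a Chinese BERT encoder as the initial model; feeding the pre-segmented definition phrase set via is_split_into_words and mean pooling are illustrative choices.

```python
# Sketch of S2141–S2142: segment a corpus fragment definition and feed the
# definition phrase set to the initial model to obtain the first sentence vector.
import jieba
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
initial_model = BertModel.from_pretrained("bert-base-chinese")

def first_sentence_vector(fragment_definition: str) -> torch.Tensor:
    phrases = jieba.lcut(fragment_definition)  # S2141: definition phrase set
    inputs = tokenizer(phrases, is_split_into_words=True, return_tensors="pt")
    outputs = initial_model(**inputs)          # wrap in torch.no_grad() outside training
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # S2142: first sentence vector
```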
In an embodiment, the step of calculating a loss value according to each first word vector and each first sentence vector to obtain a first loss value includes:
s2151: acquiring any one of the first word vectors as a target word vector;
s2152: inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to perform loss value calculation, so as to obtain a to-be-processed loss value, wherein the preset loss function adopts a relative entropy loss function;
s2153: repeatedly executing the step of obtaining any one first word vector as a target word vector until the acquisition of the first word vectors is completed;
s2154: and calculating the average value of the loss values to be processed to obtain the first loss value.
In this embodiment, a relative entropy loss function is used as the preset loss function, which helps reduce the distance between vectors; the first loss value is obtained by averaging the to-be-processed loss values, so that the parameters of the initial model are updated only once per batch of training.
For S2151, any one of the first word vectors is acquired, and the acquired first word vector is taken as a target word vector.
For step S2152, the target word vector and the first sentence vector corresponding to the target word vector are input into a preset loss function to be subjected to loss value calculation, and the calculated loss value is used as a to-be-processed loss value.
For S2153, steps S2151 to S2153 are repeatedly performed until the acquisition of the first word vector is completed.
For S2154, an average value of each of the to-be-processed loss values is calculated, and the calculated data is used as the first loss value.
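The loop S2151 to S2154 can be sketched as follows; projecting both vectors onto probability distributions with softmax before applying the relative entropy (KL divergence) loss is an assumption that the application leaves open.

```python
# Sketch of S2151–S2154: per-pair relative entropy losses, then their average
# as the first loss value.
import torch
import torch.nn.functional as F

def first_loss_value(first_word_vectors, first_sentence_vectors):
    pending_losses = []
    for word_vec, sent_vec in zip(first_word_vectors, first_sentence_vectors):
        # S2151/S2152: a target word vector paired with its first sentence vector.
        loss = F.kl_div(F.log_softmax(sent_vec, dim=-1),
                        F.softmax(word_vec, dim=-1),
                        reduction="sum")        # to-be-processed loss value
        pending_losses.append(loss)
    return torch.stack(pending_losses).mean()   # S2154: average -> first loss value
```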
In an embodiment, before the step of inputting the target text data into a sentence vector generation model to generate a sentence vector to obtain a target sentence vector corresponding to the target text data, the method further includes:
s221: obtaining a plurality of the training samples;
s222: acquiring one training sample as a target training sample;
s223: generating word vectors according to the corpus segments in the target training sample to obtain second word vectors;
s224: sentence vector generation is carried out on the corpus segment definition in the target training sample by adopting an initial model to obtain a second sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
s225: calculating a loss value according to the second word vector and the second sentence vector to obtain a second loss value, updating parameters of the initial model according to the second loss value, and using the updated initial model for calculating the second sentence vector next time;
s226: repeating the step of obtaining one of the training samples as a target training sample until a second training target is reached;
s227: using the initial model reaching the second training goal as the sentence vector generation model.
This embodiment determines the second word vector from the corpus fragment and the second sentence vector from the corpus fragment definition, and calculates the loss value from the second word vector and the second sentence vector, thereby shortening the distance between the word vector corresponding to a corpus fragment and the sentence vector corresponding to its corpus fragment definition and improving the accuracy of the sentence vectors generated by the sentence vector generation model; a generation-based NLP training mode is not needed, so training difficulty is reduced, fewer resources are occupied, and the model converges easily.
For S221, a plurality of training samples input by the user may be obtained, or a plurality of training samples may be obtained from a database, or a plurality of training samples may be obtained from a third-party application system.
For S222, one training sample is obtained from each training sample, and the obtained training sample is used as a target training sample.
For S223, performing word segmentation according to the corpus segments in the target training sample, performing word vector generation according to each phrase obtained by word segmentation, calculating an average value of each generated word vector, and using the calculated data as a second word vector (i.e., corpus segment word vector).
For S224, performing word segmentation on the corpus segment definitions in the target training sample, inputting each phrase obtained by word segmentation into an initial model to perform sentence vector generation, and using the generated sentence vector as a second sentence vector (i.e., a corpus segment definition sentence vector).
For S225, a loss value is calculated according to the second word vector and the second sentence vector, and the calculated loss value is used as a second loss value.
The step of updating the parameters of the initial model according to the second loss value is not described herein again.
For S226, steps S222 to S226 are repeatedly executed until the second training target is reached.
The second training target includes: the second loss value reaches a third convergence condition or the number of iterations of the initial model reaches a fourth convergence condition.
The third convergence condition means that the second loss values calculated in two adjacent iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations refers to the number of times the second loss value has been calculated; that is, each calculation of the second loss value increases the iteration count by 1.
The fourth convergence condition is a specific numerical value.
For S227, the initial model that achieves the second training goal is a model that meets expected requirements, and thus the initial model that achieves the second training goal is taken as the sentence vector generation model.
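The per-sample procedure S221 to S227 differs from the batch procedure only in updating the initial model once per target training sample. A brief sketch follows, reusing the tokenizer, initial_model, and first_word_vector helpers assumed in the sketches above, and again assuming the word and sentence vectors share a common dimension.

```python
# Sketch of S221–S227: iterate over single target training samples.
import jieba
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(initial_model.parameters(), lr=5e-4)

def train_per_sample(samples, epochs=1):
    initial_model.train()
    for _ in range(epochs):  # stand-in for "until the second training target is reached"
        for fragment, definition in samples:                      # S222: one target training sample
            word_vec = torch.tensor(first_word_vector(fragment))  # S223: second word vector
            inputs = tokenizer(jieba.lcut(definition),
                               is_split_into_words=True, return_tensors="pt")
            sent_vec = initial_model(**inputs).last_hidden_state.mean(dim=1).squeeze(0)  # S224
            loss = F.kl_div(F.log_softmax(sent_vec, dim=-1),
                            F.softmax(word_vec, dim=-1),
                            reduction="sum")                       # S225: second loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```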
Referring to fig. 2, the present application also proposes a sentence vector generation apparatus, including:
a data obtaining module 100, configured to obtain target text data;
a sentence vector generation module 200, configured to input the target text data into a sentence vector generation model for sentence vector generation, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network by using a plurality of training samples, and each training sample includes: corpus fragment and corpus fragment definitions.
In this embodiment, the target text data is input into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each training sample includes: a corpus fragment and a corpus fragment definition. The neural network is thus trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.
In one embodiment, the above apparatus further comprises: a first model training module;
the first model training module is configured to obtain a plurality of training samples, obtain a preset batch number of the training samples as a training sample set, generate a word vector according to each corpus fragment in the training sample set to obtain a first word vector, generate a sentence vector for each corpus fragment definition in the training sample set by using an initial model, obtain a first sentence vector, obtain a first loss value by performing loss value calculation according to each first word vector and each first sentence vector, update parameters of the initial model according to the first loss value, use the updated initial model for calculating the first sentence vector next time, and repeatedly execute the step of obtaining the preset batch number of the training samples as the training sample set, and taking the initial model reaching the first training target as the sentence vector generation model until the first training target is reached.
In one embodiment, the first model training module includes: a training sample generation submodule;
the training sample generation submodule is configured to acquire dictionary data, where the dictionary data includes: a text segment and a text segment definition, the text segment comprising: the method comprises the steps of obtaining any one of a single Chinese character, a word and a idiom, obtaining any one of text segments from dictionary data as a target text segment, generating a training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, the text segment definition corresponding to the target text segment is used as the corpus segment definition of the training sample, and repeatedly executing the step of obtaining any one of the text segments from the dictionary data as the target text segment until the obtaining of the text segments in the dictionary data is completed or a sample generation end signal is obtained.
In one embodiment, the first model training module further comprises: a first word vector determination submodule;
the first word vector determining submodule is configured to perform word segmentation on each corpus fragment in the training sample set to obtain a corpus fragment phrase set, perform word vector generation on each phrase in each corpus fragment phrase set by using a preset word vector model to obtain a phrase word vector set, and perform average value calculation on each short term word vector set to obtain the first word vector.
In one embodiment, the first model training module further comprises: a first sentence vector determination submodule;
the first sentence vector determination submodule is configured to perform word segmentation on each corpus fragment definition in the training sample set to obtain a defined phrase set, and input each defined phrase set into the initial model to perform sentence vector generation to obtain the first sentence vector.
In one embodiment, the first model training module further comprises: a first loss value determination submodule;
the first loss value determining submodule is configured to obtain any one of the first word vectors as a target word vector, input the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to perform loss value calculation, and obtain a to-be-processed loss value, where the preset loss function uses a relative entropy loss function, and repeatedly executes the step of obtaining any one of the first word vectors as the target word vector until the obtaining of the first word vector is completed, and performs average value calculation on each to-be-processed loss value to obtain the first loss value.
In one embodiment, the above apparatus further comprises: a second model training module;
the second model training module is configured to obtain a plurality of training samples, obtain one training sample as a target training sample, perform word vector generation according to the corpus fragment in the target training sample to obtain a second word vector, perform sentence vector generation on the corpus fragment definition in the target training sample by using an initial model to obtain a second sentence vector, where the initial model is a model obtained based on a Bert model or an XLNET model, perform loss value calculation according to the second word vector and the second sentence vector to obtain a second loss value, update parameters of the initial model according to the second loss value, use the updated initial model for calculating the second sentence vector next time, and repeatedly execute the step of obtaining one training sample as a target training sample until a second training target is reached, using the initial model reaching the second training goal as the sentence vector generation model.
Referring to fig. 3, a computer device is also provided in the embodiment of the present application; the computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data such as the data used by the sentence vector generation method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a sentence vector generation method, which comprises the following steps: acquiring target text data; inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network with a plurality of training samples, and each training sample comprises: a corpus fragment and a corpus fragment definition.
In this embodiment, the target text data is input into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each training sample includes: a corpus fragment and a corpus fragment definition. The neural network is thus trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a sentence vector generation method, including the steps of: acquiring target text data; inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network by adopting a plurality of training samples, and each training sample comprises: corpus fragment and corpus fragment definitions.
In the sentence vector generation method executed above, the target text data is input into a sentence vector generation model to generate a sentence vector, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network with a plurality of training samples, and each training sample includes: a corpus fragment and a corpus fragment definition. The neural network is thus trained based on the corpus fragments and corpus fragment definitions to obtain the sentence vector generation model, which reduces training difficulty and avoids constructing sentence vectors with an unsupervised learning method or a contrastive-learning-based method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A sentence vector generation method, the method comprising:
acquiring target text data;
inputting the target text data into a sentence vector generation model for sentence vector generation to obtain a target sentence vector corresponding to the target text data, wherein the sentence vector generation model is obtained by training a neural network by adopting a plurality of training samples, and each training sample comprises: corpus fragment and corpus fragment definitions.
2. The method according to claim 1, wherein before the step of inputting the target text data into a sentence vector generation model to generate a sentence vector and obtaining a target sentence vector corresponding to the target text data, the method further comprises:
obtaining a plurality of the training samples;
acquiring training samples of a preset batch number as a training sample set;
generating a word vector according to each corpus fragment in the training sample set to obtain a first word vector;
sentence vector generation is carried out on each corpus fragment definition in the training sample set by adopting an initial model to obtain a first sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
calculating loss values according to the first word vectors and the first sentence vectors to obtain first loss values, updating parameters of the initial model according to the first loss values, and using the updated initial model for calculating the first sentence vectors next time;
repeatedly executing the step of obtaining the training samples with the preset batch number as a training sample set until a first training target is reached;
using the initial model reaching the first training goal as the sentence vector generation model.
3. The sentence vector generation method of claim 2, wherein the step of obtaining the plurality of training samples comprises:
obtaining dictionary data, the dictionary data comprising: a text segment and a text segment definition, the text segment comprising: any one of a single Chinese character, a word and an idiom, and the text segment definition being an explanatory description of the text segment;
acquiring any text segment from the dictionary data as a target text segment;
generating the training sample according to the target text segment and the text segment definition corresponding to the target text segment, wherein the target text segment is used as the corpus segment of the training sample, and the text segment definition corresponding to the target text segment is used as the corpus segment definition of the training sample;
and repeatedly executing the step of acquiring any text segment from the dictionary data as a target text segment until the acquisition of the text segment in the dictionary data is completed or a sample generation end signal is acquired.
4. The method according to claim 2, wherein said step of generating a word vector according to each of said corpus segments in said training sample set to obtain a first word vector comprises:
performing word segmentation processing on each corpus fragment in the training sample set to obtain a corpus fragment phrase set;
generating word vectors for all phrases in each corpus fragment phrase set by adopting a preset word vector model to obtain a phrase word vector set;
and carrying out average value calculation on each phrase word vector set to obtain the first word vector.
5. The method according to claim 2, wherein said step of generating a sentence vector for each of said corpus segment definitions in said training sample set using an initial model to obtain a first sentence vector comprises:
performing word segmentation processing on each corpus fragment definition in the training sample set to obtain a definition phrase set;
and inputting each definition phrase set into the initial model to generate a sentence vector, so as to obtain the first sentence vector.
6. The sentence vector generation method of claim 2, wherein the step of calculating a loss value from each of the first word vectors and each of the first sentence vectors to obtain a first loss value comprises:
acquiring any one of the first word vectors as a target word vector;
inputting the target word vector and the first sentence vector corresponding to the target word vector into a preset loss function to perform loss value calculation, so as to obtain a to-be-processed loss value, wherein the preset loss function adopts a relative entropy loss function;
repeatedly executing the step of obtaining any one first word vector as a target word vector until the acquisition of the first word vectors is completed;
and calculating the average value of the loss values to be processed to obtain the first loss value.
7. The method according to claim 1, wherein before the step of inputting the target text data into a sentence vector generation model to generate a sentence vector and obtaining a target sentence vector corresponding to the target text data, the method further comprises:
obtaining a plurality of the training samples;
acquiring one training sample as a target training sample;
generating word vectors according to the corpus segments in the target training sample to obtain second word vectors;
sentence vector generation is carried out on the corpus segment definition in the target training sample by adopting an initial model to obtain a second sentence vector, wherein the initial model is a model obtained based on a Bert model or an XLNET model;
calculating a loss value according to the second word vector and the second sentence vector to obtain a second loss value, updating parameters of the initial model according to the second loss value, and using the updated initial model for calculating the second sentence vector next time;
repeating the step of obtaining one of the training samples as a target training sample until a second training target is reached;
using the initial model reaching the second training goal as the sentence vector generation model.
8. An apparatus for sentence vector generation, the apparatus comprising:
the data acquisition module is used for acquiring target text data;
a sentence vector generation module, configured to input the target text data into a sentence vector generation model for sentence vector generation, so as to obtain a target sentence vector corresponding to the target text data, where the sentence vector generation model is a model obtained by training a neural network using a plurality of training samples, and each training sample includes: corpus fragment and corpus fragment definitions.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111250467.8A 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium Pending CN113935315A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111250467.8A CN113935315A (en) 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium
PCT/CN2022/090157 WO2023071115A1 (en) 2021-10-26 2022-04-29 Sentence vector generation method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111250467.8A CN113935315A (en) 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113935315A (en) 2022-01-14

Family

ID=79284360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250467.8A Pending CN113935315A (en) 2021-10-26 2021-10-26 Sentence vector generation method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113935315A (en)
WO (1) WO2023071115A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519395A (en) * 2022-02-22 2022-05-20 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device, and equipment
WO2023071115A1 (en) * 2021-10-26 2023-05-04 平安科技(深圳)有限公司 Sentence vector generation method and apparatus, device, and storage medium
CN114519395B (en) * 2022-02-22 2024-05-14 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device and equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018450122A1 (en) * 2018-11-19 2021-06-17 Genesys Telecommunications Laboratories, Inc., Method and system for sentiment analysis
CN109960804B (en) * 2019-03-21 2023-05-02 江西风向标教育科技有限公司 Method and device for generating topic text sentence vector
CN111222329B (en) * 2019-12-10 2023-08-01 上海八斗智能技术有限公司 Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN111709223B (en) * 2020-06-02 2023-08-08 上海硬通网络科技有限公司 Sentence vector generation method and device based on bert and electronic equipment
CN112016296B (en) * 2020-09-07 2023-08-25 平安科技(深圳)有限公司 Sentence vector generation method, sentence vector generation device, sentence vector generation equipment and sentence vector storage medium
CN113935315A (en) * 2021-10-26 2022-01-14 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023071115A1 (en) 2023-05-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40063350
Country of ref document: HK