CN112328734B - Method, device and computer equipment for generating text data - Google Patents

Method, device and computer equipment for generating text data

Info

Publication number
CN112328734B
CN112328734B (application CN202011224705.3A)
Authority
CN
China
Prior art keywords
text
vector
standard
training
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011224705.3A
Other languages
Chinese (zh)
Other versions
CN112328734A (en)
Inventor
阮智昊
李茂昌
江炼鑫
莫洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202011224705.3A
Publication of CN112328734A
Application granted
Publication of CN112328734B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to big data technology, and discloses a method for generating text data, which comprises the following steps: acquiring standard training texts in the training data and extended training texts corresponding to the standard training texts; inputting a standard training text into a first Bert which is pre-arranged under the twin framework, and inputting an extended training text into a second Bert which is pre-arranged under the twin framework; training a twin Bert model consisting of the first Bert and the second Bert under the constraint of a loss function; judging whether the loss function converges; if yes, determining the model parameters of the twin Bert model; inputting standard text data for which similar text is to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text. By arranging the Bert model under the twin framework and designing the training loss function of the twin Bert model, the generated text data has higher similarity, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.

Description

Method, device and computer equipment for generating text data
Technical Field
The present application relates to the field of big data, and in particular, to a method, an apparatus, and a computer device for generating text data.
Background
In many application scenarios of natural language processing technology, a large amount of corpus is needed to support model prediction and generation. If the corpus is insufficient or the samples are severely unbalanced, Chinese text data enhancement is required. Chinese text data enhancement may be performed by generating similar questions or generating synonyms: a short corpus text is input, and a short text with similar semantics but different wording is output. However, existing methods for generating similar questions or synonyms are mostly realized by a seq2seq model, synonym-based addition and deletion transformation, a back-translation model and the like, and cannot meet the data precision required for training a natural language processing model in terms of the coherence and semantic accuracy of the generated text.
Disclosure of Invention
The main purpose of the present application is to solve the technical problem that existing methods for generating text data cannot meet, in terms of the coherence and semantic accuracy of the generated text, the data precision required for training natural language processing models.
The application provides a method for generating text data, which comprises the following steps:
acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard text, v2 represents the vector corresponding to the extended text, and label represents the annotation data;
judging whether the loss function converges or not;
if yes, determining model parameters of the twin Bert model;
inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
Preferably, the step of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text includes:
inputting the vector corresponding to the standard text into a nearest neighbor model;
screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
acquiring text data corresponding to the specified sentence vector;
and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
Preferably, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database through the nearest neighbor model, the method includes:
performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
inputting the spliced vector into a Bert classification model for descending-order sorting;
acquiring a specified number of spliced vectors ranked first in the descending order;
and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
Preferably, the step of performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector includes:
acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
calculating, according to the first vector and the second vector, the element granularity difference, the element granularity product and the cosine distance between the first vector and the second vector;
and splicing the element granularity difference, the element granularity product and the cosine distance into the spliced vector.
Preferably, the step of determining the expanded text corresponding to the standard text according to the specified number of spliced vectors includes:
calculating the correlation degree between every two of the specified number of expanded texts;
judging whether the correlation degree reaches the threshold value at which two texts are regarded as correlated;
and if yes, combining the two expanded texts into one expanded text.
Preferably, the step of calculating the correlation between each two of the specified number of expanded texts includes:
calculating, for every two of the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between the two texts, as well as a first TF-IDF and a first BM25 corresponding to the Levenshtein distance and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
inputting the correlation feature into a random forest framework, calling a random forest function, and performing the repetition calculation;
And taking the classification result output by the random forest framework as the correlation degree.
The application provides a device for generating text data, which comprises:
the first acquisition module is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
the first input module is used for inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
a training module for training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard text, v2 represents the vector corresponding to the extended text, and label represents the annotation data;
the judging module is used for judging whether the loss function converges or not;
the first determining module is used for determining model parameters of the twin Bert model if convergence occurs;
The second input module is used for inputting standard text data of a similar text to be generated into the twin Bert model and generating a vector corresponding to the standard text;
and the second acquisition module is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described method.
According to the method, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
Drawings
FIG. 1 is a flow chart of a method of generating text data according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an apparatus for generating text data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, the method for generating text data of the present embodiment includes:
S1: acquiring standard training texts in the training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
S2: inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
S3: training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data;
S4: judging whether the loss function converges;
S5: if yes, determining model parameters of the twin Bert model;
S6: inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
S7: acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
In the embodiment of the application, two identical Berts are pre-arranged under a twin framework to form a twin Bert model, and the twin Bert model is trained on the training data through the set loss function, so that the two Berts share the same parameters. Because Bert pre-trains deep bidirectional representations by jointly conditioning on context in all layers through a bidirectional encoder, a polysemous word can be interpreted and represented according to its context when combined with the alignment of the twin vectors, and a sentence vector conforming to the contextual semantic information is finally generated, so that the semantic accuracy is improved and the twin Bert model can accurately determine the expanded text similar to the standard text. When the sentence-vector mapping of the standard text is more accurate, the expanded text screened out for the standard text is more coherent with and semantically closer to the standard text, and can meet the precision requirement for expanding the text data.
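By way of illustration only, the following is a minimal sketch of the twin-Bert training step described above, written in Python with the Hugging Face transformers package. The checkpoint name, the mean-pooling strategy and the cosine-similarity regression loss are assumptions made for this sketch; the loss expression of the present application is given as a formula that is not reproduced in this text.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# One Bert instance with shared weights acts as both twins of the twin framework.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(bert.parameters(), lr=2e-5)

def encode(texts):
    # Sentence vector = attention-masked mean pooling over token embeddings.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state            # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def train_step(standard_texts, extended_texts, labels):
    # labels: 1.0 for a genuine standard/extended pair, 0.0 for a negative pair.
    v1 = encode(standard_texts)      # first Bert branch
    v2 = encode(extended_texts)      # second Bert branch, same parameters
    sim = F.cosine_similarity(v1, v2)
    loss = F.mse_loss(sim, torch.tensor(labels, dtype=torch.float))  # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Training would be repeated until the loss no longer decreases (step S4), after which the shared model parameters are fixed (step S5).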
Further, the step S7 of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text includes:
S71: inputting the vector corresponding to the standard text into a nearest neighbor model;
S72: screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
S73: acquiring text data corresponding to the specified sentence vector;
S74: taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In the embodiment of the application, the nearest neighbor model plays the role of a recall layer, and the recall corpus used includes, in addition to the training corpus, a public high-quality data set of user dialogues, which ensures both a rich recall corpus and the coherence of the texts. The nearest neighbor model is a model that retrieves data according to nearest neighbors: based on the similarity of the data, the items most similar to the target data are searched from a database, with similarity quantified as the distance between data points in space, so that the closer two data points are in space, the more similar they are. When the first K data items closest to the target data need to be searched, this is the K-nearest-neighbor search model, K-NN for short. The recall layer obtains the specified sentence vectors that are approximate nearest neighbors in a high-dimensional space and, by determining these approximate nearest-neighbor sentence vectors, screens vectors meeting the preset similarity requirement with the standard text in the preset database. The embodiment of the application uses the Annoy open-source library: the sentence vector generated from the corpus of the standard text input by the user is extracted, an Annoy index is built from the sentence vectors, a binary tree is constructed over the preset database, and approximate nearest-neighbor search with O(log n) time complexity is achieved through binary-tree search. The search process is, for example: "from gensim.models import KeyedVectors; sent2vec_model = KeyedVectors.load('sent2vec_model'); sent2vec_model.similar_by_vector(query)".
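As a concrete illustration of the recall layer, the following sketch builds and queries an Annoy index over pre-computed sentence vectors. The vector dimensionality, the angular metric, the tree count and the variable names (corpus_vectors, corpus_texts, standard_vector) are assumptions made for this sketch, not values fixed by the present application.

from annoy import AnnoyIndex

DIM = 768                                  # assumed sentence-vector dimensionality
index = AnnoyIndex(DIM, "angular")

# corpus_vectors / corpus_texts: sentence vectors and their source sentences,
# produced beforehand by the twin Bert model over the recall corpus.
for i, vec in enumerate(corpus_vectors):
    index.add_item(i, vec)
index.build(50)                            # 50 trees; tree search gives approximate O(log n) lookups
index.save("recall_corpus.ann")

# Approximate nearest neighbours of the standard text's sentence vector:
candidate_ids = index.get_nns_by_vector(standard_vector, 20)
candidate_texts = [corpus_texts[i] for i in candidate_ids]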
Further, after the step S72 of screening the specified sentence vector satisfying the preset similarity requirement with the standard text in the preset database through the nearest neighbor model, the method includes:
S75: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
S76: inputting the spliced vector into a Bert classification model for descending-order sorting;
S77: acquiring a specified number of spliced vectors ranked first in the descending order;
S78: determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In order to improve the accuracy of the selected similar sentence vectors, each specified sentence vector output by the recall layer is vector-spliced with the sentence vector corresponding to the standard text, and the spliced vectors are then sorted in descending order by a Bert classification model, which plays the role of a ranking layer. A specified number of the top-ranked spliced vectors are determined according to the descending order given by the Bert classification model, so as to identify the sentence vectors with the highest degree of similarity and, in turn, the expanded texts that are closest to the standard text in coherence and semantics.
Further, the step S75 of performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector includes:
S751: acquiring a first vector corresponding to the standard text and a second vector screened out by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
S752: calculating, according to the first vector and the second vector, the element granularity difference, the element granularity product and the cosine distance between the first vector and the second vector;
S753: splicing the element granularity difference, the element granularity product and the cosine distance into the spliced vector.
In the present application, the first vector and the second vector are combined in an interactive calculation, and the interactive information features between the two vectors are captured, so as to improve the ranking accuracy of the Bert classification model. The interactive calculation includes, but is not limited to, calculating the element granularity difference, the element granularity product and the cosine distance between the two vectors, so as to obtain the interactive information of the two text vectors from different dimensions and extract the interactive information features. For example, the first vector is the vector vec_q1 of the standard text and the second vector is the vector vec_q2 of one of the expanded texts output by the recall layer. The interactive operations include calculating the element-wise difference between the two vectors, vec_q1 - vec_q2; calculating the element-wise product between the two vectors, vec_q1 * vec_q2; and calculating the cosine distance of the two vectors, vec_q1 · vec_q2. For example, with vec_q1 = [x1, x2, x3, ..., xn] and vec_q2 = [y1, y2, y3, ..., yn]: vec_q1 - vec_q2 = [x1-y1, x2-y2, x3-y3, ...]; vec_q1 * vec_q2 = [x1*y1, x2*y2, x3*y3, ...]. The cosine distance vec_q1 · vec_q2 in the embodiment of the present application is not normalized by the vector norms, so as to reduce the amount of calculation, i.e. vec_q1 · vec_q2 = x1*y1 + x2*y2 + x3*y3 + ... + xn*yn.
According to the vector input format of the Bert classification model, [CLS] is added at the beginning of the spliced vector and [SEP] is placed between every two vectors. The input after the final concatenation is [CLS, vec_q1, SEP, vec_q2, SEP, vec_q1-vec_q2, SEP, vec_q1·vec_q2].
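Putting the above together, a sketch of the interaction features and of the ranking step might look as follows; it uses NumPy for clarity, and the scorer callable standing in for the Bert classification model, the candidate_vectors name and the top-k value are assumptions of this sketch rather than details given in the text.

import numpy as np

def build_interaction_features(vec_q1, vec_q2):
    diff = vec_q1 - vec_q2                 # element-granularity (element-wise) difference
    prod = vec_q1 * vec_q2                 # element-granularity (element-wise) product
    dot = np.dot(vec_q1, vec_q2)           # unnormalized "cosine distance" described above
    # Spliced input for the ranking model: the two vectors plus their interaction features.
    return np.concatenate([vec_q1, vec_q2, diff, prod, [dot]])

# Score every recalled candidate against the standard text and sort in descending order.
scores = [scorer(build_interaction_features(vec_q1, v)) for v in candidate_vectors]
ranked = sorted(zip(scores, candidate_vectors), key=lambda p: p[0], reverse=True)
top_candidates = ranked[:10]               # the "specified number" of top-ranked spliced vectors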
Further, the step S78 of determining the expanded text corresponding to the standard text according to the specified number of spliced vectors includes:
S781: calculating the correlation degree between every two of the specified number of expanded texts;
S782: judging whether the correlation degree reaches the threshold value at which two texts are regarded as correlated;
S783: if yes, combining the two expanded texts into one expanded text.
In the embodiment of the application, the correlation degree can be expressed by the Levenshtein distance, so that the amount of calculation is kept as small as possible while the precision requirement is met. Whether the character strings of two texts are essentially identical is judged by calculating the Levenshtein distance between the obtained expanded texts. For example, if the Levenshtein distance is less than or equal to 2, the strings of the two expanded texts are substantially identical and the correlation threshold is reached, i.e. LevenshteinDistance(q1, q2) <= 2, where q1 and q2 represent the two specified expanded texts and LevenshteinDistance represents the Levenshtein distance. In that case the two specified expanded texts are correlated or identical and need to be deduplicated; the two may be combined into one, so as to increase the effectiveness of the expanded texts as expanded training data for other models.
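A minimal deduplication sketch following the Levenshtein rule above is shown below; the threshold of 2 comes from the example in the text, while the strategy of keeping the first text of a near-duplicate pair is an assumption made for the sketch.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def deduplicate(expanded_texts, threshold=2):
    # Keep a text only if it is not within the threshold distance of any kept text.
    kept = []
    for text in expanded_texts:
        if all(levenshtein(text, other) > threshold for other in kept):
            kept.append(text)
    return kept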
Further, the step S781 of calculating the correlation degree between every two of the specified number of expanded texts includes:
S7811: calculating, for every two of the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between the two texts, as well as a first TF-IDF and a first BM25 corresponding to the Levenshtein distance and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
S7812: splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
S7813: inputting the correlation feature into a random forest framework, calling a random forest function, and performing the repetition calculation;
S7814: taking the classification result output by the random forest framework as the correlation degree.
In order to evaluate the correlation of two expanded texts from different dimensions and improve the deduplication effect, the embodiment of the application calculates the Levenshtein distance and the Jaro-Winkler distance of the two expanded texts, splices them together with the TF-IDF (term frequency-inverse document frequency) and BM25 (Best Matching 25) values corresponding to the Levenshtein distance and the Jaro-Winkler distance respectively into one correlation feature, and inputs the feature into a random forest framework for a duplicate/non-duplicate binary classification. When the output classification result is "duplicate", the two expanded texts are identical or correlated and need deduplication. TF-IDF is a weighting technique for information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency; BM25 is an optimized variant of TF-IDF.
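The following sketch illustrates the relevance classification described above. The jellyfish library is used here for the string distances and scikit-learn for the random forest; the tfidf_sim and bm25_sim helpers, the exact pairing of the TF-IDF/BM25 values with the two distances, the training data names and the forest hyperparameters are assumptions of this sketch, and the classifier must first be fitted on labelled duplicate/non-duplicate pairs.

import numpy as np
import jellyfish
from sklearn.ensemble import RandomForestClassifier

def pair_features(q1, q2, tfidf_sim, bm25_sim):
    # String-level distances between the two expanded texts.
    lev = jellyfish.levenshtein_distance(q1, q2)
    jw = jellyfish.jaro_winkler_similarity(q1, q2)
    # tfidf_sim / bm25_sim are assumed callables returning one similarity score per pair;
    # the text describes first/second TF-IDF and BM25 values tied to each distance.
    return np.array([lev, jw, tfidf_sim(q1, q2), bm25_sim(q1, q2)])

# Fit on labelled pairs prepared offline, then classify a new pair as duplicate or not.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_features, train_labels)
is_duplicate = forest.predict(pair_features(q1, q2, tfidf_sim, bm25_sim).reshape(1, -1))[0]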
Referring to fig. 2, an apparatus for generating text data according to an embodiment of the present application includes:
the first acquisition module 1 is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
a first input module 2, configured to input the standard training text into a first Bert that is pre-arranged under a twin frame, and input the extended training text into a second Bert that is pre-arranged under the twin frame;
a training module 3, configured to train a twin Bert model composed of the first Bert and the second Bert under a constraint of a loss function, where an expression of the loss function isloss represents a loss function,/->Representing the vector corresponding to the standard training text, < >>Representing vectors corresponding to the extended training text, and label representing annotation data;
a judging module 4, configured to judge whether the loss function converges;
the first determining module 5 is configured to determine model parameters of the twin Bert model if convergence occurs;
the second input module 6 is used for inputting standard text data of a similar text to be generated into the twin Bert model and generating a vector corresponding to the standard text;
And the second acquisition module 7 is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
In the embodiment of the application, two identical Berts are pre-arranged under a twin framework to form a twin Bert model, and the twin Bert model is trained on the training data through the set loss function, so that the two Berts share the same parameters. Because Bert pre-trains deep bidirectional representations by jointly conditioning on context in all layers through a bidirectional encoder, a polysemous word can be interpreted and represented according to its context when combined with the alignment of the twin vectors, and a sentence vector conforming to the contextual semantic information is finally generated, so that the semantic accuracy is improved and the twin Bert model can accurately determine the expanded text similar to the standard text. When the sentence-vector mapping of the standard text is more accurate, the expanded text screened out for the standard text is more coherent with and semantically closer to the standard text, and can meet the precision requirement for expanding the text data.
Further, an apparatus for generating text data, comprising:
the third input module is used for inputting the vector corresponding to the standard text into a nearest neighbor model;
The screening module is used for screening specified sentence vectors meeting the preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
the third acquisition module is used for acquiring text data corresponding to the specified sentence vector;
and the module is used for taking the text data corresponding to the specified sentence vector as the expansion text corresponding to the standard text.
In the embodiment of the application, the nearest neighbor model plays the role of a recall layer, and the recall corpus used includes, in addition to the training corpus, a public high-quality data set of user dialogues, which ensures both a rich recall corpus and the coherence of the texts. The recall layer obtains the specified sentence vectors that are approximate nearest neighbors in a high-dimensional space and, by determining these approximate nearest-neighbor sentence vectors, screens vectors meeting the preset similarity requirement with the standard text in the preset database. The embodiment of the application uses the Annoy open-source library: the sentence vector generated from the corpus of the standard text input by the user is extracted, an Annoy index is built from the sentence vectors, a binary tree is constructed over the preset database, and approximate nearest-neighbor search with O(log n) time complexity is achieved through binary-tree search. The search process is, for example: "from gensim.models import KeyedVectors; sent2vec_model = KeyedVectors.load('sent2vec_model'); sent2vec_model.similar_by_vector(query)".
Further, an apparatus for generating text data, comprising:
the forming module is used for carrying out vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
the fourth input module is used for inputting the spliced vector into the Bert classification model to carry out descending order sequencing;
a fourth acquisition module, configured to acquire a specified number of spliced vectors ranked first in the descending order;
and the second determining module is used for determining the expansion text corresponding to the standard text according to the specified number of splicing vectors.
In order to improve the accuracy of the selected similar sentence vectors, each specified sentence vector output by the recall layer is vector-spliced with the sentence vector corresponding to the standard text, and the spliced vectors are then sorted in descending order by a Bert classification model, which plays the role of a ranking layer. A specified number of the top-ranked spliced vectors are determined according to the descending order given by the Bert classification model, so as to identify the sentence vectors with the highest degree of similarity and, in turn, the expanded texts that are closest to the standard text in coherence and semantics.
Further, forming a module, comprising:
The acquisition unit is used for acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
the first calculating unit is used for calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector according to the first vector and the second vector;
and the splicing unit is used for splicing the element granularity difference, the element granularity product and the cosine distance among vectors into the spliced vector.
In the present application, the first vector and the second vector are combined in an interactive calculation, and the interactive information features between the two vectors are captured, so as to improve the ranking accuracy of the Bert classification model. The interactive calculation includes, but is not limited to, calculating the element granularity difference, the element granularity product and the cosine distance between the two vectors, so as to obtain the interactive information of the two text vectors from different dimensions and extract the interactive information features. For example, the first vector is the vector vec_q1 of the standard text and the second vector is the vector vec_q2 of one of the expanded texts output by the recall layer. The interactive operations include calculating the element-wise difference between the two vectors, vec_q1 - vec_q2; calculating the element-wise product between the two vectors, vec_q1 * vec_q2; and calculating the cosine distance of the two vectors, vec_q1 · vec_q2. For example, with vec_q1 = [x1, x2, x3, ..., xn] and vec_q2 = [y1, y2, y3, ..., yn]: vec_q1 - vec_q2 = [x1-y1, x2-y2, x3-y3, ...]; vec_q1 * vec_q2 = [x1*y1, x2*y2, x3*y3, ...]. The cosine distance vec_q1 · vec_q2 in the embodiment of the present application is not normalized by the vector norms, so as to reduce the amount of calculation, i.e. vec_q1 · vec_q2 = x1*y1 + x2*y2 + x3*y3 + ... + xn*yn.
Then, according to the vector input format of the Bert classification model, [CLS] is added at the beginning of the spliced vector and [SEP] is placed between every two vectors. The input after the final concatenation is [CLS, vec_q1, SEP, vec_q2, SEP, vec_q1-vec_q2, SEP, vec_q1·vec_q2].
Further, the second determining module includes:
the second calculation unit is used for calculating the correlation degree between every two expansion texts in the appointed number;
the judging unit is used for judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated;
and the merging unit is used for merging the two appointed expanded texts into one expanded text if the corresponding threshold value is reached when the two texts are related.
In the embodiment of the application, the correlation degree can be expressed by the Levenshtein distance, so that the amount of calculation is kept as small as possible while the precision requirement is met. Whether the character strings of two texts are essentially identical is judged by calculating the Levenshtein distance between the obtained expanded texts. For example, if the Levenshtein distance is less than or equal to 2, the strings of the two expanded texts are substantially identical and the correlation threshold is reached, i.e. LevenshteinDistance(q1, q2) <= 2, where q1 and q2 represent the two specified expanded texts and LevenshteinDistance represents the Levenshtein distance. In that case the two specified expanded texts are correlated or identical and need to be deduplicated; the two may be combined into one, so as to increase the effectiveness of the expanded texts as expanded training data for other models.
Further, the second calculation unit includes:
a calculating subunit, configured to calculate, in the specified number of extended texts, a Levenshtein distance, a Jaro-Winkler distance between every two extended texts, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance, respectively;
a splicing subunit, configured to splice the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF, and the second BM25 to form a correlation feature;
an input subunit, configured to input the correlation feature into a random forest frame, call a random forest function, and perform repetition calculation;
and the subunit is used for taking the classification result output by the random forest framework as the correlation degree.
In order to evaluate the correlation of two expanded texts from different dimensions and improve the deduplication effect, the embodiment of the application calculates the Levenshtein distance and the Jaro-Winkler distance of the two expanded texts, splices them together with the TF-IDF (term frequency-inverse document frequency) and BM25 (Best Matching 25) values corresponding to the Levenshtein distance and the Jaro-Winkler distance respectively into one correlation feature, and inputs the feature into a random forest framework for a duplicate/non-duplicate binary classification. When the output classification result is "duplicate", the two expanded texts are identical or correlated and need deduplication. TF-IDF is a weighting technique for information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency; BM25 is an optimized variant of TF-IDF.
Referring to fig. 3, a computer device is further provided in the embodiment of the present application. The computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the process of generating text data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of generating text data.
The processor executes the method for generating text data, which comprises the following steps: acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data; inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame; training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data; judging whether the loss function converges; if yes, determining model parameters of the twin Bert model; inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
According to the computer device, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
In one embodiment, the step of obtaining, by the processor, the expanded text similar to the standard text according to the vector corresponding to the standard text includes: inputting the vector corresponding to the standard text into a nearest neighbor model; screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model; acquiring text data corresponding to the specified sentence vector; and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In one embodiment, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database by the nearest neighbor model, the processor performs the following steps: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector; inputting the spliced vector into a Bert classification model for descending-order sorting; acquiring a specified number of spliced vectors ranked first in the descending order; and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In one embodiment, the step of performing vector concatenation on the vector corresponding to the standard text and the specified sentence vector by the processor to form a concatenated vector includes: acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors; according to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector; and splicing the element granularity differences, element granularity products and cosine distances among vectors into the spliced vectors.
In one embodiment, the step of determining the expanded text corresponding to the standard text by the processor according to the specified number of spliced vectors includes: respectively calculating the correlation degree between every two of the expansion texts in the appointed number; judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated; if yes, combining the two appointed extended texts into one extended text.
In one embodiment, the step of calculating the correlation between each two of the specified number of expanded texts by the processor includes: respectively calculating a Levenshtein distance, a Jaro-Winkler distance between every two extended texts in the appointed number, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance; splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature; inputting the correlation characteristics into a random forest frame, calling a random forest function, and calculating the repeatability; and taking the classification result output by the random forest framework as the correlation degree.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of generating text data, comprising: acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data; inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame; training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data; judging whether the loss function converges; if yes, determining model parameters of the twin Bert model; inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
According to the computer-readable storage medium, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
In one embodiment, the step of obtaining, by the processor, the expanded text similar to the standard text according to the vector corresponding to the standard text includes: inputting the vector corresponding to the standard text into a nearest neighbor model; screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model; acquiring text data corresponding to the specified sentence vector; and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In one embodiment, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database by the nearest neighbor model, the processor performs the following steps: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector; inputting the spliced vector into a Bert classification model for descending-order sorting; acquiring a specified number of spliced vectors ranked first in the descending order; and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In one embodiment, the step of performing vector concatenation on the vector corresponding to the standard text and the specified sentence vector by the processor to form a concatenated vector includes: acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors; according to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector; and splicing the element granularity differences, element granularity products and cosine distances among vectors into the spliced vectors.
In one embodiment, the step of determining the expanded text corresponding to the standard text by the processor according to the specified number of spliced vectors includes: respectively calculating the correlation degree between every two of the expansion texts in the appointed number; judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated; if yes, combining the two appointed extended texts into one extended text.
In one embodiment, the step of calculating the correlation between each two of the specified number of expanded texts by the processor includes: respectively calculating a Levenshtein distance, a Jaro-Winkler distance between every two extended texts in the appointed number, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance; splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature; inputting the correlation characteristics into a random forest frame, calling a random forest function, and calculating the repeatability; and taking the classification result output by the random forest framework as the correlation degree.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (8)

1. A method of generating text data, comprising:
acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry labeling data;
Inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data;
judging whether the loss function converges or not;
if yes, determining model parameters of the twin Bert model;
inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
acquiring an expanded text similar to the standard text according to the vector corresponding to the standard text;
the step of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text comprises the following steps:
inputting the vector corresponding to the standard text into a nearest neighbor model;
screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
Acquiring text data corresponding to the specified sentence vector;
and taking the text data corresponding to the specified sentence vector as the expansion text corresponding to the standard text.
2. The method for generating text data according to claim 1, wherein after the step of screening a predetermined database for a specified sentence vector satisfying a predetermined similarity requirement with the standard text by the nearest neighbor model, the method comprises:
vector stitching is carried out on the vector corresponding to the standard text and the specified sentence vector to form a stitched vector;
inputting the spliced vector into a Bert classification model for descending order sorting;
acquiring a specified number of spliced vectors ranked first in the descending order;
and determining the expanded text corresponding to the standard text according to the specified number of splicing vectors.
3. The method of generating text data according to claim 2, wherein the step of vector stitching the vector corresponding to the standard text and the specified sentence vector to form a stitched vector includes:
acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
According to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector;
and splicing the first vector, the second vector, the element granularity difference, the element granularity product and the cosine distance among the vectors into the spliced vector.
4. The method of generating text data according to claim 2, wherein the step of determining the expanded text corresponding to the standard text based on the specified number of spliced vectors includes:
in the appointed number of the expanded texts, calculating the correlation degree between every two of the expanded texts respectively;
judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated;
if yes, combining the two texts with the correlation degree reaching the threshold value into one expanded text.
5. The method of generating text data as set forth in claim 4, wherein the step of calculating the correlation between each pair among the specified number of expanded texts, respectively, includes:
calculating, among the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between every two expanded texts, together with a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
inputting the correlation feature into a random forest framework, calling a random forest function, and calculating the degree of repetition;
and taking the classification result output by the random forest framework as the correlation degree.
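A sketch of the claim-5 feature splice and random forest scoring, assuming the python-Levenshtein, scikit-learn and rank_bm25 packages as stand-ins for the unspecified libraries, whitespace tokenization as a placeholder for a proper Chinese tokenizer, and a single TF-IDF and BM25 value per pair rather than the claim's separate first and second values:

```python
import Levenshtein
from rank_bm25 import BM25Okapi
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(a: str, b: str) -> list:
    """Splice edit-distance and retrieval scores for one pair of expanded texts."""
    lev = float(Levenshtein.distance(a, b))                  # Levenshtein edit distance
    jw = Levenshtein.jaro_winkler(a, b)                      # Jaro-Winkler similarity
    vec = TfidfVectorizer().fit([a, b])                      # TF-IDF fitted on the pair
    tfidf = cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0]
    bm25 = BM25Okapi([a.split(), b.split()])                 # tiny BM25 index over the pair
    bm25_score = bm25.get_scores(a.split())[1]               # a's terms scored against b
    return [lev, jw, tfidf, bm25_score]                      # the correlation feature

# A random forest trained on annotated duplicate / non-duplicate pairs then turns the
# feature vector into the correlation degree (training data not shown here).
forest = RandomForestClassifier(n_estimators=100)
# forest.fit(features, labels)
# correlation = forest.predict_proba([pair_features(text_a, text_b)])[0, 1]
```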
6. An apparatus for generating text data for implementing the method of any one of claims 1 to 5, comprising:
the first acquisition module is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
the first input module is used for inputting the standard training text into a first Bert pre-arranged under a twin framework, and inputting the extended training text into a second Bert pre-arranged under the twin framework;
the training module is used for training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein the loss function loss is calculated from the vector corresponding to the standard training text, the vector corresponding to the extended training text, and the annotation data label;
the judging module is used for judging whether the loss function converges or not;
the first determining module is used for determining model parameters of the twin Bert model if convergence occurs;
the second input module is used for inputting standard text data, for which a similar text is to be generated, into the twin Bert model, and generating a vector corresponding to the standard text;
the second acquisition module is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text;
the third input module is used for inputting the vector corresponding to the standard text into a nearest neighbor model;
the screening module is used for screening specified sentence vectors meeting the preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
the third acquisition module is used for acquiring text data corresponding to the specified sentence vector;
and a module used for taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202011224705.3A 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data Active CN112328734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011224705.3A CN112328734B (en) 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data

Publications (2)

Publication Number Publication Date
CN112328734A CN112328734A (en) 2021-02-05
CN112328734B true CN112328734B (en) 2024-02-13

Family

ID=74316148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011224705.3A Active CN112328734B (en) 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data

Country Status (1)

Country Link
CN (1) CN112328734B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894436B (en) * 2023-09-06 2023-12-15 神州医疗科技股份有限公司 Data enhancement method and system based on medical named entity recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324744A (en) * 2020-02-17 2020-06-23 中山大学 Data enhancement method based on target emotion analysis data set
CN111428867A (en) * 2020-06-15 2020-07-17 深圳市友杰智新科技有限公司 Model training method and device based on reversible separation convolution and computer equipment
CN111444731A (en) * 2020-06-15 2020-07-24 深圳市友杰智新科技有限公司 Model training method and device and computer equipment
CN111859986A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multitask twin network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200045128A (en) * 2018-10-22 2020-05-04 삼성전자주식회사 Model training method and apparatus, and data recognizing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant