CN112328734B - Method, device and computer equipment for generating text data - Google Patents

Method, device and computer equipment for generating text data

Info

Publication number
CN112328734B
CN112328734B (application CN202011224705.3A)
Authority
CN
China
Prior art keywords
text
vector
standard
training
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011224705.3A
Other languages
Chinese (zh)
Other versions
CN112328734A (en)
Inventor
阮智昊
李茂昌
江炼鑫
莫洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202011224705.3A
Publication of CN112328734A
Application granted
Publication of CN112328734B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to big data technology, and discloses a method for generating text data, which comprises the following steps: acquiring standard training texts in the training data and extended training texts corresponding to the standard training texts; inputting a standard training text into a first Bert which is pre-arranged under the twin framework, and inputting an extended training text into a second Bert which is pre-arranged under the twin framework; training a twin Bert model consisting of the first Bert and the second Bert under the constraint of a loss function; judging whether the loss function converges; if yes, determining the model parameters of the twin Bert model; inputting standard text data for which similar text is to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text. By arranging the Bert model under the twin framework and designing the training loss function of the twin Bert model, the generated text data has higher similarity, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.

Description

Method, device and computer equipment for generating text data
Technical Field
The present application relates to the field of big data, and in particular, to a method, an apparatus, and a computer device for generating text data.
Background
In many application scenarios of natural language processing technology, a large amount of corpus is needed to support model prediction and generation. If the corpus is insufficient or the samples are severely unbalanced, Chinese text data enhancement is required. Chinese text data enhancement may be performed by generating similar questions or generating synonyms: a short corpus text is input, and a short text with similar semantics but different wording is output. However, existing methods for generating similar questions or synonyms are mostly realized by a seq2seq model, synonym-based addition and deletion transformation, a back-translation model and the like, and cannot meet the data precision required for training a natural language processing model in terms of the coherence and semantic accuracy of the generated text.
Disclosure of Invention
The main purpose of the present application is to solve the technical problem that existing methods for generating text data cannot meet, in terms of the coherence and semantic accuracy of the generated text, the data precision required for training natural language processing models.
The application provides a method for generating text data, which comprises the following steps:
acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard text, v2 represents the vector corresponding to the extended text, and label represents the annotation data;
judging whether the loss function converges or not;
if yes, determining model parameters of the twin Bert model;
inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
Preferably, the step of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text includes:
inputting the vector corresponding to the standard text into a nearest neighbor model;
screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
acquiring text data corresponding to the specified sentence vector;
and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
Preferably, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database through the nearest neighbor model, the method includes:
performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
inputting the spliced vector into a Bert classification model for descending-order sorting;
acquiring a specified number of spliced vectors ranked first in the descending order;
and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
Preferably, the step of performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector includes:
acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
calculating, according to the first vector and the second vector, the element granularity difference, the element granularity product and the cosine distance between the first vector and the second vector;
and splicing the element granularity difference, the element granularity product and the cosine distance into the spliced vector.
Preferably, the step of determining the expanded text corresponding to the standard text according to the specified number of spliced vectors includes:
calculating the correlation degree between every two of the specified number of expanded texts;
judging whether the correlation degree reaches the threshold value at which two texts are regarded as correlated;
and if yes, combining the two expanded texts into one expanded text.
Preferably, the step of calculating the correlation between each two of the specified number of expanded texts includes:
calculating, for every two of the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between the two texts, as well as a first TF-IDF and a first BM25 corresponding to the Levenshtein distance and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
inputting the correlation feature into a random forest framework, calling a random forest function, and performing the repetition calculation;
And taking the classification result output by the random forest framework as the correlation degree.
The application provides a device for generating text data, which comprises:
the first acquisition module is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
the first input module is used for inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
a training module for training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard text, v2 represents the vector corresponding to the extended text, and label represents the annotation data;
the judging module is used for judging whether the loss function converges or not;
the first determining module is used for determining model parameters of the twin Bert model if convergence occurs;
The second input module is used for inputting standard text data of a similar text to be generated into the twin Bert model and generating a vector corresponding to the standard text;
and the second acquisition module is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above-described method.
According to the method, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
Drawings
FIG. 1 is a flow chart of a method of generating text data according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an apparatus for generating text data according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, the method for generating text data of the present embodiment includes:
S1: acquiring standard training texts in the training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
S2: inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
S3: training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data;
S4: judging whether the loss function converges;
S5: if yes, determining model parameters of the twin Bert model;
S6: inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
S7: acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
In the embodiment of the application, two identical Berts are pre-arranged under a twin framework to form a twin Bert model, and the twin Bert model is trained on the training data through the set loss function, so that the two Berts share the same parameters. Because Bert pre-trains deep bidirectional representations by jointly conditioning on context in all layers through a bidirectional encoder, a polysemous word can be interpreted and represented according to its context when combined with the alignment of the twin vectors, and a sentence vector conforming to the contextual semantic information is finally generated, so that the semantic accuracy is improved and the twin Bert model can accurately determine the expanded text similar to the standard text. When the sentence-vector mapping of the standard text is more accurate, the expanded text screened out for the standard text is more coherent with and semantically closer to the standard text, and can meet the precision requirement for expanding the text data.
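By way of illustration only, the following is a minimal sketch of the twin-Bert training step described above, written in Python with the Hugging Face transformers package. The checkpoint name, the mean-pooling strategy and the cosine-similarity regression loss are assumptions made for this sketch; the loss expression of the present application is given as a formula that is not reproduced in this text.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# One Bert instance with shared weights acts as both twins of the twin framework.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(bert.parameters(), lr=2e-5)

def encode(texts):
    # Sentence vector = attention-masked mean pooling over token embeddings.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state            # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def train_step(standard_texts, extended_texts, labels):
    # labels: 1.0 for a genuine standard/extended pair, 0.0 for a negative pair.
    v1 = encode(standard_texts)      # first Bert branch
    v2 = encode(extended_texts)      # second Bert branch, same parameters
    sim = F.cosine_similarity(v1, v2)
    loss = F.mse_loss(sim, torch.tensor(labels, dtype=torch.float))  # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Training would be repeated until the loss no longer decreases (step S4), after which the shared model parameters are fixed (step S5).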
Further, the step S7 of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text includes:
S71: inputting the vector corresponding to the standard text into a nearest neighbor model;
S72: screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
S73: acquiring text data corresponding to the specified sentence vector;
S74: taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In the embodiment of the application, the nearest neighbor model plays the role of a recall layer, and the recall corpus used includes, in addition to the training corpus, a public high-quality data set of user dialogues, which ensures both a rich recall corpus and the coherence of the texts. The nearest neighbor model is a model that retrieves data according to nearest neighbors: based on the similarity of the data, the items most similar to the target data are searched from a database, with similarity quantified as the distance between data points in space, so that the closer two data points are in space, the more similar they are. When the first K data items closest to the target data need to be searched, this is the K-nearest-neighbor search model, K-NN for short. The recall layer obtains the specified sentence vectors that are approximate nearest neighbors in a high-dimensional space and, by determining these approximate nearest-neighbor sentence vectors, screens vectors meeting the preset similarity requirement with the standard text in the preset database. The embodiment of the application uses the Annoy open-source library: the sentence vector generated from the corpus of the standard text input by the user is extracted, an Annoy index is built from the sentence vectors, a binary tree is constructed over the preset database, and approximate nearest-neighbor search with O(log n) time complexity is achieved through binary-tree search. The search process is, for example: "from gensim.models import KeyedVectors; sent2vec_model = KeyedVectors.load('sent2vec_model'); sent2vec_model.similar_by_vector(query)".
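As a concrete illustration of the recall layer, the following sketch builds and queries an Annoy index over pre-computed sentence vectors. The vector dimensionality, the angular metric, the tree count and the variable names (corpus_vectors, corpus_texts, standard_vector) are assumptions made for this sketch, not values fixed by the present application.

from annoy import AnnoyIndex

DIM = 768                                  # assumed sentence-vector dimensionality
index = AnnoyIndex(DIM, "angular")

# corpus_vectors / corpus_texts: sentence vectors and their source sentences,
# produced beforehand by the twin Bert model over the recall corpus.
for i, vec in enumerate(corpus_vectors):
    index.add_item(i, vec)
index.build(50)                            # 50 trees; tree search gives approximate O(log n) lookups
index.save("recall_corpus.ann")

# Approximate nearest neighbours of the standard text's sentence vector:
candidate_ids = index.get_nns_by_vector(standard_vector, 20)
candidate_texts = [corpus_texts[i] for i in candidate_ids]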
Further, after the step S72 of screening the specified sentence vector satisfying the preset similarity requirement with the standard text in the preset database through the nearest neighbor model, the method includes:
S75: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
S76: inputting the spliced vector into a Bert classification model for descending-order sorting;
S77: acquiring a specified number of spliced vectors ranked first in the descending order;
S78: determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In order to improve the accuracy of the selected similar sentence vectors, each specified sentence vector output by the recall layer is vector-spliced with the sentence vector corresponding to the standard text, and the spliced vectors are then sorted in descending order by a Bert classification model, which plays the role of a ranking layer. A specified number of the top-ranked spliced vectors are determined according to the descending order given by the Bert classification model, so as to identify the sentence vectors with the highest degree of similarity and, in turn, the expanded texts that are closest to the standard text in coherence and semantics.
Further, the step S75 of performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector includes:
S751: acquiring a first vector corresponding to the standard text and a second vector screened out by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
S752: calculating, according to the first vector and the second vector, the element granularity difference, the element granularity product and the cosine distance between the first vector and the second vector;
S753: splicing the element granularity difference, the element granularity product and the cosine distance into the spliced vector.
In the present application, the first vector and the second vector are combined in an interactive calculation, and the interactive information features between the two vectors are captured, so as to improve the ranking accuracy of the Bert classification model. The interactive calculation includes, but is not limited to, calculating the element granularity difference, the element granularity product and the cosine distance between the two vectors, so as to obtain the interactive information of the two text vectors from different dimensions and extract the interactive information features. For example, the first vector is the vector vec_q1 of the standard text and the second vector is the vector vec_q2 of one of the expanded texts output by the recall layer. The interactive operations include calculating the element-wise difference between the two vectors, vec_q1 - vec_q2; calculating the element-wise product between the two vectors, vec_q1 * vec_q2; and calculating the cosine distance of the two vectors, vec_q1 · vec_q2. For example, with vec_q1 = [x1, x2, x3, ..., xn] and vec_q2 = [y1, y2, y3, ..., yn]: vec_q1 - vec_q2 = [x1-y1, x2-y2, x3-y3, ...]; vec_q1 * vec_q2 = [x1*y1, x2*y2, x3*y3, ...]. The cosine distance vec_q1 · vec_q2 in the embodiment of the present application is not normalized by the vector norms, so as to reduce the amount of calculation, i.e. vec_q1 · vec_q2 = x1*y1 + x2*y2 + x3*y3 + ... + xn*yn.
According to the vector input format of the Bert classification model, [CLS] is added at the beginning of the spliced vector and [SEP] is placed between every two vectors. The input after the final concatenation is [CLS, vec_q1, SEP, vec_q2, SEP, vec_q1-vec_q2, SEP, vec_q1·vec_q2].
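Putting the above together, a sketch of the interaction features and of the ranking step might look as follows; it uses NumPy for clarity, and the scorer callable standing in for the Bert classification model, the candidate_vectors name and the top-k value are assumptions of this sketch rather than details given in the text.

import numpy as np

def build_interaction_features(vec_q1, vec_q2):
    diff = vec_q1 - vec_q2                 # element-granularity (element-wise) difference
    prod = vec_q1 * vec_q2                 # element-granularity (element-wise) product
    dot = np.dot(vec_q1, vec_q2)           # unnormalized "cosine distance" described above
    # Spliced input for the ranking model: the two vectors plus their interaction features.
    return np.concatenate([vec_q1, vec_q2, diff, prod, [dot]])

# Score every recalled candidate against the standard text and sort in descending order.
scores = [scorer(build_interaction_features(vec_q1, v)) for v in candidate_vectors]
ranked = sorted(zip(scores, candidate_vectors), key=lambda p: p[0], reverse=True)
top_candidates = ranked[:10]               # the "specified number" of top-ranked spliced vectors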
Further, the step S78 of determining the expanded text corresponding to the standard text according to the specified number of spliced vectors includes:
S781: calculating the correlation degree between every two of the specified number of expanded texts;
S782: judging whether the correlation degree reaches the threshold value at which two texts are regarded as correlated;
S783: if yes, combining the two expanded texts into one expanded text.
In the embodiment of the application, the correlation degree can be expressed by the Levenshtein distance, so that the amount of calculation is kept as small as possible while the precision requirement is met. Whether the character strings of two texts are essentially identical is judged by calculating the Levenshtein distance between the obtained expanded texts. For example, if the Levenshtein distance is less than or equal to 2, the strings of the two expanded texts are substantially identical and the correlation threshold is reached, i.e. LevenshteinDistance(q1, q2) <= 2, where q1 and q2 represent the two specified expanded texts and LevenshteinDistance represents the Levenshtein distance. In that case the two specified expanded texts are correlated or identical and need to be deduplicated; the two may be combined into one, so as to increase the effectiveness of the expanded texts as expanded training data for other models.
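A minimal deduplication sketch following the Levenshtein rule above is shown below; the threshold of 2 comes from the example in the text, while the strategy of keeping the first text of a near-duplicate pair is an assumption made for the sketch.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def deduplicate(expanded_texts, threshold=2):
    # Keep a text only if it is not within the threshold distance of any kept text.
    kept = []
    for text in expanded_texts:
        if all(levenshtein(text, other) > threshold for other in kept):
            kept.append(text)
    return kept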
Further, the step S781 of calculating the correlation degree between every two of the specified number of expanded texts includes:
S7811: calculating, for every two of the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between the two texts, as well as a first TF-IDF and a first BM25 corresponding to the Levenshtein distance and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
S7812: splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
S7813: inputting the correlation feature into a random forest framework, calling a random forest function, and performing the repetition calculation;
S7814: taking the classification result output by the random forest framework as the correlation degree.
In order to evaluate the correlation of two expanded texts from different dimensions and improve the deduplication effect, the embodiment of the application calculates the Levenshtein distance and the Jaro-Winkler distance of the two expanded texts, splices them together with the TF-IDF (term frequency-inverse document frequency) and BM25 (Best Matching 25) values corresponding to the Levenshtein distance and the Jaro-Winkler distance respectively into one correlation feature, and inputs the feature into a random forest framework for a duplicate/non-duplicate binary classification. When the output classification result is "duplicate", the two expanded texts are identical or correlated and need deduplication. TF-IDF is a weighting technique for information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency; BM25 is an optimized variant of TF-IDF.
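The following sketch illustrates the relevance classification described above. The jellyfish library is used here for the string distances and scikit-learn for the random forest; the tfidf_sim and bm25_sim helpers, the exact pairing of the TF-IDF/BM25 values with the two distances, the training data names and the forest hyperparameters are assumptions of this sketch, and the classifier must first be fitted on labelled duplicate/non-duplicate pairs.

import numpy as np
import jellyfish
from sklearn.ensemble import RandomForestClassifier

def pair_features(q1, q2, tfidf_sim, bm25_sim):
    # String-level distances between the two expanded texts.
    lev = jellyfish.levenshtein_distance(q1, q2)
    jw = jellyfish.jaro_winkler_similarity(q1, q2)
    # tfidf_sim / bm25_sim are assumed callables returning one similarity score per pair;
    # the text describes first/second TF-IDF and BM25 values tied to each distance.
    return np.array([lev, jw, tfidf_sim(q1, q2), bm25_sim(q1, q2)])

# Fit on labelled pairs prepared offline, then classify a new pair as duplicate or not.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_features, train_labels)
is_duplicate = forest.predict(pair_features(q1, q2, tfidf_sim, bm25_sim).reshape(1, -1))[0]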
Referring to fig. 2, an apparatus for generating text data according to an embodiment of the present application includes:
the first acquisition module 1 is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
a first input module 2, configured to input the standard training text into a first Bert that is pre-arranged under a twin frame, and input the extended training text into a second Bert that is pre-arranged under the twin frame;
a training module 3, configured to train a twin Bert model composed of the first Bert and the second Bert under a constraint of a loss function, where an expression of the loss function isloss represents a loss function,/->Representing the vector corresponding to the standard training text, < >>Representing vectors corresponding to the extended training text, and label representing annotation data;
a judging module 4, configured to judge whether the loss function converges;
the first determining module 5 is configured to determine model parameters of the twin Bert model if convergence occurs;
the second input module 6 is used for inputting standard text data of a similar text to be generated into the twin Bert model and generating a vector corresponding to the standard text;
And the second acquisition module 7 is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
In the embodiment of the application, two identical Berts are pre-arranged under a twin framework to form a twin Bert model, and the twin Bert model is trained on the training data through the set loss function, so that the two Berts share the same parameters. Because Bert pre-trains deep bidirectional representations by jointly conditioning on context in all layers through a bidirectional encoder, a polysemous word can be interpreted and represented according to its context when combined with the alignment of the twin vectors, and a sentence vector conforming to the contextual semantic information is finally generated, so that the semantic accuracy is improved and the twin Bert model can accurately determine the expanded text similar to the standard text. When the sentence-vector mapping of the standard text is more accurate, the expanded text screened out for the standard text is more coherent with and semantically closer to the standard text, and can meet the precision requirement for expanding the text data.
Further, an apparatus for generating text data, comprising:
the third input module is used for inputting the vector corresponding to the standard text into a nearest neighbor model;
The screening module is used for screening specified sentence vectors meeting the preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
the third acquisition module is used for acquiring text data corresponding to the specified sentence vector;
and the module is used for taking the text data corresponding to the specified sentence vector as the expansion text corresponding to the standard text.
In the embodiment of the application, the nearest neighbor model plays the role of a recall layer, and the recall corpus used includes, in addition to the training corpus, a public high-quality data set of user dialogues, which ensures both a rich recall corpus and the coherence of the texts. The recall layer obtains the specified sentence vectors that are approximate nearest neighbors in a high-dimensional space and, by determining these approximate nearest-neighbor sentence vectors, screens vectors meeting the preset similarity requirement with the standard text in the preset database. The embodiment of the application uses the Annoy open-source library: the sentence vector generated from the corpus of the standard text input by the user is extracted, an Annoy index is built from the sentence vectors, a binary tree is constructed over the preset database, and approximate nearest-neighbor search with O(log n) time complexity is achieved through binary-tree search. The search process is, for example: "from gensim.models import KeyedVectors; sent2vec_model = KeyedVectors.load('sent2vec_model'); sent2vec_model.similar_by_vector(query)".
Further, an apparatus for generating text data, comprising:
the forming module is used for carrying out vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector;
the fourth input module is used for inputting the spliced vector into the Bert classification model to carry out descending order sequencing;
a fourth acquisition module, configured to acquire a specified number of spliced vectors ranked first in the descending order;
and the second determining module is used for determining the expansion text corresponding to the standard text according to the specified number of splicing vectors.
In order to improve the accuracy of the selected similar sentence vectors, each specified sentence vector output by the recall layer is vector-spliced with the sentence vector corresponding to the standard text, and the spliced vectors are then sorted in descending order by a Bert classification model, which plays the role of a ranking layer. A specified number of the top-ranked spliced vectors are determined according to the descending order given by the Bert classification model, so as to identify the sentence vectors with the highest degree of similarity and, in turn, the expanded texts that are closest to the standard text in coherence and semantics.
Further, forming a module, comprising:
The acquisition unit is used for acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
the first calculating unit is used for calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector according to the first vector and the second vector;
and the splicing unit is used for splicing the element granularity difference, the element granularity product and the cosine distance among vectors into the spliced vector.
In the present application, the first vector and the second vector are combined in an interactive calculation, and the interactive information features between the two vectors are captured, so as to improve the ranking accuracy of the Bert classification model. The interactive calculation includes, but is not limited to, calculating the element granularity difference, the element granularity product and the cosine distance between the two vectors, so as to obtain the interactive information of the two text vectors from different dimensions and extract the interactive information features. For example, the first vector is the vector vec_q1 of the standard text and the second vector is the vector vec_q2 of one of the expanded texts output by the recall layer. The interactive operations include calculating the element-wise difference between the two vectors, vec_q1 - vec_q2; calculating the element-wise product between the two vectors, vec_q1 * vec_q2; and calculating the cosine distance of the two vectors, vec_q1 · vec_q2. For example, with vec_q1 = [x1, x2, x3, ..., xn] and vec_q2 = [y1, y2, y3, ..., yn]: vec_q1 - vec_q2 = [x1-y1, x2-y2, x3-y3, ...]; vec_q1 * vec_q2 = [x1*y1, x2*y2, x3*y3, ...]. The cosine distance vec_q1 · vec_q2 in the embodiment of the present application is not normalized by the vector norms, so as to reduce the amount of calculation, i.e. vec_q1 · vec_q2 = x1*y1 + x2*y2 + x3*y3 + ... + xn*yn.
Then, according to the vector input format of the Bert classification model, [CLS] is added at the beginning of the spliced vector and [SEP] is placed between every two vectors. The input after the final concatenation is [CLS, vec_q1, SEP, vec_q2, SEP, vec_q1-vec_q2, SEP, vec_q1·vec_q2].
Further, the second determining module includes:
the second calculation unit is used for calculating the correlation degree between every two expansion texts in the appointed number;
the judging unit is used for judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated;
and the merging unit is used for merging the two appointed expanded texts into one expanded text if the corresponding threshold value is reached when the two texts are related.
In the embodiment of the application, the correlation degree can be expressed by the Levenshtein distance, so that the amount of calculation is kept as small as possible while the precision requirement is met. Whether the character strings of two texts are essentially identical is judged by calculating the Levenshtein distance between the obtained expanded texts. For example, if the Levenshtein distance is less than or equal to 2, the strings of the two expanded texts are substantially identical and the correlation threshold is reached, i.e. LevenshteinDistance(q1, q2) <= 2, where q1 and q2 represent the two specified expanded texts and LevenshteinDistance represents the Levenshtein distance. In that case the two specified expanded texts are correlated or identical and need to be deduplicated; the two may be combined into one, so as to increase the effectiveness of the expanded texts as expanded training data for other models.
Further, the second calculation unit includes:
a calculating subunit, configured to calculate, in the specified number of extended texts, a Levenshtein distance, a Jaro-Winkler distance between every two extended texts, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance, respectively;
a splicing subunit, configured to splice the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF, and the second BM25 to form a correlation feature;
an input subunit, configured to input the correlation feature into a random forest frame, call a random forest function, and perform repetition calculation;
and the subunit is used for taking the classification result output by the random forest framework as the correlation degree.
In order to evaluate the correlation of two expanded texts from different dimensions and improve the deduplication effect, the embodiment of the application calculates the Levenshtein distance and the Jaro-Winkler distance of the two expanded texts, splices them together with the TF-IDF (term frequency-inverse document frequency) and BM25 (Best Matching 25) values corresponding to the Levenshtein distance and the Jaro-Winkler distance respectively into one correlation feature, and inputs the feature into a random forest framework for a duplicate/non-duplicate binary classification. When the output classification result is "duplicate", the two expanded texts are identical or correlated and need deduplication. TF-IDF is a weighting technique for information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency; BM25 is an optimized variant of TF-IDF.
Referring to fig. 3, a computer device is further provided in the embodiment of the present application. The computer device may be a server, and its internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus, wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all the data required by the process of generating text data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of generating text data.
The processor executes the method for generating text data, which comprises the following steps: acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data; inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame; training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data; judging whether the loss function converges; if yes, determining model parameters of the twin Bert model; inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
According to the computer device, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
In one embodiment, the step of obtaining, by the processor, the expanded text similar to the standard text according to the vector corresponding to the standard text includes: inputting the vector corresponding to the standard text into a nearest neighbor model; screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model; acquiring text data corresponding to the specified sentence vector; and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In one embodiment, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database by the nearest neighbor model, the processor performs the following steps: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector; inputting the spliced vector into a Bert classification model for descending-order sorting; acquiring a specified number of spliced vectors ranked first in the descending order; and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In one embodiment, the step of performing vector concatenation on the vector corresponding to the standard text and the specified sentence vector by the processor to form a concatenated vector includes: acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors; according to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector; and splicing the element granularity differences, element granularity products and cosine distances among vectors into the spliced vectors.
In one embodiment, the step of determining the expanded text corresponding to the standard text by the processor according to the specified number of spliced vectors includes: respectively calculating the correlation degree between every two of the expansion texts in the appointed number; judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated; if yes, combining the two appointed extended texts into one extended text.
In one embodiment, the step of calculating the correlation between each two of the specified number of expanded texts by the processor includes: respectively calculating a Levenshtein distance, a Jaro-Winkler distance between every two extended texts in the appointed number, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance; splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature; inputting the correlation characteristics into a random forest frame, calling a random forest function, and calculating the repeatability; and taking the classification result output by the random forest framework as the correlation degree.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of generating text data, comprising: acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data; inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame; training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data; judging whether the loss function converges; if yes, determining model parameters of the twin Bert model; inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text; and acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text.
According to the computer-readable storage medium, the Bert model is arranged under the twin framework, and the training loss function of the twin Bert model is designed, so that the similarity of the generated text data is higher, and the generated text corresponding to the expanded question is more coherent and semantically more accurate.
In one embodiment, the step of obtaining, by the processor, the expanded text similar to the standard text according to the vector corresponding to the standard text includes: inputting the vector corresponding to the standard text into a nearest neighbor model; screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model; acquiring text data corresponding to the specified sentence vector; and taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
In one embodiment, after the step of screening the specified sentence vector meeting the preset similarity requirement with the standard text in the preset database by the nearest neighbor model, the processor performs the following steps: performing vector splicing on the vector corresponding to the standard text and the specified sentence vector to form a spliced vector; inputting the spliced vector into a Bert classification model for descending-order sorting; acquiring a specified number of spliced vectors ranked first in the descending order; and determining the expanded text corresponding to the standard text according to the specified number of spliced vectors.
In one embodiment, the step of performing vector concatenation on the vector corresponding to the standard text and the specified sentence vector by the processor to form a concatenated vector includes: acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors; according to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector; and splicing the element granularity differences, element granularity products and cosine distances among vectors into the spliced vectors.
In one embodiment, the step of determining the expanded text corresponding to the standard text by the processor according to the specified number of spliced vectors includes: respectively calculating the correlation degree between every two of the expansion texts in the appointed number; judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated; if yes, combining the two appointed extended texts into one extended text.
In one embodiment, the step of calculating the correlation between each two of the specified number of expanded texts by the processor includes: respectively calculating a Levenshtein distance, a Jaro-Winkler distance between every two extended texts in the appointed number, and a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance; splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature; inputting the correlation characteristics into a random forest frame, calling a random forest function, and calculating the repeatability; and taking the classification result output by the random forest framework as the correlation degree.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (8)

1. A method of generating text data, comprising:
acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry labeling data;
Inputting the standard training text into a first Bert which is pre-arranged under a twin frame, and inputting the extended training text into a second Bert which is pre-arranged under the twin frame;
training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein, in the expression of the loss function, loss represents the loss function, v1 represents the vector corresponding to the standard training text, v2 represents the vector corresponding to the extended training text, and label represents the annotation data;
judging whether the loss function converges or not;
if yes, determining model parameters of the twin Bert model;
inputting standard text data of a similar text to be generated into the twin Bert model, and generating a vector corresponding to the standard text;
acquiring an expanded text similar to the standard text according to the vector corresponding to the standard text;
the step of obtaining the expanded text similar to the standard text according to the vector corresponding to the standard text comprises the following steps:
inputting the vector corresponding to the standard text into a nearest neighbor model;
screening a specified sentence vector meeting a preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
Acquiring text data corresponding to the specified sentence vector;
and taking the text data corresponding to the specified sentence vector as the expansion text corresponding to the standard text.
2. The method for generating text data according to claim 1, wherein after the step of screening a predetermined database for a specified sentence vector satisfying a predetermined similarity requirement with the standard text by the nearest neighbor model, the method comprises:
vector stitching is carried out on the vector corresponding to the standard text and the specified sentence vector to form a stitched vector;
inputting the spliced vector into a Bert classification model for descending order sorting;
acquiring a specified number of spliced vectors ranked first in the descending order;
and determining the expanded text corresponding to the standard text according to the specified number of splicing vectors.
3. The method of generating text data according to claim 2, wherein the step of vector stitching the vector corresponding to the standard text and the specified sentence vector to form a stitched vector includes:
acquiring a first vector corresponding to the standard text and a second vector screened by the nearest neighbor model, wherein the second vector is any one of the specified sentence vectors;
According to the first vector and the second vector, calculating element granularity differences, element granularity products and cosine distances among vectors of the first vector and the second vector;
and splicing the first vector, the second vector, the element granularity difference, the element granularity product and the cosine distance among the vectors into the spliced vector.
4. The method of generating text data according to claim 2, wherein the step of determining the expanded text corresponding to the standard text based on the specified number of spliced vectors includes:
in the appointed number of the expanded texts, calculating the correlation degree between every two of the expanded texts respectively;
judging whether the correlation degree reaches a corresponding threshold value when two texts are correlated;
if yes, combining the two texts with the correlation degree reaching the threshold value into one expanded text.
5. The method of generating text data as set forth in claim 4, wherein the step of calculating the correlation between each pair among the specified number of expanded texts, respectively, includes:
calculating, among the specified number of expanded texts, a Levenshtein distance and a Jaro-Winkler distance between every two expanded texts, together with a first TF-IDF and a first BM25 corresponding to the Levenshtein distance, and a second TF-IDF and a second BM25 corresponding to the Jaro-Winkler distance;
splicing the Levenshtein distance, the Jaro-Winkler distance, the first TF-IDF, the first BM25, the second TF-IDF and the second BM25 to form a correlation feature;
inputting the correlation feature into a random forest framework, calling a random forest function, and calculating the degree of repetition;
and taking the classification result output by the random forest framework as the correlation degree.
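A sketch of the claim-5 feature splice and random forest scoring, assuming the python-Levenshtein, scikit-learn and rank_bm25 packages as stand-ins for the unspecified libraries, whitespace tokenization as a placeholder for a proper Chinese tokenizer, and a single TF-IDF and BM25 value per pair rather than the claim's separate first and second values:

```python
import Levenshtein
from rank_bm25 import BM25Okapi
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(a: str, b: str) -> list:
    """Splice edit-distance and retrieval scores for one pair of expanded texts."""
    lev = float(Levenshtein.distance(a, b))                  # Levenshtein edit distance
    jw = Levenshtein.jaro_winkler(a, b)                      # Jaro-Winkler similarity
    vec = TfidfVectorizer().fit([a, b])                      # TF-IDF fitted on the pair
    tfidf = cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0]
    bm25 = BM25Okapi([a.split(), b.split()])                 # tiny BM25 index over the pair
    bm25_score = bm25.get_scores(a.split())[1]               # a's terms scored against b
    return [lev, jw, tfidf, bm25_score]                      # the correlation feature

# A random forest trained on annotated duplicate / non-duplicate pairs then turns the
# feature vector into the correlation degree (training data not shown here).
forest = RandomForestClassifier(n_estimators=100)
# forest.fit(features, labels)
# correlation = forest.predict_proba([pair_features(text_a, text_b)])[0, 1]
```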
6. An apparatus for generating text data for implementing the method of any one of claims 1 to 5, comprising:
the first acquisition module is used for acquiring standard training texts in training data and extended training texts corresponding to the standard training texts, wherein one standard training text corresponds to a plurality of extended training texts, and the standard training texts and the extended training texts both carry annotation data;
the first input module is used for inputting the standard training text into a first Bert pre-arranged under a twin framework, and inputting the extended training text into a second Bert pre-arranged under the twin framework;
the training module is used for training a twin Bert model composed of the first Bert and the second Bert under the constraint of a loss function, wherein the loss function loss is calculated from the vector corresponding to the standard training text, the vector corresponding to the extended training text, and the annotation data label;
the judging module is used for judging whether the loss function converges or not;
the first determining module is used for determining model parameters of the twin Bert model if convergence occurs;
the second input module is used for inputting standard text data, for which a similar text is to be generated, into the twin Bert model, and generating a vector corresponding to the standard text;
the second acquisition module is used for acquiring the expanded text similar to the standard text according to the vector corresponding to the standard text;
the third input module is used for inputting the vector corresponding to the standard text into a nearest neighbor model;
the screening module is used for screening specified sentence vectors meeting the preset similarity requirement with the standard text in a preset database through the nearest neighbor model;
the third acquisition module is used for acquiring text data corresponding to the specified sentence vector;
and a module used for taking the text data corresponding to the specified sentence vector as the expanded text corresponding to the standard text.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202011224705.3A 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data Active CN112328734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011224705.3A CN112328734B (en) 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data

Publications (2)

Publication Number Publication Date
CN112328734A CN112328734A (en) 2021-02-05
CN112328734B true CN112328734B (en) 2024-02-13

Family

ID=74316148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011224705.3A Active CN112328734B (en) 2020-11-05 2020-11-05 Method, device and computer equipment for generating text data

Country Status (1)

Country Link
CN (1) CN112328734B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894436B (en) * 2023-09-06 2023-12-15 神州医疗科技股份有限公司 Data enhancement method and system based on medical named entity recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324744A (en) * 2020-02-17 2020-06-23 中山大学 Data enhancement method based on target emotion analysis data set
CN111428867A (en) * 2020-06-15 2020-07-17 深圳市友杰智新科技有限公司 Model training method and device based on reversible separation convolution and computer equipment
CN111444731A (en) * 2020-06-15 2020-07-24 深圳市友杰智新科技有限公司 Model training method and device and computer equipment
CN111859986A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multitask twin network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200045128A (en) * 2018-10-22 2020-05-04 삼성전자주식회사 Model training method and apparatus, and data recognizing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant