WO2020220636A1 - Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium - Google Patents

Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

Info

Publication number
WO2020220636A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
target
word
replacement
words
Prior art date
2019-04-28
Application number
PCT/CN2019/117663
Other languages
French (fr)
Chinese (zh)
Inventor
于凤英
王健宗
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
2019-04-28
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020220636A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Definitions

  • This application relates to the field of machine learning technology, and in particular to a text data enhancement method and apparatus, an electronic device, and a non-volatile computer-readable storage medium.
  • In the field of machine learning, data augmentation is an important means of expanding a training set. It is often used to generate more new data to train a model, so that the model is more accurate and generalizes better.
  • The core point of data augmentation is to replace the original data with new data while ensuring that the new data and the original data belong to the same category. For data augmentation applied to images, this is easy to achieve: a new image obtained by horizontally flipping, randomly cropping, or adjusting the RGB channels of the original image still has the same content category as the original image.
  • The inventor of this application realized that for data augmentation applied to text, because the context within a text is interrelated, blindly reversing, truncating, or substituting parts of the original text will change its semantics, so the semantic accuracy of text data enhancement is low.
  • To solve this problem, this application provides a text data enhancement method and apparatus, and an electronic device.
  • In one aspect, a text data enhancement method includes: obtaining original text; performing word segmentation on the original text to obtain several candidate words; for a target candidate word, based on the context information of the target candidate word, using a bidirectional long short-term memory (BiLSTM) network model to obtain N replacement words from a preset dictionary, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text.
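  • For illustration only, the following Python sketch outlines this flow under stated assumptions: whitespace splitting stands in for real word segmentation, and `bilstm.top_replacements` is a hypothetical interface for the BiLSTM-based lookup described above; neither name comes from the source.

```python
# Minimal sketch of the claimed pipeline; "bilstm" and its method
# "top_replacements" are hypothetical stand-ins, not named in the patent.

def augment(original_text, dictionary, bilstm, n):
    """Generate first extended texts by replacing one candidate word at a time."""
    candidates = original_text.split()  # whitespace split stands in for segmentation
    extended = []
    for i, target in enumerate(candidates):
        # Hypothetical call: rank dictionary words whose semantic label matches
        # the original text's label, given the target's context, and keep top n.
        replacements = bilstm.top_replacements(
            candidates, position=i, dictionary=dictionary, n=n
        )
        for word in replacements:
            extended.append(" ".join(candidates[:i] + [word] + candidates[i + 1:]))
    return extended
```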
  • In another aspect, a text data enhancement apparatus includes: a text acquisition unit for obtaining the original text; a word segmentation unit for performing word segmentation on the original text to obtain several candidate words; a replacement word acquisition unit for obtaining, for a target candidate word and based on its context information, N replacement words from a preset dictionary by using the bidirectional long short-term memory network model, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and a text generation unit for generating N first extended texts according to the N replacement words and the original text.
  • In another aspect, an electronic device includes a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.
  • In another aspect, a non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the text data enhancement method described above.
  • In the above technical solutions, by segmenting the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and each replacement word can be used to replace the corresponding candidate word to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text, improving the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application.
  • Figure 2 is a flowchart of a text data enhancement method disclosed in an embodiment of the present application.
  • Figure 3 is a flowchart of another text data enhancement method disclosed in an embodiment of the present application.
  • Figure 4 is a flowchart of yet another text data enhancement method disclosed in an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.
  • The implementation environment of this application may be an electronic device, such as a smartphone, a tablet computer, or a desktop computer.
  • Fig. 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application.
  • The apparatus 100 may be the aforementioned electronic device.
  • The device 100 may include one or more of the following components: a processing component 102, a memory 104, a power supply component 106, a multimedia component 108, an audio component 110, a sensor component 114, and a communication component 116.
  • The processing component 102 generally controls the overall operations of the device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • The processing component 102 may include one or more processors 118 to execute instructions to complete all or part of the steps of the methods below.
  • The processing component 102 may include one or more modules to facilitate the interaction between the processing component 102 and other components.
  • For example, the processing component 102 may include a multimedia module to facilitate the interaction between the multimedia component 108 and the processing component 102.
  • The memory 104 is configured to store various types of data to support operations in the device 100. Examples of these data include instructions for any application or method operating on the device 100.
  • The memory 104 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • The memory 104 also stores one or more modules, which are configured to be executed by the one or more processors 118 to complete all or part of the steps in the methods shown below.
  • The power supply component 106 provides power to the various components of the device 100.
  • The power supply component 106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 100.
  • The multimedia component 108 includes a screen that provides an output interface between the device 100 and the user.
  • The screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation.
  • The screen may also include an organic light-emitting display (OLED).
  • The audio component 110 is configured to output and/or input audio signals.
  • The audio component 110 includes a microphone (MIC).
  • The microphone is configured to receive external audio signals.
  • The received audio signal can be further stored in the memory 104 or sent via the communication component 116.
  • The audio component 110 further includes a speaker for outputting audio signals.
  • The sensor component 114 includes one or more sensors for providing the device 100 with state evaluations of various aspects.
  • The sensor component 114 can detect the open/closed state of the device 100 and the relative positioning of components.
  • The sensor component 114 can also detect a position change of the device 100 or a component of the device 100 and temperature changes of the device 100.
  • The sensor component 114 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
  • The communication component 116 is configured to facilitate wired or wireless communication between the apparatus 100 and other devices.
  • The device 100 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity).
  • The communication component 116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • The communication component 116 further includes a near-field communication (NFC) module to facilitate short-range communication.
  • The NFC module can be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth, and other technologies.
  • In an exemplary embodiment, the apparatus 100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the methods below.
  • FIG. 2 is a schematic flowchart of a text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 2, the text data enhancement method may include the following steps:
  • 201. Obtain the original text.
  • 202. Perform word segmentation on the original text to obtain several candidate words.
  • 203. For the target candidate word, based on the context information of the target candidate word, use a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary.
  • In this embodiment of the application, the target candidate word is any one of the above candidate words, and the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text; N is a positive integer whose value can be configured as needed, which is not specifically limited here.
  • For example, suppose the original text is "The actors are fantastic". After word segmentation, the four candidate words "the", "actors", "are", and "fantastic" are obtained. It can be understood that the replacement word for the position of any candidate word in the original text should be related to the order, part of speech, and meaning of all candidate words in the original text.
  • Taking the candidate word "actors" as an example, its context information includes the three candidate words "the", "are", and "fantastic". Following the word order of the original text, the candidate words "the", "actors", "are", and "fantastic" form an input sequence arranged in chronological order.
  • Using the bidirectional long short-term memory network model, with the candidate word "the" as forward input before "actors" and the candidate words "are" and "fantastic" as backward input after "actors", multiple replacement words such as "performances", "films", "movies", and "stories" can be obtained from the preset dictionary to replace the candidate word "actors" in the input sequence. To ensure semantic consistency, if the original text belongs to a positive semantic type and its semantic label is "positive", each replacement word obtained from the preset dictionary should also carry the semantic label "positive", so that the extended text generated by the replacement also belongs to the positive semantic type.
  • 204. Generate N first extended texts according to the N replacement words and the original text. For example, if the three replacement words "performances", "films", and "movies" are obtained for the candidate word "actors" in the original text "The actors are fantastic", then the three extended texts "The performances are fantastic", "The films are fantastic", and "The movies are fantastic" can be generated accordingly.
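  • As a concrete check of this example, a few lines of Python reproduce the three extended texts; the replacement list is taken directly from the description above.

```python
# Reconstructing the description's example for the candidate word "actors".
candidates = ["The", "actors", "are", "fantastic"]
replacements = ["performances", "films", "movies"]   # N = 3 replacement words

first_extended_texts = [
    " ".join(candidates[:1] + [w] + candidates[2:]) for w in replacements
]
print(first_extended_texts)
# ['The performances are fantastic', 'The films are fantastic',
#  'The movies are fantastic']
```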
  • FIG. 3 is a schematic flowchart of another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 3, the text data enhancement method may include the following steps:
  • 301. Obtain the original text.
  • 302. Perform word segmentation on the original text to obtain several candidate words.
  • 303. For the target candidate word, based on the word order information of the original text, forward-encode the context information of the target candidate word from left to right to obtain forward encoding information.
  • 304. Backward-encode the context information of the target candidate word from right to left to obtain backward encoding information.
  • In this embodiment, the context information of the target candidate word is forward-encoded mainly as follows: the candidate words included in the context information of the target candidate word are numbered forward from left to right; a forward word vector is generated according to each candidate word's forward number; and the forward word vectors are mapped into a forward word-vector mapping matrix using pre-trained word-vector parameters, which serves as the forward encoding information.
  • Similarly, the context information of the target candidate word is backward-encoded as follows: the candidate words included in the context information of the target candidate word are numbered backward from right to left; a backward word vector is generated according to each candidate word's backward number; and the backward word vectors are mapped into a backward word-vector mapping matrix using the pre-trained word-vector parameters, which serves as the backward encoding information.
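  • A minimal sketch of this encoding step, assuming a small illustrative vocabulary and a randomly initialized embedding matrix standing in for the pre-trained word-vector parameters:

```python
import numpy as np

# Illustrative vocabulary and a random embedding matrix standing in for the
# pre-trained word-vector parameters.
vocab = {"the": 0, "are": 1, "fantastic": 2}
embedding = np.random.rand(len(vocab), 50)      # assumed 50-dimensional vectors

context = ["the", "are", "fantastic"]           # context of the target "actors"

# Forward numbering, left to right, then lookup: the forward mapping matrix.
forward_ids = [vocab[w] for w in context]
forward_encoding = embedding[forward_ids]       # forward encoding information

# Backward numbering, right to left, then lookup: the backward mapping matrix.
backward_ids = [vocab[w] for w in reversed(context)]
backward_encoding = embedding[backward_ids]     # backward encoding information
```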
  • 305. Based on the forward encoding information and the backward encoding information, use the bidirectional long short-term memory network model to obtain N replacement words from the preset dictionary.
  • It can be seen that by implementing steps 303 to 305, forward-encoding and backward-encoding the context information of the target candidate word and feeding the resulting encoding information into the bidirectional long short-term memory network model allows the model, when predicting the replacement word for the position of the target candidate word, to fully take the context information into account, improving the semantic accuracy of the obtained replacement words.
  • As an optional implementation, step 305 may specifically include:
  • Based on the forward encoding information and the backward encoding information, the bidirectional long short-term memory network model predicts a predicted probability for each replacement word in the preset dictionary, where a replacement word is a word in the preset dictionary whose semantic label matches the semantic label corresponding to the original text; then, according to the predicted probability corresponding to each replacement word, all replacement words in the preset dictionary are sorted from largest to smallest, and the top N replacement words are obtained.
  • In this way, the bidirectional long short-term memory network model predicts, for every replacement word in the preset dictionary, the probability of it appearing at the position of the target candidate word, and filtering out the N highest-ranked replacement words based on these predicted probabilities further improves the semantic accuracy of the obtained replacement words and guarantees the quality of the generated extended text.
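  • A minimal sketch of this selection step; the probabilities and labels below are illustrative stand-ins for the model's actual outputs and the preset dictionary's actual labels:

```python
def top_n_replacements(predicted_probs, labels, text_label, n):
    """Keep dictionary words whose semantic label matches the original text,
    then rank by predicted probability and return the top n."""
    matching = {w: p for w, p in predicted_probs.items()
                if labels.get(w) == text_label}
    return sorted(matching, key=matching.get, reverse=True)[:n]

# Illustrative numbers; real values would come from the BiLSTM and dictionary.
predicted_probs = {"performances": 0.31, "films": 0.24,
                   "movies": 0.22, "sad": 0.05}
labels = {"performances": "positive", "films": "positive",
          "movies": "positive", "sad": "negative"}
print(top_n_replacements(predicted_probs, labels, "positive", 3))
# -> ['performances', 'films', 'movies']
```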
  • For example, the i+1 words (i being a positive integer) obtained after word segmentation of a text can form an input sequence (x_0, x_1, x_2, ..., x_i), and this input sequence can be input into the bidirectional long short-term memory network model. Likewise, the above candidate words can be combined into an input candidate-word sequence; the predicted probability of each replacement word can then be obtained, and N replacement words are selected from all the replacement words in the preset dictionary according to the predicted probability corresponding to each of them.
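  • For illustration, a PyTorch sketch of a bidirectional LSTM that scores dictionary words position by position; all hyperparameters (embedding size, hidden size, vocabulary size) are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class BiLSTMReplacer(nn.Module):
    """Scores every dictionary word at each position of the input sequence."""
    def __init__(self, vocab_size, embed_dim=50, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        return torch.softmax(self.out(h), dim=-1)    # per-position probabilities

model = BiLSTMReplacer(vocab_size=10000)
sequence = torch.randint(0, 10000, (1, 4))           # (x_0, x_1, x_2, x_3) as ids
probs = model(sequence)                              # shape (1, 4, 10000)
```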
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and used to replace the corresponding candidate words to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text and improves the semantic accuracy of text data enhancement. Moreover, selecting the N highest-ranked replacement words according to each replacement word's predicted probability also guarantees the quality of the generated extended text. In addition, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 4 is a schematic flowchart of yet another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 4, the text data enhancement method may include the following steps:
  • For the description of steps 401 to 406, please refer to the detailed description of steps 301 to 306 in the third embodiment, which is not repeated here.
  • In steps 407 to 409, for example, for a text meaning "Send me the information", it can be recognized that the language of the text is Chinese; the text is translated from Chinese into English to obtain the first translation "Send me the information"; the translation is then translated from English back into Chinese, yielding a new extended text meaning "Send the information to me". It can be seen that by implementing steps 407 to 409 and using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text, the second extended text can be guaranteed to be semantically consistent with the first extended text, and the generation of extended text can be broadened across multiple language types.
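  • A sketch of this back-translation step; `translate` is a hypothetical stand-in, since the source names no specific translation tool or API:

```python
def translate(text, src, dst):
    """Hypothetical stand-in for any machine-translation tool or API;
    the patent does not name a specific one."""
    raise NotImplementedError

def back_translate(first_extended_text, first_language, pivot_language):
    # Step 407: the first language of the extended text is recognized upstream
    # and passed in here as first_language.
    # Step 408: translate into a different language to obtain the first translation.
    first_translation = translate(first_extended_text,
                                  src=first_language, dst=pivot_language)
    # Step 409: translate back into the first language; the result is a
    # second extended text, semantically consistent but differently worded.
    return translate(first_translation, src=pivot_language, dst=first_language)
```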
  • For the target extended text, the random noise is trained through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained by training the random noise from the target extended text, where the target extended text is any one of the above N second extended texts.
  • By implementing steps 410 to 412 and using a generative adversarial network (GAN) built from a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second extended text, third extended texts whose data distribution is close to that of the second extended text can be generated. This is not limited by the constraints of human thinking and, on the basis of the existing extended text, can further expand a rich variety of new texts.
  • As an optional implementation, step 411 may specifically include:
  • Random noise is input into the generator to generate a sentence sample obtained by training the random noise. The sentence sample and the target extended text are input into the discriminator, so that the discriminator performs convolution and pooling operations on them, extracts the sentence-sample feature information of the sentence sample and the real-text feature information of the target extended text, and combines the two kinds of feature information to determine whether the sentence sample can be distinguished from the target extended text. The discrimination result output by the discriminator is then obtained. If the result indicates that the discriminator can distinguish the sentence sample from the target extended text, the loss function of the discriminator is obtained and input into the generator to generate a new sentence sample, and the step of inputting the sentence sample and the target extended text into the discriminator is executed again; otherwise, it is determined that the discriminator cannot distinguish the sentence sample from the target extended text.
  • Among them, the generator is a long short-term memory network model used to simulate the real data distribution of the target extended text, and the discriminator is a convolutional neural network model.
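  • A PyTorch sketch of the adversarial loop described above, under two stated assumptions: the generator emits continuous word-vector matrices (sidestepping discrete-token sampling, which the patent does not detail), and training stops once the discriminator's outputs hover near 0.5 for both real and generated inputs:

```python
import torch

def train_until_indistinguishable(generator, discriminator, real_texts,
                                  g_opt, d_opt, noise_dim=100, max_steps=10000):
    """Adversarial loop sketch: real_texts is a batch of target extended texts
    as word-vector matrices of shape (batch, k, T); the generator is assumed
    to map noise of shape (batch, noise_dim) to the same shape."""
    bce = torch.nn.BCELoss()
    samples = None
    for step in range(max_steps):
        noise = torch.randn(real_texts.size(0), noise_dim)
        samples = generator(noise)                      # sentence samples

        # Discriminator step: target extended texts vs. generated samples.
        d_real = discriminator(real_texts)
        d_fake = discriminator(samples.detach())
        d_loss = (bce(d_real, torch.ones_like(d_real))
                  + bce(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator step: the discriminator's loss signal drives new samples.
        d_out = discriminator(samples)
        g_loss = bce(d_out, torch.ones_like(d_out))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

        # Stop once the discriminator cannot tell the two apart (outputs ~0.5).
        if (abs(d_real.mean().item() - 0.5) < 0.05
                and abs(d_fake.mean().item() - 0.5) < 0.05):
            break
    return samples  # optimized sentence samples, i.e. candidate third extended texts
```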
  • The target extended text input to the discriminator can be expressed as a matrix X ∈ R^(k×T), where T is the length of the target extended text, each column of the matrix X consists of the word vector of a word in the target extended text, and k is the dimension of the word vectors.
  • The convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target extended text.
  • In the convolution layer, the discriminator applies the convolution kernel to consecutive words in the target extended text, and a max-pooling layer is then connected to extract the important features of the text, obtaining the real-text feature information of the target extended text.
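  • A PyTorch sketch of such a discriminator; treating the word-vector dimension k as input channels makes the 1D convolution span the full vector width while the kernel size covers h consecutive words. All dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

k, T, h, num_filters = 50, 20, 3, 64     # illustrative dimensions

discriminator = nn.Sequential(
    # 1D convolution over h consecutive words; the k-dimensional word vectors
    # are treated as input channels, so the kernel spans the full vector width.
    nn.Conv1d(in_channels=k, out_channels=num_filters, kernel_size=h),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),             # max pooling extracts salient features
    nn.Flatten(),
    nn.Linear(num_filters, 1),
    nn.Sigmoid(),                        # probability that the input is real text
)

X = torch.randn(1, k, T)                 # one target extended text, X in R^(k x T)
print(discriminator(X).shape)            # torch.Size([1, 1])
```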
  • When this is achieved, the data distribution of the sentence sample is close to the data distribution of the target extended text, and the optimized sentence sample is output as a third extended text, which also improves the semantic accuracy of the extended text.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text not only guarantees semantic consistency between the second extended text and the first extended text, but also broadens the generation of extended text across multiple language types. Finally, using the generative adversarial network based on the long short-term memory network model and the convolutional neural network model to simulate the data distribution of the second extended text and generate third extended texts close to that distribution is not limited by the constraints of human thinking and further expands a rich variety of new texts on the basis of the existing extended text.
  • FIG. 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device may include: a text acquisition unit 501, a word segmentation unit 502, a replacement word acquisition unit 503, and a text generation unit 504.
  • The text acquisition unit 501 is configured to acquire the original text; the word segmentation unit 502 is configured to perform word segmentation on the original text to obtain several candidate words; the replacement word acquisition unit 503 is configured to obtain, for the target candidate word and based on its context information, N replacement words from the preset dictionary by using a bidirectional long short-term memory network model, where the target candidate word is any one of the above candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and the text generation unit 504 is configured to generate N first extended texts based on the N replacement words and the original text.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement; and since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device shown in FIG. 6 is an optimization of the text data enhancement device shown in FIG. 5. Compared with the device shown in FIG. 5, in the text data enhancement device shown in FIG. 6:
  • The replacement word acquisition unit 503 includes: a forward encoding subunit 5031 configured to, for the target candidate word and based on the word order information of the original text, forward-encode the context information of the target candidate word from left to right to obtain forward encoding information; a backward encoding subunit 5032 configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and a replacement word acquisition subunit 5033 configured to obtain N replacement words from the preset dictionary by using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
  • The replacement word acquisition subunit 5033 includes: a prediction unit 50331 configured to predict, based on the forward encoding information and the backward encoding information and using the bidirectional long short-term memory network model, the predicted probability of each replacement word in the preset dictionary, where a replacement word is a word in the preset dictionary whose semantic label matches the semantic label corresponding to the original text; and an acquisition unit 50332 configured to sort all replacement words in the preset dictionary from largest to smallest according to the predicted probability corresponding to each replacement word and obtain the top N replacement words.
  • The text generation unit 504 is specifically configured to replace the target candidate word in the original text with each of the above N replacement words, based on the position information of the target candidate word in the original text, to generate N first extended texts.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement; selecting the N highest-ranked replacement words according to each replacement word's predicted probability also guarantees the quality of the generated extended text; and since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device shown in FIG. 7 is an optimization of the text data enhancement device shown in FIG. 6. Compared with the device shown in FIG. 6, the text data enhancement device shown in FIG. 7 may further include: a recognition unit 505, a first translation unit 506, a second translation unit 507, a noise generation unit 508, and a training unit 509.
  • The recognition unit 505 is used to recognize the first language corresponding to the N first extended texts.
  • The first translation unit 506 is used to translate the N first extended texts from the first language into another language different from the first language to obtain N first translations.
  • The second translation unit 507 is used to translate the above N first translations from the other language back into the first language to obtain N second extended texts.
  • The noise generation unit 508 is used to generate random noise.
  • The training unit 509 is used to, for the target extended text, train the random noise through the generator and the discriminator until the discriminator cannot distinguish the sentence samples obtained by training the random noise from the target extended text; the target extended text is any one of the above N second extended texts, the generator is a long short-term memory network model for simulating the real data distribution of the target extended text, and the discriminator is a convolutional neural network model.
  • The training unit 509 includes: a sample generation subunit 5091 configured to, for the target extended text, input random noise into the generator and generate a sentence sample obtained by training the random noise; a discrimination subunit 5092 configured to input the sentence sample and the target extended text into the discriminator, so that the discriminator performs convolution and pooling operations on them, extracts the sentence-sample feature information of the sentence sample and the real-text feature information of the target extended text, and combines the two kinds of feature information to determine whether the sentence sample can be distinguished from the target extended text; an acquisition subunit 5093 configured to obtain the discrimination result output by the discriminator; and a training subunit 5094 configured to, when the discrimination result indicates that the discriminator can distinguish the sentence sample from the target extended text, obtain the loss function of the discriminator and input it into the generator to generate a new sentence sample, triggering the discrimination subunit 5092 to execute again the step of inputting the sentence sample and the target extended text into the discriminator; otherwise, it is determined that the discriminator cannot distinguish the sentence sample from the target extended text.
  • In this embodiment, the target extended text input to the discriminator can be expressed as a matrix X ∈ R^(k×T), where T is the length of the target extended text, each column of the matrix X consists of the word vector of a word in the target extended text, and k is the dimension of the word vectors.
  • The convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target extended text.
  • In the convolution layer, the discriminator applies the convolution kernel to consecutive words in the target extended text, and a max-pooling layer is then connected to extract the important features of the text, obtaining the real-text feature information of the target extended text.
  • It can be seen that using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text not only ensures semantic consistency between the second extended text and the first extended text, but also broadens the generation of extended text across multiple language types; further, using the generative adversarial network composed of the long short-term memory network model and the convolutional neural network model to simulate the data distribution of the second extended text and generate third extended texts close to that distribution is not limited by the constraints of human thinking and further expands a rich variety of new texts on the basis of the existing extended text.
  • This application also provides an electronic device, which includes: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.
  • The electronic device may be the apparatus 100 shown in FIG. 1.
  • The present application also provides a non-volatile computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the text data enhancement method described above is implemented.

Abstract

A text data enhancement method and apparatus, and an electronic device, relating to the technical field of machine learning. The method comprises: acquiring original text (201); performing word segmentation on the original text to obtain several candidate words (202); for a target candidate word, acquiring N replacement words from a preset dictionary by using a bidirectional long short-term memory network model on the basis of context information of the target candidate word (203), wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text (204). The method can improve the semantic accuracy of text data enhancement.

Description

Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

This application claims priority to Chinese patent application No. 201910350209.3, filed on April 28, 2019 and entitled "A text data enhancement method and apparatus, and electronic device", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of machine learning technology, and in particular to a text data enhancement method and apparatus, an electronic device, and a non-volatile computer-readable storage medium.

Background

In the field of machine learning, data augmentation is an important means of expanding a training set. It is often used to generate more new data to train a model, so that the model is more accurate and generalizes better. The core point of data augmentation is to replace the original data with new data while ensuring that the new data and the original data belong to the same category. For data augmentation applied to images, this is easy to achieve: a new image obtained by horizontally flipping, randomly cropping, or adjusting the RGB channels of the original image still has the same content category as the original image.

The inventor of this application realized that for data augmentation applied to text, because the context within a text is interrelated, blindly reversing, truncating, or substituting parts of the original text will change its semantics, so the semantic accuracy of text data enhancement is low.

Summary of the invention

To solve the above technical problems, this application provides a text data enhancement method and apparatus, and an electronic device.

The technical solutions adopted in this application are as follows:

In one aspect, a text data enhancement method includes: obtaining original text; performing word segmentation on the original text to obtain several candidate words; for a target candidate word, based on the context information of the target candidate word, using a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text.

In another aspect, a text data enhancement apparatus includes: a text acquisition unit for obtaining the original text; a word segmentation unit for performing word segmentation on the original text to obtain several candidate words; a replacement word acquisition unit for obtaining, for a target candidate word and based on its context information, N replacement words from a preset dictionary by using a bidirectional long short-term memory network model, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and a text generation unit for generating N first extended texts according to the N replacement words and the original text.

In another aspect, an electronic device includes a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.

In another aspect, a non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the text data enhancement method described above.

The technical solutions provided by the embodiments of the present application may include the following beneficial effects:

In the above technical solutions, by segmenting the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and each replacement word can be used to replace the corresponding candidate word to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text, improving the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
Description of the drawings

The drawings herein are incorporated into and constitute a part of the specification, show embodiments consistent with this application, and are used together with the specification to explain the principles of this application.

Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application;

Figure 2 is a flowchart of a text data enhancement method disclosed in an embodiment of the present application;

Figure 3 is a flowchart of another text data enhancement method disclosed in an embodiment of the present application;

Figure 4 is a flowchart of yet another text data enhancement method disclosed in an embodiment of the present application;

Figure 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application;

Figure 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application;

Figure 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.

Detailed description

Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Embodiment one

The implementation environment of this application may be an electronic device, such as a smartphone, a tablet computer, or a desktop computer.

Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application. The apparatus 100 may be the aforementioned electronic device. As shown in Figure 1, the device 100 may include one or more of the following components: a processing component 102, a memory 104, a power supply component 106, a multimedia component 108, an audio component 110, a sensor component 114, and a communication component 116.

The processing component 102 generally controls the overall operations of the device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 102 may include one or more processors 118 to execute instructions to complete all or part of the steps of the methods below. In addition, the processing component 102 may include one or more modules to facilitate the interaction between the processing component 102 and other components; for example, it may include a multimedia module to facilitate the interaction between the multimedia component 108 and the processing component 102.

The memory 104 is configured to store various types of data to support operations in the device 100, for example instructions for any application or method operating on the device 100. The memory 104 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The memory 104 also stores one or more modules configured to be executed by the one or more processors 118 to complete all or part of the steps of the methods shown below.

The power supply component 106 provides power to the various components of the device 100 and may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 100.

The multimedia component 108 includes a screen that provides an output interface between the device 100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel; the touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. The screen may also include an organic light-emitting display (OLED).

The audio component 110 is configured to output and/or input audio signals. For example, the audio component 110 includes a microphone (MIC), which is configured to receive external audio signals when the device 100 is in an operation mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 104 or sent via the communication component 116. In some embodiments, the audio component 110 further includes a speaker for outputting audio signals.

The sensor component 114 includes one or more sensors for providing the device 100 with state evaluations of various aspects. For example, the sensor component 114 can detect the open/closed state of the device 100 and the relative positioning of components, and it can also detect a position change of the device 100 or a component of the device 100 and temperature changes of the device 100. In some embodiments, the sensor component 114 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 116 is configured to facilitate wired or wireless communication between the apparatus 100 and other devices. The device 100 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In an embodiment of the present application, the communication component 116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an embodiment of the present application, the communication component 116 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth, and other technologies.

In an exemplary embodiment, the apparatus 100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the methods below.
Embodiment two

Please refer to Figure 2, which is a schematic flowchart of a text data enhancement method disclosed in an embodiment of the present application. As shown in Figure 2, the text data enhancement method may include the following steps:

201. Obtain the original text.

202. Perform word segmentation on the original text to obtain several candidate words.

203. For the target candidate word, based on the context information of the target candidate word, use a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary.

In this embodiment of the application, the target candidate word is any one of the above candidate words, and the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text; N is a positive integer whose value can be configured as needed, which is not specifically limited here.

For example, suppose the original text is "The actors are fantastic". After word segmentation, the four candidate words "the", "actors", "are", and "fantastic" are obtained. It can be understood that the replacement word for the position of any candidate word in the original text should be related to the order, part of speech, and meaning of all candidate words in the original text. Taking the candidate word "actors" as an example, its context information includes the three candidate words "the", "are", and "fantastic". Following the word order of the original text, the candidate words "the", "actors", "are", and "fantastic" form an input sequence arranged in chronological order. Using the bidirectional long short-term memory network model, with the candidate word "the" as forward input before "actors" and the candidate words "are" and "fantastic" as backward input after "actors", multiple replacement words such as "performances", "films", "movies", and "stories" can be obtained from the preset dictionary to replace the candidate word "actors" in the input sequence. At the same time, to ensure semantic consistency, if the original text belongs to a positive semantic type and its semantic label is "positive", then the semantic label corresponding to each replacement word obtained from the preset dictionary should also be "positive", so that the extended text generated by replacing the corresponding candidate word also belongs to the positive semantic type.

Similarly, the above method applies to any of the candidate words "the", "are", and "fantastic", which will not be repeated here.

204. Generate N first extended texts according to the N replacement words and the original text.

In this embodiment of the application, for example, if the three replacement words "performances", "films", and "movies" are obtained for the candidate word "actors" in the original text "The actors are fantastic", then the three extended texts "The performances are fantastic", "The films are fantastic", and "The movies are fantastic" can be generated accordingly.

It can be seen that by implementing the method described in Figure 2 and dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and used to replace the corresponding candidate words to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text and improves the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 3
Please refer to FIG. 3, which is a schematic flowchart of another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 3, the text data enhancement method may include the following steps:
301. Obtain the original text.
302. Perform word segmentation processing on the original text to obtain several candidate words.
303. For the target candidate word, forward-encode the context information of the target candidate word from left to right based on the word order information of the original text to obtain forward encoding information.
304. Backward-encode the context information of the target candidate word from right to left to obtain backward encoding information.
In this embodiment of the present application, the context information of the target candidate word is forward-encoded mainly as follows: the candidate words included in the context information of the target candidate word are numbered forward from left to right; a forward word vector is generated according to the forward numbering information of each candidate word; and the forward word vectors are mapped into a forward word-vector mapping matrix using pre-trained word vector parameters, which serves as the forward encoding information.
Similarly, the context information of the target candidate word is backward-encoded as follows: the candidate words included in the context information of the target candidate word are numbered backward from right to left; a backward word vector is generated according to the backward numbering information of each candidate word; and the backward word vectors are mapped into a backward word-vector mapping matrix using the pre-trained word vector parameters, which serves as the backward encoding information.
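As an illustration of steps 303 and 304, the sketch below numbers the context words in each direction, looks each one up in a pre-trained embedding table, and stacks the resulting vectors into the forward and backward word-vector mapping matrices. NumPy, the toy vocabulary, and the random embedding table are all assumptions made for the sketch, not details fixed by the disclosure.

    import numpy as np

    embedding_table = np.random.rand(10000, 128)       # stand-in for pre-trained word vectors
    word_to_id = {"the": 0, "are": 1, "fantastic": 2}  # toy vocabulary (illustrative)

    def encode_context(context_words, backward=False):
        # Number the context words left-to-right (forward) or right-to-left
        # (backward), then map each word to its pre-trained vector; the rows
        # of the result form the word-vector mapping matrix.
        ordered = list(reversed(context_words)) if backward else list(context_words)
        ids = [word_to_id[word.lower()] for word in ordered]
        return embedding_table[ids]

    context = ["the", "are", "fantastic"]                            # context of "actors"
    forward_encoding_info = encode_context(context)                  # step 303
    backward_encoding_info = encode_context(context, backward=True)  # step 304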
305. Based on the forward encoding information and the backward encoding information, obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model.
It can be seen that, by implementing steps 303 to 305 above, the context information of the target candidate word is forward-encoded and backward-encoded separately, and the forward and backward encoding information is input into the bidirectional long short-term memory network model. When the model predicts the replacement word corresponding to the position of the target candidate word, the context information of the target candidate word is thus fully taken into account, improving the semantic accuracy of the obtained replacement words.
As an optional implementation, step 305 may specifically include:
predicting, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary using the bidirectional long short-term memory network model, where a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
It can be seen that, by implementing this optional implementation, the bidirectional long short-term memory network model can predict, for every replacement word in the preset dictionary, the probability of that word appearing at the position of the target candidate word, and filtering out the N top-ranked replacement words based on these predicted probabilities can further improve the semantic accuracy of the obtained replacement words and ensure the quality of the generated expanded text.
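A minimal sketch of the sorting-and-selection part of this optional implementation, assuming the model has already produced a predicted probability for each label-matched replacement word (all names and numbers are illustrative):

    def top_n_replacements(predicted_probabilities, n):
        # predicted_probabilities: dict mapping each replacement word in the
        # preset dictionary to its predicted probability. Returns the N words
        # with the highest probability, in descending order.
        ranked = sorted(predicted_probabilities.items(),
                        key=lambda item: item[1], reverse=True)
        return [word for word, _ in ranked[:n]]

    probabilities = {"performances": 0.31, "films": 0.24, "movies": 0.19, "stories": 0.08}
    print(top_n_replacements(probabilities, 3))  # ['performances', 'films', 'movies']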
In this embodiment of the present application, assume that the i+1 (i is a positive integer) words obtained after word segmentation of a text form an input sequence (x_0, x_1, x_2, ..., x_i), which can be input into the bidirectional long short-term memory network model. In the model, for any word x_t (t ∈ [0, i]) in the input sequence, after the context information of x_t is forward-encoded from left to right, the forward computation result s_t can be obtained from the forward encoding information and the word x_t using the formula s_t = f(U·x_t + W·s_{t-1}); after the context information of x_t is backward-encoded from right to left, the backward computation result s'_t can be obtained from the backward encoding information and the word x_t using the formula s'_t = f(U'·x_t + W'·s'_{t+1}); finally, substituting s_t and s'_t into the formula y_t = g(V·s_t + V'·s'_t) yields the predicted probability of the word x_t, where U, W, U', W', V, and V' are all parameters of the bidirectional long short-term memory network model.
Therefore, after word segmentation is performed on the original text to obtain several candidate words, these candidate words can form an input candidate-word sequence. By replacing a specific candidate word in the input candidate-word sequence with a replacement word from the preset dictionary and then feeding the replaced sequence into the above bidirectional long short-term memory network model, the predicted probability of that replacement word can be obtained, and N replacement words can then be filtered out of all replacement words in the preset dictionary according to the predicted probability corresponding to each of them.
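One direct way to realize the recurrences written above is the NumPy sketch below. The disclosure names f and g only abstractly, so tanh and softmax are assumptions here; note also that the formulas as given describe a simple bidirectional recurrent network, whereas a full LSTM cell would additionally contain input, forget, and output gates.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def bidirectional_probs(xs, U, W, Up, Wp, V, Vp, f=np.tanh):
        # xs: list of word vectors (x_0 .. x_i). Implements
        # s_t = f(U x_t + W s_{t-1}), s'_t = f(U' x_t + W' s'_{t+1}),
        # y_t = g(V s_t + V' s'_t), with g taken to be softmax.
        h = U.shape[0]
        s = [np.zeros(h) for _ in xs]    # forward results s_t
        sp = [np.zeros(h) for _ in xs]   # backward results s'_t
        for t in range(len(xs)):                       # left-to-right pass
            prev = s[t - 1] if t > 0 else np.zeros(h)
            s[t] = f(U @ xs[t] + W @ prev)
        for t in reversed(range(len(xs))):             # right-to-left pass
            nxt = sp[t + 1] if t + 1 < len(xs) else np.zeros(h)
            sp[t] = f(Up @ xs[t] + Wp @ nxt)
        return [softmax(V @ s[t] + Vp @ sp[t]) for t in range(len(xs))]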
306. Based on the position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the above N replacement words to generate N first expanded texts.
It can be seen that, by implementing the method described in FIG. 3, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, filtering out the N top-ranked replacement words from all replacement words based on the predicted probability corresponding to each replacement word also guarantees the quality of the generated expanded text. In addition, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 4
Please refer to FIG. 4, which is a schematic flowchart of yet another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 4, the text data enhancement method may include the following steps:
Steps 401 to 406; for the description of steps 401 to 406, please refer to the detailed description of steps 301 to 306 in Embodiment 3, and details are not repeated in this embodiment of the present application.
407. Identify the first language corresponding to the above N first expanded texts.
408. Translate the above N first expanded texts from the first language into another language different from the first language to obtain N first translations.
409. Translate the above N first translations from the other language back into the first language to obtain N second expanded texts.
In this embodiment of the present application, regarding steps 407 to 409, for example, for the text "你把资料发给我吧", the language of the text can be identified as Chinese; translating the text from Chinese into English yields the translation "Send me the information"; translating the translation from English back into Chinese then yields the new expanded text "把信息发给我". It can be seen that, by implementing steps 407 to 409 above, text data enhancement is performed on the first expanded texts using a translation tool to obtain the second expanded texts, which both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types.
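The round trip of steps 407 to 409 can be written as a short helper. The detect_language() and translate() functions below are hypothetical stand-ins for whatever language-identification and translation tool is used; the tiny phrase table merely replays the example from the text.

    def detect_language(text):
        # Toy language identifier: treat any CJK character as Chinese.
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

    def translate(text, src, dst):
        # Hypothetical stand-in for a real translation tool.
        phrase_table = {
            ("你把资料发给我吧", "zh", "en"): "Send me the information",
            ("Send me the information", "en", "zh"): "把信息发给我",
        }
        return phrase_table.get((text, src, dst), text)

    def back_translate(first_expanded_texts):
        second_expanded_texts = []
        for text in first_expanded_texts:
            first_language = detect_language(text)                # step 407
            other = "en" if first_language == "zh" else "zh"
            translation = translate(text, first_language, other)  # step 408
            second_expanded_texts.append(
                translate(translation, other, first_language))    # step 409
        return second_expanded_texts

    print(back_translate(["你把资料发给我吧"]))  # ['把信息发给我']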
410. Generate random noise.
411. For the target expanded text, train on the random noise through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained after training the random noise from the target expanded text.
In this embodiment of the present application, the target expanded text is any second expanded text among the above N second expanded texts.
412. Use the sentence samples as third expanded texts.
It can be seen that, by implementing steps 410 to 412 above, a generative adversarial network (GAN) based on a long short-term memory network model and a convolutional neural network model is used to simulate the data distribution of the second expanded texts and to generate third expanded texts whose data distribution is close to that of the second expanded texts. This is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
As an optional implementation, step 411 may specifically include:
for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise; inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs convolution and pooling operations on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text; obtaining the discrimination result output by the discriminator; and, if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining the loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
Here, the generator is a long short-term memory network model for simulating the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model. Taking the target expanded text as an example, the target expanded text input into the discriminator can be represented as a matrix X ∈ R^(k×T), where T is the length of the target expanded text, each column of the matrix X consists of the word vector of a word in the target expanded text, and k is the dimension of the word vectors. Optionally, the convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target expanded text. After the discriminator uses the convolution kernel to perform a convolution operation on consecutive words in the target expanded text in the convolutional layer, it is connected to a max-pooling layer for extracting the important features of the text, whereby the real-text feature information of the target expanded text can be obtained.
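The following PyTorch-style sketch puts this optional implementation together: an LSTM generator turns random noise into a k×T word-vector matrix (a sentence sample), the CNN discriminator described above convolves over consecutive words and max-pools, and the two are trained adversarially, with the discriminator's loss signal driving new sentence samples. All layer sizes, the BCE loss, and the Adam optimizers are assumptions; the disclosure does not fix them.

    import torch
    import torch.nn as nn

    K, T, NOISE_DIM = 128, 10, 100   # word-vector dim, text length, noise size (assumed)

    class SentenceGenerator(nn.Module):
        # LSTM generator: maps random noise to T word vectors, i.e. a
        # sentence sample imitating the target expanded text.
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(NOISE_DIM, K, batch_first=True)

        def forward(self, z):               # z: (batch, T, NOISE_DIM)
            out, _ = self.lstm(z)           # (batch, T, K)
            return out.transpose(1, 2)      # -> (batch, K, T) matrix X

    class TextDiscriminator(nn.Module):
        # CNN discriminator: 1D convolution over consecutive words of the
        # k x T word-vector matrix, then max pooling, then a real/fake score.
        def __init__(self, num_filters=64, window=3):
            super().__init__()
            self.conv = nn.Conv1d(K, num_filters, kernel_size=window)
            self.fc = nn.Linear(num_filters, 1)

        def forward(self, x):                          # x: (batch, K, T)
            features = torch.relu(self.conv(x))        # convolution operation
            pooled = features.max(dim=2).values        # max-pooling operation
            return torch.sigmoid(self.fc(pooled))      # probability of "real"

    generator, discriminator = SentenceGenerator(), TextDiscriminator()
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCELoss()

    real = torch.randn(8, K, T)     # stand-in for target expanded texts as matrices
    for step in range(100):         # loop until the discriminator is fooled
        noise = torch.randn(8, T, NOISE_DIM)                   # step 410
        fake = generator(noise)                                # sentence samples
        d_loss = bce(discriminator(real), torch.ones(8, 1)) + \
                 bce(discriminator(fake.detach()), torch.zeros(8, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        g_loss = bce(discriminator(fake), torch.ones(8, 1))    # loss drives new samples
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()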
It can be seen that, by implementing this optional implementation, the generator and the discriminator are trained continuously so that the data distribution of the sentence samples approaches the data distribution of the target expanded text, and the optimized sentence samples are output as third expanded texts, which can also improve the semantic accuracy of the expanded texts.
It can be seen that, by implementing the method described in FIG. 4, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, performing text data enhancement on the first expanded texts using a translation tool to obtain the second expanded texts both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types. Furthermore, using a generative adversarial network based on a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second expanded texts and to generate third expanded texts close to that data distribution is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
Embodiment 5
Please refer to FIG. 5, which is a schematic structural diagram of a text data enhancement apparatus disclosed in an embodiment of the present invention. As shown in FIG. 5, the text data enhancement apparatus may include: a text obtaining unit 501, a word segmentation unit 502, a replacement word obtaining unit 503, and a text generating unit 504, where the text obtaining unit 501 is configured to obtain original text; the word segmentation unit 502 is configured to perform word segmentation processing on the original text to obtain several candidate words; the replacement word obtaining unit 503 is configured to, for a target candidate word, obtain N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on the context information of the target candidate word, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and the text generating unit 504 is configured to generate N first expanded texts according to the above N replacement words and the original text.
It can be seen that, by implementing the apparatus described in FIG. 5, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 6
Please refer to FIG. 6, which is a schematic structural diagram of another text data enhancement apparatus disclosed in an embodiment of the present invention. The text data enhancement apparatus shown in FIG. 6 is obtained by optimizing the text data enhancement apparatus shown in FIG. 5. Compared with the apparatus shown in FIG. 5, in the text data enhancement apparatus shown in FIG. 6:
the replacement word obtaining unit 503 includes: a forward encoding subunit 5031, configured to, for the target candidate word, forward-encode the context information of the target candidate word from left to right based on the word order information of the original text to obtain forward encoding information; a backward encoding subunit 5032, configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and a replacement word obtaining subunit 5033, configured to obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
As an optional implementation, the replacement word obtaining subunit 5033 includes: a prediction unit 50331, configured to predict, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary using the bidirectional long short-term memory network model, where a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and an obtaining unit 50332, configured to sort all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtain the N replacement words ranked in the top N positions. The text generating unit 504 is specifically configured to, based on the position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the above N replacement words to generate N first expanded texts.
It can be seen that, by implementing the apparatus described in FIG. 6, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, filtering out the N top-ranked replacement words from all replacement words based on the predicted probability corresponding to each replacement word also guarantees the quality of the generated expanded text. In addition, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 7
Please refer to FIG. 7, which is a schematic structural diagram of yet another text data enhancement apparatus disclosed in an embodiment of the present invention. The text data enhancement apparatus shown in FIG. 7 is obtained by optimizing the text data enhancement apparatus shown in FIG. 6. Compared with the apparatus shown in FIG. 6, the text data enhancement apparatus shown in FIG. 7 may further include: a recognition unit 505, a first translation unit 506, a second translation unit 507, a noise generating unit 508, and a training unit 509, where the recognition unit 505 is configured to identify the first language corresponding to the above N first expanded texts; the first translation unit 506 is configured to translate the N first expanded texts from the first language into another language different from the first language to obtain N first translations; the second translation unit 507 is configured to translate the N first translations from the other language back into the first language to obtain N second expanded texts; the noise generating unit 508 is configured to generate random noise; and the training unit 509 is configured to, for a target expanded text, train on the random noise through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained after training the random noise from the target expanded text, where the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and to use the sentence samples as third expanded texts.
As an optional implementation, the training unit 509 includes: a sample generating subunit 5091, configured to, for the target expanded text, input the random noise into the generator to generate a sentence sample obtained after training the random noise; a discrimination subunit 5092, configured to input the sentence sample and the target expanded text into the discriminator, so that the discriminator performs convolution and pooling operations on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text; an obtaining subunit 5093, configured to obtain the discrimination result output by the discriminator; and a training subunit 5094, configured to, when the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtain the loss function of the discriminator and input the loss function into the generator to generate a new sentence sample, so as to trigger the discrimination subunit 5092 to execute the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determine that the discriminator cannot distinguish the sentence sample from the target expanded text, and use the sentence sample as the third expanded text.
In this embodiment of the present invention, taking the target expanded text as an example, the target expanded text input into the discriminator can be represented as a matrix X ∈ R^(k×T), where T is the length of the target expanded text, each column of the matrix X consists of the word vector of a word in the target expanded text, and k is the dimension of the word vectors. Optionally, the convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target expanded text. After the discriminator uses the convolution kernel to perform a convolution operation on consecutive words in the target expanded text in the convolutional layer, it is connected to a max-pooling layer for extracting the important features of the text, whereby the real-text feature information of the target expanded text can be obtained.
It can be seen that, by implementing the apparatus described in FIG. 7, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, performing text data enhancement on the first expanded texts using a translation tool to obtain the second expanded texts both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types. Furthermore, using a generative adversarial network based on a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second expanded texts and to generate third expanded texts close to that data distribution is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
The present application further provides an electronic device, the electronic device including:
a processor; and
a memory storing computer-readable instructions which, when executed by the processor, implement the text data enhancement method described above. The electronic device may be the apparatus 100 shown in FIG. 1.
In an exemplary embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the text data enhancement method described above is implemented.
It should be understood that the present application is not limited to the precise structures that have been described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (28)

1. A text data enhancement method, comprising:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
2. The method according to claim 1, wherein the obtaining, for a target candidate word, N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word comprises:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
3. The method according to claim 2, wherein the obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information comprises:
    predicting a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
4. The method according to claim 1, wherein after the generating N first expanded texts according to the N replacement words and the original text, the method further comprises:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    translating the N first translations from the other language into the first language to obtain N second expanded texts.
5. The method according to claim 4, wherein after the obtaining N second expanded texts, the method further comprises:
    generating random noise;
    for a target expanded text, training on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and
    using the sentence sample as a third expanded text.
6. The method according to claim 5, wherein the training, for a target expanded text, on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text comprises:
    for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    obtaining a discrimination result output by the discriminator; and
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
7. The method according to any one of claims 1 to 6, wherein the generating N first expanded texts according to the N replacement words and the original text comprises:
    replacing, based on position information of the target candidate word in the original text, the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
8. A text data enhancement apparatus, comprising:
    a text obtaining unit, configured to obtain original text;
    a word segmentation unit, configured to perform word segmentation processing on the original text to obtain several candidate words;
    a replacement word obtaining unit, configured to, for a target candidate word, obtain N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    a text generating unit, configured to generate N first expanded texts according to the N replacement words and the original text.
9. The apparatus according to claim 8, wherein the replacement word obtaining unit comprises:
    a forward encoding subunit, configured to, for the target candidate word, forward-encode the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    a backward encoding subunit, configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and
    a replacement word obtaining subunit, configured to obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
10. The apparatus according to claim 9, wherein the replacement word obtaining subunit comprises:
    a prediction unit, configured to predict a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    an obtaining unit, configured to sort all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtain the N replacement words ranked in the top N positions.
11. The apparatus according to claim 8, wherein the apparatus further comprises:
    a recognition unit, configured to identify a first language corresponding to the N first expanded texts;
    a first translation unit, configured to translate the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    a second translation unit, configured to translate the N first translations from the other language into the first language to obtain N second expanded texts.
12. The apparatus according to claim 11, wherein the apparatus further comprises:
    a noise generating unit, configured to generate random noise; and
    a training unit, configured to, for a target expanded text, train on the random noise through a generating unit and a discriminating unit until the discriminating unit cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generating unit is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminating unit is a convolutional neural network model; and to use the sentence sample as a third expanded text.
13. The apparatus according to claim 12, wherein the training unit comprises:
    a sample generating subunit, configured to, for the target expanded text, input the random noise into the generating unit to generate a sentence sample obtained after training the random noise;
    a discrimination subunit, configured to input the sentence sample and the target expanded text into the discriminating unit, so that the discriminating unit performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    an obtaining subunit, configured to obtain a discrimination result output by the discriminating unit; and
    a training subunit, configured to, when the discrimination result indicates that the discriminating unit can distinguish the sentence sample from the target expanded text, obtain a loss function of the discriminating unit and input the loss function into the generating unit to generate a new sentence sample, and execute the step of inputting the sentence sample and the target expanded text into the discriminating unit; otherwise, determine that the discriminating unit cannot distinguish the sentence sample from the target expanded text.
14. The apparatus according to any one of claims 8 to 13, wherein the text generating unit is configured to, based on position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
15. An electronic device, comprising:
    a processor; and
    a memory storing computer-readable instructions which, when executed by the processor, configure the processor to implement the following steps:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
16. The electronic device according to claim 15, wherein, for the obtaining, for a target candidate word, of N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, the processor is configured to implement the following steps:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
17. The electronic device according to claim 16, wherein, for the obtaining of N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, the processor is configured to implement the following steps:
    predicting a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
18. The electronic device according to claim 15, wherein, after the generating N first expanded texts according to the N replacement words and the original text, the processor is further configured to implement the following steps:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    translating the N first translations from the other language into the first language to obtain N second expanded texts.
19. The electronic device according to claim 18, wherein, after the obtaining N second expanded texts, the processor is further configured to implement the following steps:
    generating random noise;
    for a target expanded text, training on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and
    using the sentence sample as a third expanded text.
20. The electronic device according to claim 19, wherein, for the training, for a target expanded text, on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, the processor is configured to implement the following steps:
    for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    obtaining a discrimination result output by the discriminator; and
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
21. The electronic device according to any one of claims 15 to 20, wherein, for the generating N first expanded texts according to the N replacement words and the original text, the processor is configured to implement the following step:
    replacing, based on position information of the target candidate word in the original text, the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
22. A non-volatile computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the processor is configured to implement the following steps:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
23. The non-volatile computer-readable storage medium according to claim 22, wherein, for the obtaining, for a target candidate word, of N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, the processor is configured to implement the following steps:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
  24. The computer non-volatile readable storage medium according to claim 23, wherein, for said obtaining the N replacement words from the preset dictionary by using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, the processor is configured to implement the following steps:
    predicting, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary by using the bidirectional long short-term memory network model, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text;
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N top-ranked replacement words.
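The ranking step of claim 24 reduces to a softmax over the model's output followed by a top-N selection. A minimal sketch, assuming PyTorch and a `dictionary` list already restricted to words whose semantic label matches the original text:

```python
import torch

def top_n_replacements(logits, dictionary, n):
    """logits: BiLSTM scores over the preset dictionary."""
    probs = torch.softmax(logits, dim=-1)        # predicted probability per word
    top_probs, top_idx = torch.topk(probs, k=n)  # topk sorts in descending order
    return [dictionary[i] for i in top_idx.tolist()]
```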
  25. The computer non-volatile readable storage medium according to claim 22, wherein, after said generating N first expanded texts according to the N replacement words and the original text, the processor is further configured to implement the following steps:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language, to obtain N first translations;
    translating the N first translations from the other language into the first language, to obtain N second expanded texts.
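The round trip of claim 25 is commonly known as back-translation. The sketch below is illustrative only: `detect_language` and `translate` are hypothetical stand-ins for whatever language-identification and machine-translation services an implementation would call.

```python
def back_translate(first_expanded_texts, pivot_language="en"):
    first_language = detect_language(first_expanded_texts[0])  # e.g. "zh"
    # First pass: first language -> pivot language (the N first translations)
    first_translations = [translate(t, src=first_language, dst=pivot_language)
                          for t in first_expanded_texts]
    # Second pass: pivot language -> first language (the N second expanded texts)
    return [translate(t, src=pivot_language, dst=first_language)
            for t in first_translations]
```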
  26. The computer non-volatile readable storage medium according to claim 25, wherein, after said obtaining N second expanded texts, the processor is further configured to implement the following steps:
    generating random noise;
    for a target expanded text, training the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any one of the N second expanded texts, the generator is a long short-term memory network model used to simulate the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model;
    using the sentence sample as a third expanded text.
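Claim 26 names two concrete model families: an LSTM generator that maps random noise to a sentence sample, and a CNN discriminator built from convolution and pooling. A sketch of both follows, assuming PyTorch; every layer size is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """LSTM that unrolls a noise vector into per-position word scores."""
    def __init__(self, noise_dim=64, hidden_dim=128, vocab_size=10000, seq_len=20):
        super().__init__()
        self.seq_len = seq_len
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, noise):  # noise: (batch, noise_dim)
        steps = noise.unsqueeze(1).repeat(1, self.seq_len, 1)
        h, _ = self.lstm(steps)
        return self.out(h)     # (batch, seq_len, vocab_size)

class TextDiscriminator(nn.Module):
    """CNN that convolves and pools text features before scoring real/fake."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, 32, kernel_size=3)  # convolution op
        self.pool = nn.AdaptiveMaxPool1d(1)                  # pooling op
        self.fc = nn.Linear(32, 1)

    def forward(self, embedded):  # embedded: (batch, embed_dim, seq_len)
        features = self.pool(torch.relu(self.conv(embedded))).squeeze(-1)
        return self.fc(features)  # real/fake score
```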
  27. The computer non-volatile readable storage medium according to claim 26, wherein, for said training the random noise for the target expanded text through the generator and the discriminator until the discriminator cannot distinguish the sentence sample obtained after training the random noise from the target expanded text, the processor is configured to implement the following steps:
    for the target expanded text, inputting the random noise into the generator to generate the sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence sample feature information of the sentence sample and real text feature information of the target expanded text, and, by combining the sentence sample feature information and the real text feature information, determines whether the sentence sample and the target expanded text can be distinguished;
    obtaining a discrimination result output by the discriminator;
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and returning to the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
  28. The computer non-volatile readable storage medium according to any one of claims 22 to 27, wherein, for said generating N first expanded texts according to the N replacement words and the original text, the processor is configured to implement the following step:
    replacing the target candidate word in the original text with each of the N replacement words, based on position information of the target candidate word in the original text, to generate the N first expanded texts.
PCT/CN2019/117663 2019-04-28 2019-11-12 Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium WO2020220636A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910350209.3 2019-04-28
CN201910350209.3A CN110222707A (en) 2019-04-28 2019-04-28 A kind of text data Enhancement Method and device, electronic equipment

Publications (1)

Publication Number Publication Date
WO2020220636A1 (en)

Family

ID=67820173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117663 WO2020220636A1 (en) 2019-04-28 2019-11-12 Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110222707A (en)
WO (1) WO2020220636A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632232A (en) * 2021-03-09 2021-04-09 北京世纪好未来教育科技有限公司 Text matching method, device, equipment and medium
CN112883724A (en) * 2021-02-03 2021-06-01 虎博网络技术(上海)有限公司 Text data enhancement processing method and device, electronic equipment and readable storage medium

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment
CN112487182B (en) * 2019-09-12 2024-04-12 华为技术有限公司 Training method of text processing model, text processing method and device
CN110782002B (en) * 2019-09-12 2022-04-05 成都四方伟业软件股份有限公司 LSTM neural network training method and device
CN111027312B (en) * 2019-12-12 2024-04-19 中金智汇科技有限责任公司 Text expansion method and device, electronic equipment and readable storage medium
CN111444326B (en) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN111695356A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Synonym corpus generation method, synonym corpus generation device, computer system and readable storage medium
CN111694826B (en) * 2020-05-29 2024-03-19 平安科技(深圳)有限公司 Data enhancement method and device based on artificial intelligence, electronic equipment and medium
CN111783451A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for enhancing text samples
CN112183074A (en) * 2020-09-27 2021-01-05 中国建设银行股份有限公司 Data enhancement method, device, equipment and medium
CN112906392B (en) * 2021-03-23 2022-04-01 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN113657093A (en) * 2021-07-12 2021-11-16 广东外语外贸大学 Grammar error correction data enhancement method and device based on real error mode
CN113627149A (en) * 2021-08-10 2021-11-09 华南师范大学 Classroom conversation evaluation method, system and storage medium
CN113779959B (en) * 2021-08-31 2023-06-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN114595327A (en) * 2022-02-22 2022-06-07 平安科技(深圳)有限公司 Data enhancement method and device, electronic equipment and storage medium
CN114912448B (en) * 2022-07-15 2022-12-09 山东海量信息技术研究院 Text extension method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100268725A1 (en) * 2009-04-20 2010-10-21 Microsoft Corporation Acquisition of semantic class lexicons for query tagging
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship
CN109522406A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Text semantic matching process, device, computer equipment and storage medium
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635116B (en) * 2018-12-17 2023-03-24 腾讯科技(深圳)有限公司 Training method of text word vector model, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
CN110222707A (en) 2019-09-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927188

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927188

Country of ref document: EP

Kind code of ref document: A1