WO2020220636A1 - Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium - Google Patents

Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

Info

Publication number
WO2020220636A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
target
word
replacement
words
Prior art date
2019-04-28
Application number
PCT/CN2019/117663
Other languages
French (fr)
Chinese (zh)
Inventor
于凤英
王健宗
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
2019-04-28
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020220636A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Definitions

  • This application relates to the field of machine learning technology, and in particular to a text data enhancement method and apparatus, an electronic device, and a non-volatile computer-readable storage medium.
  • In the field of machine learning, data augmentation is an important means of expanding a training set. It is often used to generate more new data to train a model, so that the model is more accurate and generalizes better.
  • The core point of data augmentation is to replace the original data with new data while ensuring that the new data and the original data belong to the same category. For data augmentation applied to images, this is easy to achieve: a new image obtained by horizontally flipping, randomly cropping, or adjusting the RGB channels of the original image still has the same content category as the original image.
  • The inventor of this application realized that for data augmentation applied to text, because the context within a text is interrelated, blindly reversing, truncating, or substituting parts of the original text will change its semantics, so the semantic accuracy of text data enhancement is low.
  • To solve this problem, this application provides a text data enhancement method and apparatus, and an electronic device.
  • In one aspect, a text data enhancement method includes: obtaining original text; performing word segmentation on the original text to obtain several candidate words; for a target candidate word, based on the context information of the target candidate word, using a bidirectional long short-term memory (BiLSTM) network model to obtain N replacement words from a preset dictionary, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text.
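  • For illustration only, the following Python sketch outlines this flow under stated assumptions: whitespace splitting stands in for real word segmentation, and `bilstm.top_replacements` is a hypothetical interface for the BiLSTM-based lookup described above; neither name comes from the source.

```python
# Minimal sketch of the claimed pipeline; "bilstm" and its method
# "top_replacements" are hypothetical stand-ins, not named in the patent.

def augment(original_text, dictionary, bilstm, n):
    """Generate first extended texts by replacing one candidate word at a time."""
    candidates = original_text.split()  # whitespace split stands in for segmentation
    extended = []
    for i, target in enumerate(candidates):
        # Hypothetical call: rank dictionary words whose semantic label matches
        # the original text's label, given the target's context, and keep top n.
        replacements = bilstm.top_replacements(
            candidates, position=i, dictionary=dictionary, n=n
        )
        for word in replacements:
            extended.append(" ".join(candidates[:i] + [word] + candidates[i + 1:]))
    return extended
```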
  • In another aspect, a text data enhancement apparatus includes: a text acquisition unit for obtaining the original text; a word segmentation unit for performing word segmentation on the original text to obtain several candidate words; a replacement word acquisition unit for obtaining, for a target candidate word and based on its context information, N replacement words from a preset dictionary by using the bidirectional long short-term memory network model, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and a text generation unit for generating N first extended texts according to the N replacement words and the original text.
  • In another aspect, an electronic device includes a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.
  • In another aspect, a non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the text data enhancement method described above.
  • In the above technical solutions, by segmenting the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and each replacement word can be used to replace the corresponding candidate word to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text, improving the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application.
  • Figure 2 is a flowchart of a text data enhancement method disclosed in an embodiment of the present application.
  • Figure 3 is a flowchart of another text data enhancement method disclosed in an embodiment of the present application.
  • Figure 4 is a flowchart of yet another text data enhancement method disclosed in an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.
  • The implementation environment of this application may be an electronic device, such as a smartphone, a tablet computer, or a desktop computer.
  • Fig. 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application.
  • The apparatus 100 may be the aforementioned electronic device.
  • The device 100 may include one or more of the following components: a processing component 102, a memory 104, a power supply component 106, a multimedia component 108, an audio component 110, a sensor component 114, and a communication component 116.
  • The processing component 102 generally controls the overall operations of the device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • The processing component 102 may include one or more processors 118 to execute instructions to complete all or part of the steps of the methods below.
  • The processing component 102 may include one or more modules to facilitate the interaction between the processing component 102 and other components.
  • For example, the processing component 102 may include a multimedia module to facilitate the interaction between the multimedia component 108 and the processing component 102.
  • The memory 104 is configured to store various types of data to support operations in the device 100. Examples of these data include instructions for any application or method operating on the device 100.
  • The memory 104 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • The memory 104 also stores one or more modules, which are configured to be executed by the one or more processors 118 to complete all or part of the steps in the methods shown below.
  • The power supply component 106 provides power to the various components of the device 100.
  • The power supply component 106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 100.
  • The multimedia component 108 includes a screen that provides an output interface between the device 100 and the user.
  • The screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation.
  • The screen may also include an organic light-emitting display (OLED).
  • The audio component 110 is configured to output and/or input audio signals.
  • The audio component 110 includes a microphone (MIC).
  • The microphone is configured to receive external audio signals.
  • The received audio signal can be further stored in the memory 104 or sent via the communication component 116.
  • The audio component 110 further includes a speaker for outputting audio signals.
  • The sensor component 114 includes one or more sensors for providing the device 100 with state evaluations of various aspects.
  • The sensor component 114 can detect the open/closed state of the device 100 and the relative positioning of components.
  • The sensor component 114 can also detect a position change of the device 100 or a component of the device 100 and temperature changes of the device 100.
  • The sensor component 114 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.
  • The communication component 116 is configured to facilitate wired or wireless communication between the apparatus 100 and other devices.
  • The device 100 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity).
  • The communication component 116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • The communication component 116 further includes a near-field communication (NFC) module to facilitate short-range communication.
  • The NFC module can be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth, and other technologies.
  • In an exemplary embodiment, the apparatus 100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the methods below.
  • FIG. 2 is a schematic flowchart of a text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 2, the text data enhancement method may include the following steps:
  • 201. Obtain the original text.
  • 202. Perform word segmentation on the original text to obtain several candidate words.
  • 203. For the target candidate word, based on the context information of the target candidate word, use a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary.
  • In this embodiment of the application, the target candidate word is any one of the above candidate words, and the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text; N is a positive integer whose value can be configured as needed, which is not specifically limited here.
  • For example, suppose the original text is "The actors are fantastic". After word segmentation, the four candidate words "the", "actors", "are", and "fantastic" are obtained. It can be understood that the replacement word for the position of any candidate word in the original text should be related to the order, part of speech, and meaning of all candidate words in the original text.
  • Taking the candidate word "actors" as an example, its context information includes the three candidate words "the", "are", and "fantastic". Following the word order of the original text, the candidate words "the", "actors", "are", and "fantastic" form an input sequence arranged in chronological order.
  • Using the bidirectional long short-term memory network model, with the candidate word "the" as forward input before "actors" and the candidate words "are" and "fantastic" as backward input after "actors", multiple replacement words such as "performances", "films", "movies", and "stories" can be obtained from the preset dictionary to replace the candidate word "actors" in the input sequence. To ensure semantic consistency, if the original text belongs to a positive semantic type and its semantic label is "positive", each replacement word obtained from the preset dictionary should also carry the semantic label "positive", so that the extended text generated by the replacement also belongs to the positive semantic type.
  • 204. Generate N first extended texts according to the N replacement words and the original text. For example, if the three replacement words "performances", "films", and "movies" are obtained for the candidate word "actors" in the original text "The actors are fantastic", then the three extended texts "The performances are fantastic", "The films are fantastic", and "The movies are fantastic" can be generated accordingly.
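  • As a concrete check of this example, a few lines of Python reproduce the three extended texts; the replacement list is taken directly from the description above.

```python
# Reconstructing the description's example for the candidate word "actors".
candidates = ["The", "actors", "are", "fantastic"]
replacements = ["performances", "films", "movies"]   # N = 3 replacement words

first_extended_texts = [
    " ".join(candidates[:1] + [w] + candidates[2:]) for w in replacements
]
print(first_extended_texts)
# ['The performances are fantastic', 'The films are fantastic',
#  'The movies are fantastic']
```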
  • FIG. 3 is a schematic flowchart of another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 3, the text data enhancement method may include the following steps:
  • 301. Obtain the original text.
  • 302. Perform word segmentation on the original text to obtain several candidate words.
  • 303. For the target candidate word, based on the word order information of the original text, forward-encode the context information of the target candidate word from left to right to obtain forward encoding information.
  • 304. Backward-encode the context information of the target candidate word from right to left to obtain backward encoding information.
  • In this embodiment, the context information of the target candidate word is forward-encoded mainly as follows: the candidate words included in the context information of the target candidate word are numbered forward from left to right; a forward word vector is generated according to each candidate word's forward number; and the forward word vectors are mapped into a forward word-vector mapping matrix using pre-trained word-vector parameters, which serves as the forward encoding information.
  • Similarly, the context information of the target candidate word is backward-encoded as follows: the candidate words included in the context information of the target candidate word are numbered backward from right to left; a backward word vector is generated according to each candidate word's backward number; and the backward word vectors are mapped into a backward word-vector mapping matrix using the pre-trained word-vector parameters, which serves as the backward encoding information.
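  • A minimal sketch of this encoding step, assuming a small illustrative vocabulary and a randomly initialized embedding matrix standing in for the pre-trained word-vector parameters:

```python
import numpy as np

# Illustrative vocabulary and a random embedding matrix standing in for the
# pre-trained word-vector parameters.
vocab = {"the": 0, "are": 1, "fantastic": 2}
embedding = np.random.rand(len(vocab), 50)      # assumed 50-dimensional vectors

context = ["the", "are", "fantastic"]           # context of the target "actors"

# Forward numbering, left to right, then lookup: the forward mapping matrix.
forward_ids = [vocab[w] for w in context]
forward_encoding = embedding[forward_ids]       # forward encoding information

# Backward numbering, right to left, then lookup: the backward mapping matrix.
backward_ids = [vocab[w] for w in reversed(context)]
backward_encoding = embedding[backward_ids]     # backward encoding information
```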
  • 305. Based on the forward encoding information and the backward encoding information, use the bidirectional long short-term memory network model to obtain N replacement words from the preset dictionary.
  • It can be seen that by implementing steps 303 to 305, forward-encoding and backward-encoding the context information of the target candidate word and feeding the resulting encoding information into the bidirectional long short-term memory network model allows the model, when predicting the replacement word for the position of the target candidate word, to fully take the context information into account, improving the semantic accuracy of the obtained replacement words.
  • As an optional implementation, step 305 may specifically include:
  • Based on the forward encoding information and the backward encoding information, the bidirectional long short-term memory network model predicts a predicted probability for each replacement word in the preset dictionary, where a replacement word is a word in the preset dictionary whose semantic label matches the semantic label corresponding to the original text; then, according to the predicted probability corresponding to each replacement word, all replacement words in the preset dictionary are sorted from largest to smallest, and the top N replacement words are obtained.
  • In this way, the bidirectional long short-term memory network model predicts, for every replacement word in the preset dictionary, the probability of it appearing at the position of the target candidate word, and filtering out the N highest-ranked replacement words based on these predicted probabilities further improves the semantic accuracy of the obtained replacement words and guarantees the quality of the generated extended text.
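  • A minimal sketch of this selection step; the probabilities and labels below are illustrative stand-ins for the model's actual outputs and the preset dictionary's actual labels:

```python
def top_n_replacements(predicted_probs, labels, text_label, n):
    """Keep dictionary words whose semantic label matches the original text,
    then rank by predicted probability and return the top n."""
    matching = {w: p for w, p in predicted_probs.items()
                if labels.get(w) == text_label}
    return sorted(matching, key=matching.get, reverse=True)[:n]

# Illustrative numbers; real values would come from the BiLSTM and dictionary.
predicted_probs = {"performances": 0.31, "films": 0.24,
                   "movies": 0.22, "sad": 0.05}
labels = {"performances": "positive", "films": "positive",
          "movies": "positive", "sad": "negative"}
print(top_n_replacements(predicted_probs, labels, "positive", 3))
# -> ['performances', 'films', 'movies']
```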
  • For example, the i+1 words (i being a positive integer) obtained after word segmentation of a text can form an input sequence (x_0, x_1, x_2, ..., x_i), and this input sequence can be input into the bidirectional long short-term memory network model. Likewise, the above candidate words can be combined into an input candidate-word sequence; the predicted probability of each replacement word can then be obtained, and N replacement words are selected from all the replacement words in the preset dictionary according to the predicted probability corresponding to each of them.
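  • For illustration, a PyTorch sketch of a bidirectional LSTM that scores dictionary words position by position; all hyperparameters (embedding size, hidden size, vocabulary size) are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class BiLSTMReplacer(nn.Module):
    """Scores every dictionary word at each position of the input sequence."""
    def __init__(self, vocab_size, embed_dim=50, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        return torch.softmax(self.out(h), dim=-1)    # per-position probabilities

model = BiLSTMReplacer(vocab_size=10000)
sequence = torch.randint(0, 10000, (1, 4))           # (x_0, x_1, x_2, x_3) as ids
probs = model(sequence)                              # shape (1, 4, 10000)
```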
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and used to replace the corresponding candidate words to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text and improves the semantic accuracy of text data enhancement. Moreover, selecting the N highest-ranked replacement words according to each replacement word's predicted probability also guarantees the quality of the generated extended text. In addition, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 4 is a schematic flowchart of yet another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 4, the text data enhancement method may include the following steps:
  • For the description of steps 401 to 406, please refer to the detailed description of steps 301 to 306 in the third embodiment, which is not repeated here.
  • In steps 407 to 409, for example, for a text meaning "Send me the information", it can be recognized that the language of the text is Chinese; the text is translated from Chinese into English to obtain the first translation "Send me the information"; the translation is then translated from English back into Chinese, yielding a new extended text meaning "Send the information to me". It can be seen that by implementing steps 407 to 409 and using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text, the second extended text can be guaranteed to be semantically consistent with the first extended text, and the generation of extended text can be broadened across multiple language types.
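  • A sketch of this back-translation step; `translate` is a hypothetical stand-in, since the source names no specific translation tool or API:

```python
def translate(text, src, dst):
    """Hypothetical stand-in for any machine-translation tool or API;
    the patent does not name a specific one."""
    raise NotImplementedError

def back_translate(first_extended_text, first_language, pivot_language):
    # Step 407: the first language of the extended text is recognized upstream
    # and passed in here as first_language.
    # Step 408: translate into a different language to obtain the first translation.
    first_translation = translate(first_extended_text,
                                  src=first_language, dst=pivot_language)
    # Step 409: translate back into the first language; the result is a
    # second extended text, semantically consistent but differently worded.
    return translate(first_translation, src=pivot_language, dst=first_language)
```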
  • For the target extended text, the random noise is trained through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained by training the random noise from the target extended text, where the target extended text is any one of the above N second extended texts.
  • By implementing steps 410 to 412 and using a generative adversarial network (GAN) built from a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second extended text, third extended texts whose data distribution is close to that of the second extended text can be generated. This is not limited by the constraints of human thinking and, on the basis of the existing extended text, can further expand a rich variety of new texts.
  • As an optional implementation, step 411 may specifically include:
  • Random noise is input into the generator to generate a sentence sample obtained by training the random noise. The sentence sample and the target extended text are input into the discriminator, so that the discriminator performs convolution and pooling operations on them, extracts the sentence-sample feature information of the sentence sample and the real-text feature information of the target extended text, and combines the two kinds of feature information to determine whether the sentence sample can be distinguished from the target extended text. The discrimination result output by the discriminator is then obtained. If the result indicates that the discriminator can distinguish the sentence sample from the target extended text, the loss function of the discriminator is obtained and input into the generator to generate a new sentence sample, and the step of inputting the sentence sample and the target extended text into the discriminator is executed again; otherwise, it is determined that the discriminator cannot distinguish the sentence sample from the target extended text.
  • Among them, the generator is a long short-term memory network model used to simulate the real data distribution of the target extended text, and the discriminator is a convolutional neural network model.
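  • A PyTorch sketch of the adversarial loop described above, under two stated assumptions: the generator emits continuous word-vector matrices (sidestepping discrete-token sampling, which the patent does not detail), and training stops once the discriminator's outputs hover near 0.5 for both real and generated inputs:

```python
import torch

def train_until_indistinguishable(generator, discriminator, real_texts,
                                  g_opt, d_opt, noise_dim=100, max_steps=10000):
    """Adversarial loop sketch: real_texts is a batch of target extended texts
    as word-vector matrices of shape (batch, k, T); the generator is assumed
    to map noise of shape (batch, noise_dim) to the same shape."""
    bce = torch.nn.BCELoss()
    samples = None
    for step in range(max_steps):
        noise = torch.randn(real_texts.size(0), noise_dim)
        samples = generator(noise)                      # sentence samples

        # Discriminator step: target extended texts vs. generated samples.
        d_real = discriminator(real_texts)
        d_fake = discriminator(samples.detach())
        d_loss = (bce(d_real, torch.ones_like(d_real))
                  + bce(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # Generator step: the discriminator's loss signal drives new samples.
        d_out = discriminator(samples)
        g_loss = bce(d_out, torch.ones_like(d_out))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

        # Stop once the discriminator cannot tell the two apart (outputs ~0.5).
        if (abs(d_real.mean().item() - 0.5) < 0.05
                and abs(d_fake.mean().item() - 0.5) < 0.05):
            break
    return samples  # optimized sentence samples, i.e. candidate third extended texts
```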
  • The target extended text input to the discriminator can be expressed as a matrix X ∈ R^(k×T), where T is the length of the target extended text, each column of the matrix X consists of the word vector of a word in the target extended text, and k is the dimension of the word vectors.
  • The convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target extended text.
  • In the convolution layer, the discriminator applies the convolution kernel to consecutive words in the target extended text, and a max-pooling layer is then connected to extract the important features of the text, obtaining the real-text feature information of the target extended text.
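  • A PyTorch sketch of such a discriminator; treating the word-vector dimension k as input channels makes the 1D convolution span the full vector width while the kernel size covers h consecutive words. All dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

k, T, h, num_filters = 50, 20, 3, 64     # illustrative dimensions

discriminator = nn.Sequential(
    # 1D convolution over h consecutive words; the k-dimensional word vectors
    # are treated as input channels, so the kernel spans the full vector width.
    nn.Conv1d(in_channels=k, out_channels=num_filters, kernel_size=h),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),             # max pooling extracts salient features
    nn.Flatten(),
    nn.Linear(num_filters, 1),
    nn.Sigmoid(),                        # probability that the input is real text
)

X = torch.randn(1, k, T)                 # one target extended text, X in R^(k x T)
print(discriminator(X).shape)            # torch.Size([1, 1])
```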
  • When this is achieved, the data distribution of the sentence sample is close to the data distribution of the target extended text, and the optimized sentence sample is output as a third extended text, which also improves the semantic accuracy of the extended text.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text not only guarantees semantic consistency between the second extended text and the first extended text, but also broadens the generation of extended text across multiple language types. Finally, using the generative adversarial network based on the long short-term memory network model and the convolutional neural network model to simulate the data distribution of the second extended text and generate third extended texts close to that distribution is not limited by the constraints of human thinking and further expands a rich variety of new texts on the basis of the existing extended text.
  • FIG. 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device may include: a text acquisition unit 501, a word segmentation unit 502, a replacement word acquisition unit 503, and a text generation unit 504.
  • The text acquisition unit 501 is configured to acquire the original text; the word segmentation unit 502 is configured to perform word segmentation on the original text to obtain several candidate words; the replacement word acquisition unit 503 is configured to obtain, for the target candidate word and based on its context information, N replacement words from the preset dictionary by using a bidirectional long short-term memory network model, where the target candidate word is any one of the above candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and the text generation unit 504 is configured to generate N first extended texts based on the N replacement words and the original text.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement; and since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device shown in FIG. 6 is an optimization of the text data enhancement device shown in FIG. 5. Compared with the device shown in FIG. 5, in the text data enhancement device shown in FIG. 6:
  • The replacement word acquisition unit 503 includes: a forward encoding subunit 5031 configured to, for the target candidate word and based on the word order information of the original text, forward-encode the context information of the target candidate word from left to right to obtain forward encoding information; a backward encoding subunit 5032 configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and a replacement word acquisition subunit 5033 configured to obtain N replacement words from the preset dictionary by using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
  • The replacement word acquisition subunit 5033 includes: a prediction unit 50331 configured to predict, based on the forward encoding information and the backward encoding information and using the bidirectional long short-term memory network model, the predicted probability of each replacement word in the preset dictionary, where a replacement word is a word in the preset dictionary whose semantic label matches the semantic label corresponding to the original text; and an acquisition unit 50332 configured to sort all replacement words in the preset dictionary from largest to smallest according to the predicted probability corresponding to each replacement word and obtain the top N replacement words.
  • The text generation unit 504 is specifically configured to replace the target candidate word in the original text with each of the above N replacement words, based on the position information of the target candidate word in the original text, to generate N first extended texts.
  • It can be seen that by dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary and used to replace the corresponding candidate words to generate extended text, ensuring that the semantic type of the extended text is consistent with that of the original text and improving the semantic accuracy of text data enhancement; selecting the N highest-ranked replacement words according to each replacement word's predicted probability also guarantees the quality of the generated extended text; and since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
  • FIG. 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.
  • The text data enhancement device shown in FIG. 7 is an optimization of the text data enhancement device shown in FIG. 6. Compared with the device shown in FIG. 6, the text data enhancement device shown in FIG. 7 may further include: a recognition unit 505, a first translation unit 506, a second translation unit 507, a noise generation unit 508, and a training unit 509.
  • The recognition unit 505 is used to recognize the first language corresponding to the N first extended texts.
  • The first translation unit 506 is used to translate the N first extended texts from the first language into another language different from the first language to obtain N first translations.
  • The second translation unit 507 is used to translate the above N first translations from the other language back into the first language to obtain N second extended texts.
  • The noise generation unit 508 is used to generate random noise.
  • The training unit 509 is used to, for the target extended text, train the random noise through the generator and the discriminator until the discriminator cannot distinguish the sentence samples obtained by training the random noise from the target extended text; the target extended text is any one of the above N second extended texts, the generator is a long short-term memory network model for simulating the real data distribution of the target extended text, and the discriminator is a convolutional neural network model.
  • The training unit 509 includes: a sample generation subunit 5091 configured to, for the target extended text, input random noise into the generator and generate a sentence sample obtained by training the random noise; a discrimination subunit 5092 configured to input the sentence sample and the target extended text into the discriminator, so that the discriminator performs convolution and pooling operations on them, extracts the sentence-sample feature information of the sentence sample and the real-text feature information of the target extended text, and combines the two kinds of feature information to determine whether the sentence sample can be distinguished from the target extended text; an acquisition subunit 5093 configured to obtain the discrimination result output by the discriminator; and a training subunit 5094 configured to, when the discrimination result indicates that the discriminator can distinguish the sentence sample from the target extended text, obtain the loss function of the discriminator and input it into the generator to generate a new sentence sample, triggering the discrimination subunit 5092 to execute again the step of inputting the sentence sample and the target extended text into the discriminator; otherwise, it is determined that the discriminator cannot distinguish the sentence sample from the target extended text.
  • In this embodiment, the target extended text input to the discriminator can be expressed as a matrix X ∈ R^(k×T), where T is the length of the target extended text, each column of the matrix X consists of the word vector of a word in the target extended text, and k is the dimension of the word vectors.
  • The convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target extended text.
  • In the convolution layer, the discriminator applies the convolution kernel to consecutive words in the target extended text, and a max-pooling layer is then connected to extract the important features of the text, obtaining the real-text feature information of the target extended text.
  • It can be seen that using a translation tool to perform text data enhancement on the first extended text to obtain the second extended text not only ensures semantic consistency between the second extended text and the first extended text, but also broadens the generation of extended text across multiple language types; further, using the generative adversarial network composed of the long short-term memory network model and the convolutional neural network model to simulate the data distribution of the second extended text and generate third extended texts close to that distribution is not limited by the constraints of human thinking and further expands a rich variety of new texts on the basis of the existing extended text.
  • This application also provides an electronic device, which includes: a processor; and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.
  • The electronic device may be the apparatus 100 shown in FIG. 1.
  • The present application also provides a non-volatile computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the text data enhancement method described above is implemented.

Abstract

A text data enhancement method and apparatus, and an electronic device, relating to the technical field of machine learning. The method comprises: acquiring original text (201); performing word segmentation on the original text to obtain several candidate words (202); for a target candidate word, acquiring N replacement words from a preset dictionary by using a bidirectional long short-term memory network model on the basis of context information of the target candidate word (203), wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text (204). The method can improve the semantic accuracy of text data enhancement.

Description

Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

This application claims priority to Chinese patent application No. 201910350209.3, filed on April 28, 2019 and entitled "A text data enhancement method and apparatus, and electronic device", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of machine learning technology, and in particular to a text data enhancement method and apparatus, an electronic device, and a non-volatile computer-readable storage medium.

Background

In the field of machine learning, data augmentation is an important means of expanding a training set. It is often used to generate more new data to train a model, so that the model is more accurate and generalizes better. The core point of data augmentation is to replace the original data with new data while ensuring that the new data and the original data belong to the same category. For data augmentation applied to images, this is easy to achieve: a new image obtained by horizontally flipping, randomly cropping, or adjusting the RGB channels of the original image still has the same content category as the original image.

The inventor of this application realized that for data augmentation applied to text, because the context within a text is interrelated, blindly reversing, truncating, or substituting parts of the original text will change its semantics, so the semantic accuracy of text data enhancement is low.

Summary of the invention

To solve the above technical problems, this application provides a text data enhancement method and apparatus, and an electronic device.

The technical solutions adopted in this application are as follows:

In one aspect, a text data enhancement method includes: obtaining original text; performing word segmentation on the original text to obtain several candidate words; for a target candidate word, based on the context information of the target candidate word, using a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and generating N first extended texts according to the N replacement words and the original text.

In another aspect, a text data enhancement apparatus includes: a text acquisition unit for obtaining the original text; a word segmentation unit for performing word segmentation on the original text to obtain several candidate words; a replacement word acquisition unit for obtaining, for a target candidate word and based on its context information, N replacement words from a preset dictionary by using a bidirectional long short-term memory network model, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and a text generation unit for generating N first extended texts according to the N replacement words and the original text.

In another aspect, an electronic device includes a processor and a memory storing computer-readable instructions that, when executed by the processor, implement the text data enhancement method described above.

In another aspect, a non-volatile computer-readable storage medium stores a computer program that, when executed by a processor, implements the text data enhancement method described above.

The technical solutions provided by the embodiments of the present application may include the following beneficial effects:

In the above technical solutions, by segmenting the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and each replacement word can be used to replace the corresponding candidate word to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text, improving the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
Description of the drawings

The drawings herein are incorporated into and constitute a part of the specification, show embodiments consistent with this application, and are used together with the specification to explain the principles of this application.

Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application;

Figure 2 is a flowchart of a text data enhancement method disclosed in an embodiment of the present application;

Figure 3 is a flowchart of another text data enhancement method disclosed in an embodiment of the present application;

Figure 4 is a flowchart of yet another text data enhancement method disclosed in an embodiment of the present application;

Figure 5 is a schematic structural diagram of a text data enhancement device disclosed in an embodiment of the present application;

Figure 6 is a schematic structural diagram of another text data enhancement device disclosed in an embodiment of the present application;

Figure 7 is a schematic structural diagram of yet another text data enhancement device disclosed in an embodiment of the present application.

Detailed description

Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Embodiment one

The implementation environment of this application may be an electronic device, such as a smartphone, a tablet computer, or a desktop computer.

Figure 1 is a schematic structural diagram of a device disclosed in an embodiment of the present application. The apparatus 100 may be the aforementioned electronic device. As shown in Figure 1, the device 100 may include one or more of the following components: a processing component 102, a memory 104, a power supply component 106, a multimedia component 108, an audio component 110, a sensor component 114, and a communication component 116.

The processing component 102 generally controls the overall operations of the device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 102 may include one or more processors 118 to execute instructions to complete all or part of the steps of the methods below. In addition, the processing component 102 may include one or more modules to facilitate the interaction between the processing component 102 and other components; for example, it may include a multimedia module to facilitate the interaction between the multimedia component 108 and the processing component 102.

The memory 104 is configured to store various types of data to support operations in the device 100, for example instructions for any application or method operating on the device 100. The memory 104 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The memory 104 also stores one or more modules configured to be executed by the one or more processors 118 to complete all or part of the steps of the methods shown below.

The power supply component 106 provides power to the various components of the device 100 and may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 100.

The multimedia component 108 includes a screen that provides an output interface between the device 100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel; the touch sensors can sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. The screen may also include an organic light-emitting display (OLED).

The audio component 110 is configured to output and/or input audio signals. For example, the audio component 110 includes a microphone (MIC), which is configured to receive external audio signals when the device 100 is in an operation mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 104 or sent via the communication component 116. In some embodiments, the audio component 110 further includes a speaker for outputting audio signals.

The sensor component 114 includes one or more sensors for providing the device 100 with state evaluations of various aspects. For example, the sensor component 114 can detect the open/closed state of the device 100 and the relative positioning of components, and it can also detect a position change of the device 100 or a component of the device 100 and temperature changes of the device 100. In some embodiments, the sensor component 114 may also include a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 116 is configured to facilitate wired or wireless communication between the apparatus 100 and other devices. The device 100 can access a wireless network based on a communication standard, such as WiFi (Wireless Fidelity). In an embodiment of the present application, the communication component 116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an embodiment of the present application, the communication component 116 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth, and other technologies.

In an exemplary embodiment, the apparatus 100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors, digital signal processing devices, programmable logic devices, field-programmable gate arrays, controllers, microcontrollers, microprocessors, or other electronic components to perform the methods below.
Embodiment two

Please refer to Figure 2, which is a schematic flowchart of a text data enhancement method disclosed in an embodiment of the present application. As shown in Figure 2, the text data enhancement method may include the following steps:

201. Obtain the original text.

202. Perform word segmentation on the original text to obtain several candidate words.

203. For the target candidate word, based on the context information of the target candidate word, use a bidirectional long short-term memory network model to obtain N replacement words from a preset dictionary.

In this embodiment of the application, the target candidate word is any one of the above candidate words, and the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text; N is a positive integer whose value can be configured as needed, which is not specifically limited here.

For example, suppose the original text is "The actors are fantastic". After word segmentation, the four candidate words "the", "actors", "are", and "fantastic" are obtained. It can be understood that the replacement word for the position of any candidate word in the original text should be related to the order, part of speech, and meaning of all candidate words in the original text. Taking the candidate word "actors" as an example, its context information includes the three candidate words "the", "are", and "fantastic". Following the word order of the original text, the candidate words "the", "actors", "are", and "fantastic" form an input sequence arranged in chronological order. Using the bidirectional long short-term memory network model, with the candidate word "the" as forward input before "actors" and the candidate words "are" and "fantastic" as backward input after "actors", multiple replacement words such as "performances", "films", "movies", and "stories" can be obtained from the preset dictionary to replace the candidate word "actors" in the input sequence. At the same time, to ensure semantic consistency, if the original text belongs to a positive semantic type and its semantic label is "positive", then the semantic label corresponding to each replacement word obtained from the preset dictionary should also be "positive", so that the extended text generated by replacing the corresponding candidate word also belongs to the positive semantic type.

Similarly, the above method applies to any of the candidate words "the", "are", and "fantastic", which will not be repeated here.

204. Generate N first extended texts according to the N replacement words and the original text.

In this embodiment of the application, for example, if the three replacement words "performances", "films", and "movies" are obtained for the candidate word "actors" in the original text "The actors are fantastic", then the three extended texts "The performances are fantastic", "The films are fantastic", and "The movies are fantastic" can be generated accordingly.

It can be seen that by implementing the method described in Figure 2 and dividing the original text into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and used to replace the corresponding candidate words to generate extended text. This ensures that the semantic type of the extended text is consistent with that of the original text and improves the semantic accuracy of text data enhancement. Moreover, since each candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of extended text can be generated, improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 3
Please refer to FIG. 3, which is a schematic flowchart of another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 3, the text data enhancement method may include the following steps:
301. Obtain the original text.
302. Perform word segmentation processing on the original text to obtain several candidate words.
303. For the target candidate word, forward-encode the context information of the target candidate word from left to right based on the word order information of the original text to obtain forward encoding information.
304. Backward-encode the context information of the target candidate word from right to left to obtain backward encoding information.
In this embodiment of the present application, the context information of the target candidate word is forward-encoded mainly as follows: the candidate words included in the context information of the target candidate word are numbered forward from left to right; a forward word vector is generated according to the forward numbering information of each candidate word; and the forward word vectors are mapped into a forward word-vector mapping matrix using pre-trained word vector parameters, which serves as the forward encoding information.
Similarly, the context information of the target candidate word is backward-encoded as follows: the candidate words included in the context information of the target candidate word are numbered backward from right to left; a backward word vector is generated according to the backward numbering information of each candidate word; and the backward word vectors are mapped into a backward word-vector mapping matrix using the pre-trained word vector parameters, which serves as the backward encoding information.
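As an illustration of steps 303 and 304, the sketch below numbers the context words in each direction, looks each one up in a pre-trained embedding table, and stacks the resulting vectors into the forward and backward word-vector mapping matrices. NumPy, the toy vocabulary, and the random embedding table are all assumptions made for the sketch, not details fixed by the disclosure.

    import numpy as np

    embedding_table = np.random.rand(10000, 128)       # stand-in for pre-trained word vectors
    word_to_id = {"the": 0, "are": 1, "fantastic": 2}  # toy vocabulary (illustrative)

    def encode_context(context_words, backward=False):
        # Number the context words left-to-right (forward) or right-to-left
        # (backward), then map each word to its pre-trained vector; the rows
        # of the result form the word-vector mapping matrix.
        ordered = list(reversed(context_words)) if backward else list(context_words)
        ids = [word_to_id[word.lower()] for word in ordered]
        return embedding_table[ids]

    context = ["the", "are", "fantastic"]                            # context of "actors"
    forward_encoding_info = encode_context(context)                  # step 303
    backward_encoding_info = encode_context(context, backward=True)  # step 304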
305. Based on the forward encoding information and the backward encoding information, obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model.
It can be seen that, by implementing steps 303 to 305 above, the context information of the target candidate word is forward-encoded and backward-encoded separately, and the forward and backward encoding information is input into the bidirectional long short-term memory network model. When the model predicts the replacement word corresponding to the position of the target candidate word, the context information of the target candidate word is thus fully taken into account, improving the semantic accuracy of the obtained replacement words.
As an optional implementation, step 305 may specifically include:
predicting, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary using the bidirectional long short-term memory network model, where a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
It can be seen that, by implementing this optional implementation, the bidirectional long short-term memory network model can predict, for every replacement word in the preset dictionary, the probability of that word appearing at the position of the target candidate word, and filtering out the N top-ranked replacement words based on these predicted probabilities can further improve the semantic accuracy of the obtained replacement words and ensure the quality of the generated expanded text.
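A minimal sketch of the sorting-and-selection part of this optional implementation, assuming the model has already produced a predicted probability for each label-matched replacement word (all names and numbers are illustrative):

    def top_n_replacements(predicted_probabilities, n):
        # predicted_probabilities: dict mapping each replacement word in the
        # preset dictionary to its predicted probability. Returns the N words
        # with the highest probability, in descending order.
        ranked = sorted(predicted_probabilities.items(),
                        key=lambda item: item[1], reverse=True)
        return [word for word, _ in ranked[:n]]

    probabilities = {"performances": 0.31, "films": 0.24, "movies": 0.19, "stories": 0.08}
    print(top_n_replacements(probabilities, 3))  # ['performances', 'films', 'movies']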
In this embodiment of the present application, assume that the i+1 (i is a positive integer) words obtained after word segmentation of a text form an input sequence (x_0, x_1, x_2, ..., x_i), which can be input into the bidirectional long short-term memory network model. In the model, for any word x_t (t ∈ [0, i]) in the input sequence, after the context information of x_t is forward-encoded from left to right, the forward computation result s_t can be obtained from the forward encoding information and the word x_t using the formula s_t = f(U·x_t + W·s_{t-1}); after the context information of x_t is backward-encoded from right to left, the backward computation result s'_t can be obtained from the backward encoding information and the word x_t using the formula s'_t = f(U'·x_t + W'·s'_{t+1}); finally, substituting s_t and s'_t into the formula y_t = g(V·s_t + V'·s'_t) yields the predicted probability of the word x_t, where U, W, U', W', V, and V' are all parameters of the bidirectional long short-term memory network model.
Therefore, after word segmentation is performed on the original text to obtain several candidate words, these candidate words can form an input candidate-word sequence. By replacing a specific candidate word in the input candidate-word sequence with a replacement word from the preset dictionary and then feeding the replaced sequence into the above bidirectional long short-term memory network model, the predicted probability of that replacement word can be obtained, and N replacement words can then be filtered out of all replacement words in the preset dictionary according to the predicted probability corresponding to each of them.
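One direct way to realize the recurrences written above is the NumPy sketch below. The disclosure names f and g only abstractly, so tanh and softmax are assumptions here; note also that the formulas as given describe a simple bidirectional recurrent network, whereas a full LSTM cell would additionally contain input, forget, and output gates.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def bidirectional_probs(xs, U, W, Up, Wp, V, Vp, f=np.tanh):
        # xs: list of word vectors (x_0 .. x_i). Implements
        # s_t = f(U x_t + W s_{t-1}), s'_t = f(U' x_t + W' s'_{t+1}),
        # y_t = g(V s_t + V' s'_t), with g taken to be softmax.
        h = U.shape[0]
        s = [np.zeros(h) for _ in xs]    # forward results s_t
        sp = [np.zeros(h) for _ in xs]   # backward results s'_t
        for t in range(len(xs)):                       # left-to-right pass
            prev = s[t - 1] if t > 0 else np.zeros(h)
            s[t] = f(U @ xs[t] + W @ prev)
        for t in reversed(range(len(xs))):             # right-to-left pass
            nxt = sp[t + 1] if t + 1 < len(xs) else np.zeros(h)
            sp[t] = f(Up @ xs[t] + Wp @ nxt)
        return [softmax(V @ s[t] + Vp @ sp[t]) for t in range(len(xs))]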
306. Based on the position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the above N replacement words to generate N first expanded texts.
It can be seen that, by implementing the method described in FIG. 3, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, filtering out the N top-ranked replacement words from all replacement words based on the predicted probability corresponding to each replacement word also guarantees the quality of the generated expanded text. In addition, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 4
Please refer to FIG. 4, which is a schematic flowchart of yet another text data enhancement method disclosed in an embodiment of the present application. As shown in FIG. 4, the text data enhancement method may include the following steps:
Steps 401 to 406; for the description of steps 401 to 406, please refer to the detailed description of steps 301 to 306 in Embodiment 3, and details are not repeated in this embodiment of the present application.
407. Identify the first language corresponding to the above N first expanded texts.
408. Translate the above N first expanded texts from the first language into another language different from the first language to obtain N first translations.
409. Translate the above N first translations from the other language back into the first language to obtain N second expanded texts.
In this embodiment of the present application, regarding steps 407 to 409, for example, for the text "你把资料发给我吧", the language of the text can be identified as Chinese; translating the text from Chinese into English yields the translation "Send me the information"; translating the translation from English back into Chinese then yields the new expanded text "把信息发给我". It can be seen that, by implementing steps 407 to 409 above, text data enhancement is performed on the first expanded texts using a translation tool to obtain the second expanded texts, which both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types.
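The round trip of steps 407 to 409 can be written as a short helper. The detect_language() and translate() functions below are hypothetical stand-ins for whatever language-identification and translation tool is used; the tiny phrase table merely replays the example from the text.

    def detect_language(text):
        # Toy language identifier: treat any CJK character as Chinese.
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"

    def translate(text, src, dst):
        # Hypothetical stand-in for a real translation tool.
        phrase_table = {
            ("你把资料发给我吧", "zh", "en"): "Send me the information",
            ("Send me the information", "en", "zh"): "把信息发给我",
        }
        return phrase_table.get((text, src, dst), text)

    def back_translate(first_expanded_texts):
        second_expanded_texts = []
        for text in first_expanded_texts:
            first_language = detect_language(text)                # step 407
            other = "en" if first_language == "zh" else "zh"
            translation = translate(text, first_language, other)  # step 408
            second_expanded_texts.append(
                translate(translation, other, first_language))    # step 409
        return second_expanded_texts

    print(back_translate(["你把资料发给我吧"]))  # ['把信息发给我']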
410. Generate random noise.
411. For the target expanded text, train on the random noise through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained after training the random noise from the target expanded text.
In this embodiment of the present application, the target expanded text is any second expanded text among the above N second expanded texts.
412. Use the sentence samples as third expanded texts.
It can be seen that, by implementing steps 410 to 412 above, a generative adversarial network (GAN) based on a long short-term memory network model and a convolutional neural network model is used to simulate the data distribution of the second expanded texts and to generate third expanded texts whose data distribution is close to that of the second expanded texts. This is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
As an optional implementation, step 411 may specifically include:
for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise; inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs convolution and pooling operations on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text; obtaining the discrimination result output by the discriminator; and, if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining the loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
Here, the generator is a long short-term memory network model for simulating the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model. Taking the target expanded text as an example, the target expanded text input into the discriminator can be represented as a matrix X ∈ R^(k×T), where T is the length of the target expanded text, each column of the matrix X consists of the word vector of a word in the target expanded text, and k is the dimension of the word vectors. Optionally, the convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target expanded text. After the discriminator uses the convolution kernel to perform a convolution operation on consecutive words in the target expanded text in the convolutional layer, it is connected to a max-pooling layer for extracting the important features of the text, whereby the real-text feature information of the target expanded text can be obtained.
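The following PyTorch-style sketch puts this optional implementation together: an LSTM generator turns random noise into a k×T word-vector matrix (a sentence sample), the CNN discriminator described above convolves over consecutive words and max-pools, and the two are trained adversarially, with the discriminator's loss signal driving new sentence samples. All layer sizes, the BCE loss, and the Adam optimizers are assumptions; the disclosure does not fix them.

    import torch
    import torch.nn as nn

    K, T, NOISE_DIM = 128, 10, 100   # word-vector dim, text length, noise size (assumed)

    class SentenceGenerator(nn.Module):
        # LSTM generator: maps random noise to T word vectors, i.e. a
        # sentence sample imitating the target expanded text.
        def __init__(self):
            super().__init__()
            self.lstm = nn.LSTM(NOISE_DIM, K, batch_first=True)

        def forward(self, z):               # z: (batch, T, NOISE_DIM)
            out, _ = self.lstm(z)           # (batch, T, K)
            return out.transpose(1, 2)      # -> (batch, K, T) matrix X

    class TextDiscriminator(nn.Module):
        # CNN discriminator: 1D convolution over consecutive words of the
        # k x T word-vector matrix, then max pooling, then a real/fake score.
        def __init__(self, num_filters=64, window=3):
            super().__init__()
            self.conv = nn.Conv1d(K, num_filters, kernel_size=window)
            self.fc = nn.Linear(num_filters, 1)

        def forward(self, x):                          # x: (batch, K, T)
            features = torch.relu(self.conv(x))        # convolution operation
            pooled = features.max(dim=2).values        # max-pooling operation
            return torch.sigmoid(self.fc(pooled))      # probability of "real"

    generator, discriminator = SentenceGenerator(), TextDiscriminator()
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCELoss()

    real = torch.randn(8, K, T)     # stand-in for target expanded texts as matrices
    for step in range(100):         # loop until the discriminator is fooled
        noise = torch.randn(8, T, NOISE_DIM)                   # step 410
        fake = generator(noise)                                # sentence samples
        d_loss = bce(discriminator(real), torch.ones(8, 1)) + \
                 bce(discriminator(fake.detach()), torch.zeros(8, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        g_loss = bce(discriminator(fake), torch.ones(8, 1))    # loss drives new samples
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()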
It can be seen that, by implementing this optional implementation, the generator and the discriminator are trained continuously so that the data distribution of the sentence samples approaches the data distribution of the target expanded text, and the optimized sentence samples are output as third expanded texts, which can also improve the semantic accuracy of the expanded texts.
It can be seen that, by implementing the method described in FIG. 4, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, performing text data enhancement on the first expanded texts using a translation tool to obtain the second expanded texts both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types. Furthermore, using a generative adversarial network based on a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second expanded texts and to generate third expanded texts close to that data distribution is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
Embodiment 5
Please refer to FIG. 5, which is a schematic structural diagram of a text data enhancement apparatus disclosed in an embodiment of the present invention. As shown in FIG. 5, the text data enhancement apparatus may include: a text obtaining unit 501, a word segmentation unit 502, a replacement word obtaining unit 503, and a text generating unit 504, where the text obtaining unit 501 is configured to obtain original text; the word segmentation unit 502 is configured to perform word segmentation processing on the original text to obtain several candidate words; the replacement word obtaining unit 503 is configured to, for a target candidate word, obtain N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on the context information of the target candidate word, where the target candidate word is any one of the several candidate words, the semantic label corresponding to each of the N replacement words matches the semantic label corresponding to the original text, and N is a positive integer; and the text generating unit 504 is configured to generate N first expanded texts according to the above N replacement words and the original text.
It can be seen that, by implementing the apparatus described in FIG. 5, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 6
Please refer to FIG. 6, which is a schematic structural diagram of another text data enhancement apparatus disclosed in an embodiment of the present invention. The text data enhancement apparatus shown in FIG. 6 is obtained by optimizing the text data enhancement apparatus shown in FIG. 5. Compared with the apparatus shown in FIG. 5, in the text data enhancement apparatus shown in FIG. 6:
the replacement word obtaining unit 503 includes: a forward encoding subunit 5031, configured to, for the target candidate word, forward-encode the context information of the target candidate word from left to right based on the word order information of the original text to obtain forward encoding information; a backward encoding subunit 5032, configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and a replacement word obtaining subunit 5033, configured to obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
As an optional implementation, the replacement word obtaining subunit 5033 includes: a prediction unit 50331, configured to predict, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary using the bidirectional long short-term memory network model, where a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and an obtaining unit 50332, configured to sort all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtain the N replacement words ranked in the top N positions. The text generating unit 504 is specifically configured to, based on the position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the above N replacement words to generate N first expanded texts.
It can be seen that, by implementing the apparatus described in FIG. 6, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Moreover, filtering out the N top-ranked replacement words from all replacement words based on the predicted probability corresponding to each replacement word also guarantees the quality of the generated expanded text. In addition, since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, thereby improving the efficiency of text data enhancement while ensuring accuracy.
Embodiment 7
Please refer to FIG. 7, which is a schematic structural diagram of yet another text data enhancement apparatus disclosed in an embodiment of the present invention. The text data enhancement apparatus shown in FIG. 7 is obtained by optimizing the text data enhancement apparatus shown in FIG. 6. Compared with the apparatus shown in FIG. 6, the text data enhancement apparatus shown in FIG. 7 may further include: a recognition unit 505, a first translation unit 506, a second translation unit 507, a noise generating unit 508, and a training unit 509, where the recognition unit 505 is configured to identify the first language corresponding to the above N first expanded texts; the first translation unit 506 is configured to translate the N first expanded texts from the first language into another language different from the first language to obtain N first translations; the second translation unit 507 is configured to translate the N first translations from the other language back into the first language to obtain N second expanded texts; the noise generating unit 508 is configured to generate random noise; and the training unit 509 is configured to, for a target expanded text, train on the random noise through a generator and a discriminator until the discriminator cannot distinguish the sentence samples obtained after training the random noise from the target expanded text, where the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and to use the sentence samples as third expanded texts.
As an optional implementation, the training unit 509 includes: a sample generating subunit 5091, configured to, for the target expanded text, input the random noise into the generator to generate a sentence sample obtained after training the random noise; a discrimination subunit 5092, configured to input the sentence sample and the target expanded text into the discriminator, so that the discriminator performs convolution and pooling operations on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text; an obtaining subunit 5093, configured to obtain the discrimination result output by the discriminator; and a training subunit 5094, configured to, when the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtain the loss function of the discriminator and input the loss function into the generator to generate a new sentence sample, so as to trigger the discrimination subunit 5092 to execute the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determine that the discriminator cannot distinguish the sentence sample from the target expanded text, and use the sentence sample as the third expanded text.
In this embodiment of the present invention, taking the target expanded text as an example, the target expanded text input into the discriminator can be represented as a matrix X ∈ R^(k×T), where T is the length of the target expanded text, each column of the matrix X consists of the word vector of a word in the target expanded text, and k is the dimension of the word vectors. Optionally, the convolution kernel of the discriminator is a 1D convolution, and the width h of the convolution kernel matches the word-vector width of the words in the target expanded text. After the discriminator uses the convolution kernel to perform a convolution operation on consecutive words in the target expanded text in the convolutional layer, it is connected to a max-pooling layer for extracting the important features of the text, whereby the real-text feature information of the target expanded text can be obtained.
It can be seen that, by implementing the apparatus described in FIG. 7, the original text is divided into several candidate words, replacement words can be obtained from the preset dictionary based on the context information of any candidate word and the semantic type of the original text, and the corresponding candidate word is replaced with each replacement word to generate expanded text, ensuring that the semantic type of the expanded text is consistent with that of the original text and improving the semantic accuracy of text data enhancement. Since every candidate word can be replaced by multiple new words, the ways of replacing and combining words are greatly enriched and a large amount of expanded text can be generated, improving the efficiency of text data enhancement while ensuring accuracy. In addition, performing text data enhancement on the first expanded texts using a translation tool to obtain the second expanded texts both ensures the semantic consistency between the second expanded texts and the first expanded texts and broadens the ways of generating expanded text based on multiple language types. Furthermore, using a generative adversarial network based on a long short-term memory network model and a convolutional neural network model to simulate the data distribution of the second expanded texts and to generate third expanded texts close to that data distribution is not constrained by the limits of human thinking and, on the basis of the existing expanded texts, further expands a rich variety of new texts.
The present application further provides an electronic device, the electronic device including:
a processor; and
a memory storing computer-readable instructions which, when executed by the processor, implement the text data enhancement method described above. The electronic device may be the apparatus 100 shown in FIG. 1.
In an exemplary embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the text data enhancement method described above is implemented.
It should be understood that the present application is not limited to the precise structures that have been described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (28)

1. A text data enhancement method, comprising:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
2. The method according to claim 1, wherein the obtaining, for a target candidate word, N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word comprises:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
3. The method according to claim 2, wherein the obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information comprises:
    predicting a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
4. The method according to claim 1, wherein after the generating N first expanded texts according to the N replacement words and the original text, the method further comprises:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    translating the N first translations from the other language into the first language to obtain N second expanded texts.
5. The method according to claim 4, wherein after the obtaining N second expanded texts, the method further comprises:
    generating random noise;
    for a target expanded text, training on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and
    using the sentence sample as a third expanded text.
6. The method according to claim 5, wherein the training, for a target expanded text, on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text comprises:
    for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    obtaining a discrimination result output by the discriminator; and
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
7. The method according to any one of claims 1 to 6, wherein the generating N first expanded texts according to the N replacement words and the original text comprises:
    replacing, based on position information of the target candidate word in the original text, the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
8. A text data enhancement apparatus, comprising:
    a text obtaining unit, configured to obtain original text;
    a word segmentation unit, configured to perform word segmentation processing on the original text to obtain several candidate words;
    a replacement word obtaining unit, configured to, for a target candidate word, obtain N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    a text generating unit, configured to generate N first expanded texts according to the N replacement words and the original text.
9. The apparatus according to claim 8, wherein the replacement word obtaining unit comprises:
    a forward encoding subunit, configured to, for the target candidate word, forward-encode the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    a backward encoding subunit, configured to backward-encode the context information of the target candidate word from right to left to obtain backward encoding information; and
    a replacement word obtaining subunit, configured to obtain N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
10. The apparatus according to claim 9, wherein the replacement word obtaining subunit comprises:
    a prediction unit, configured to predict a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    an obtaining unit, configured to sort all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtain the N replacement words ranked in the top N positions.
11. The apparatus according to claim 8, wherein the apparatus further comprises:
    a recognition unit, configured to identify a first language corresponding to the N first expanded texts;
    a first translation unit, configured to translate the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    a second translation unit, configured to translate the N first translations from the other language into the first language to obtain N second expanded texts.
12. The apparatus according to claim 11, wherein the apparatus further comprises:
    a noise generating unit, configured to generate random noise; and
    a training unit, configured to, for a target expanded text, train on the random noise through a generating unit and a discriminating unit until the discriminating unit cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generating unit is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminating unit is a convolutional neural network model; and to use the sentence sample as a third expanded text.
13. The apparatus according to claim 12, wherein the training unit comprises:
    a sample generating subunit, configured to, for the target expanded text, input the random noise into the generating unit to generate a sentence sample obtained after training the random noise;
    a discrimination subunit, configured to input the sentence sample and the target expanded text into the discriminating unit, so that the discriminating unit performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    an obtaining subunit, configured to obtain a discrimination result output by the discriminating unit; and
    a training subunit, configured to, when the discrimination result indicates that the discriminating unit can distinguish the sentence sample from the target expanded text, obtain a loss function of the discriminating unit and input the loss function into the generating unit to generate a new sentence sample, and execute the step of inputting the sentence sample and the target expanded text into the discriminating unit; otherwise, determine that the discriminating unit cannot distinguish the sentence sample from the target expanded text.
14. The apparatus according to any one of claims 8 to 13, wherein the text generating unit is configured to, based on position information of the target candidate word in the original text, replace the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
15. An electronic device, comprising:
    a processor; and
    a memory storing computer-readable instructions which, when executed by the processor, configure the processor to implement the following steps:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
16. The electronic device according to claim 15, wherein, for the obtaining, for a target candidate word, of N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, the processor is configured to implement the following steps:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
17. The electronic device according to claim 16, wherein, for the obtaining of N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, the processor is configured to implement the following steps:
    predicting a predicted probability of each replacement word in the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text; and
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N replacement words ranked in the top N positions.
18. The electronic device according to claim 15, wherein, after the generating N first expanded texts according to the N replacement words and the original text, the processor is further configured to implement the following steps:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language to obtain N first translations; and
    translating the N first translations from the other language into the first language to obtain N second expanded texts.
19. The electronic device according to claim 18, wherein, after the obtaining N second expanded texts, the processor is further configured to implement the following steps:
    generating random noise;
    for a target expanded text, training on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any second expanded text among the N second expanded texts, the generator is a long short-term memory network model for simulating a real data distribution of the target expanded text, and the discriminator is a convolutional neural network model; and
    using the sentence sample as a third expanded text.
20. The electronic device according to claim 19, wherein, for the training, for a target expanded text, on the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, the processor is configured to implement the following steps:
    for the target expanded text, inputting the random noise into the generator to generate a sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence-sample feature information of the sentence sample and real-text feature information of the target expanded text, and judges, by combining the sentence-sample feature information and the real-text feature information, whether the sentence sample can be distinguished from the target expanded text;
    obtaining a discrimination result output by the discriminator; and
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and executing the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
21. The electronic device according to any one of claims 15 to 20, wherein, for the generating N first expanded texts according to the N replacement words and the original text, the processor is configured to implement the following step:
    replacing, based on position information of the target candidate word in the original text, the target candidate word in the original text with each of the N replacement words to generate the N first expanded texts.
22. A non-volatile computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the processor is configured to implement the following steps:
    obtaining original text;
    performing word segmentation processing on the original text to obtain several candidate words;
    for a target candidate word, obtaining N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, wherein the target candidate word is any one of the several candidate words, a semantic label corresponding to each of the N replacement words matches a semantic label corresponding to the original text, and N is a positive integer; and
    generating N first expanded texts according to the N replacement words and the original text.
23. The non-volatile computer-readable storage medium according to claim 22, wherein, for the obtaining, for a target candidate word, of N replacement words from a preset dictionary using a bidirectional long short-term memory network model based on context information of the target candidate word, the processor is configured to implement the following steps:
    for the target candidate word, forward-encoding the context information of the target candidate word from left to right based on word order information of the original text to obtain forward encoding information;
    backward-encoding the context information of the target candidate word from right to left to obtain backward encoding information; and
    obtaining N replacement words from the preset dictionary using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information.
  24. The computer non-volatile readable storage medium according to claim 23, wherein, for said obtaining the N replacement words from the preset dictionary by using the bidirectional long short-term memory network model based on the forward encoding information and the backward encoding information, the processor is configured to implement the following steps:
    predicting, based on the forward encoding information and the backward encoding information, a predicted probability for each replacement word in the preset dictionary by using the bidirectional long short-term memory network model, wherein a replacement word is a word in the preset dictionary whose corresponding semantic label matches the semantic label corresponding to the original text;
    sorting all replacement words in the preset dictionary in descending order according to the predicted probability corresponding to each replacement word, and obtaining the N top-ranked replacement words.
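The ranking step of claim 24 reduces to a softmax over the model's output followed by a top-N selection. A minimal sketch, assuming PyTorch and a `dictionary` list already restricted to words whose semantic label matches the original text:

```python
import torch

def top_n_replacements(logits, dictionary, n):
    """logits: BiLSTM scores over the preset dictionary."""
    probs = torch.softmax(logits, dim=-1)        # predicted probability per word
    top_probs, top_idx = torch.topk(probs, k=n)  # topk sorts in descending order
    return [dictionary[i] for i in top_idx.tolist()]
```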
  25. The computer non-volatile readable storage medium according to claim 22, wherein, after said generating N first expanded texts according to the N replacement words and the original text, the processor is further configured to implement the following steps:
    identifying a first language corresponding to the N first expanded texts;
    translating the N first expanded texts from the first language into another language different from the first language, to obtain N first translations;
    translating the N first translations from the other language into the first language, to obtain N second expanded texts.
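The round trip of claim 25 is commonly known as back-translation. The sketch below is illustrative only: `detect_language` and `translate` are hypothetical stand-ins for whatever language-identification and machine-translation services an implementation would call.

```python
def back_translate(first_expanded_texts, pivot_language="en"):
    first_language = detect_language(first_expanded_texts[0])  # e.g. "zh"
    # First pass: first language -> pivot language (the N first translations)
    first_translations = [translate(t, src=first_language, dst=pivot_language)
                          for t in first_expanded_texts]
    # Second pass: pivot language -> first language (the N second expanded texts)
    return [translate(t, src=pivot_language, dst=first_language)
            for t in first_translations]
```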
  26. The computer non-volatile readable storage medium according to claim 25, wherein, after said obtaining N second expanded texts, the processor is further configured to implement the following steps:
    generating random noise;
    for a target expanded text, training the random noise through a generator and a discriminator until the discriminator cannot distinguish a sentence sample obtained after training the random noise from the target expanded text, wherein the target expanded text is any one of the N second expanded texts, the generator is a long short-term memory network model used to simulate the real data distribution of the target expanded text, and the discriminator is a convolutional neural network model;
    using the sentence sample as a third expanded text.
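Claim 26 names two concrete model families: an LSTM generator that maps random noise to a sentence sample, and a CNN discriminator built from convolution and pooling. A sketch of both follows, assuming PyTorch; every layer size is an illustrative assumption:

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """LSTM that unrolls a noise vector into per-position word scores."""
    def __init__(self, noise_dim=64, hidden_dim=128, vocab_size=10000, seq_len=20):
        super().__init__()
        self.seq_len = seq_len
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, noise):  # noise: (batch, noise_dim)
        steps = noise.unsqueeze(1).repeat(1, self.seq_len, 1)
        h, _ = self.lstm(steps)
        return self.out(h)     # (batch, seq_len, vocab_size)

class TextDiscriminator(nn.Module):
    """CNN that convolves and pools text features before scoring real/fake."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, 32, kernel_size=3)  # convolution op
        self.pool = nn.AdaptiveMaxPool1d(1)                  # pooling op
        self.fc = nn.Linear(32, 1)

    def forward(self, embedded):  # embedded: (batch, embed_dim, seq_len)
        features = self.pool(torch.relu(self.conv(embedded))).squeeze(-1)
        return self.fc(features)  # real/fake score
```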
  27. The computer non-volatile readable storage medium according to claim 26, wherein, for said training the random noise for the target expanded text through the generator and the discriminator until the discriminator cannot distinguish the sentence sample obtained after training the random noise from the target expanded text, the processor is configured to implement the following steps:
    for the target expanded text, inputting the random noise into the generator to generate the sentence sample obtained after training the random noise;
    inputting the sentence sample and the target expanded text into the discriminator, so that the discriminator performs a convolution operation and a pooling operation on the sentence sample and the target expanded text, extracts sentence sample feature information of the sentence sample and real text feature information of the target expanded text, and, by combining the sentence sample feature information and the real text feature information, determines whether the sentence sample and the target expanded text can be distinguished;
    obtaining a discrimination result output by the discriminator;
    if the discrimination result indicates that the discriminator can distinguish the sentence sample from the target expanded text, obtaining a loss function of the discriminator, inputting the loss function into the generator to generate a new sentence sample, and returning to the step of inputting the sentence sample and the target expanded text into the discriminator; otherwise, determining that the discriminator cannot distinguish the sentence sample from the target expanded text.
  28. The computer non-volatile readable storage medium according to any one of claims 22 to 27, wherein, for said generating N first expanded texts according to the N replacement words and the original text, the processor is configured to implement the following step:
    replacing the target candidate word in the original text with each of the N replacement words, based on position information of the target candidate word in the original text, to generate the N first expanded texts.
PCT/CN2019/117663 2019-04-28 2019-11-12 Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium WO2020220636A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910350209.3 2019-04-28
CN201910350209.3A CN110222707A (en) 2019-04-28 2019-04-28 A kind of text data Enhancement Method and device, electronic equipment

Publications (1)

Publication Number Publication Date
WO2020220636A1 (en)

Family

ID=67820173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117663 WO2020220636A1 (en) 2019-04-28 2019-11-12 Text data enhancement method and apparatus, electronic device, and non-volatile computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110222707A (en)
WO (1) WO2020220636A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632232A (en) * 2021-03-09 2021-04-09 北京世纪好未来教育科技有限公司 Text matching method, device, equipment and medium
CN112883724A (en) * 2021-02-03 2021-06-01 虎博网络技术(上海)有限公司 Text data enhancement processing method and device, electronic equipment and readable storage medium

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment
CN112487182B (en) * 2019-09-12 2024-04-12 华为技术有限公司 Training method of text processing model, text processing method and device
CN110782002B (en) * 2019-09-12 2022-04-05 成都四方伟业软件股份有限公司 LSTM neural network training method and device
CN111027312B (en) * 2019-12-12 2024-04-19 中金智汇科技有限责任公司 Text expansion method and device, electronic equipment and readable storage medium
CN111444326B (en) * 2020-03-30 2023-10-20 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN111695356A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Synonym corpus generation method, synonym corpus generation device, computer system and readable storage medium
CN111694826B (en) * 2020-05-29 2024-03-19 平安科技(深圳)有限公司 Data enhancement method and device based on artificial intelligence, electronic equipment and medium
CN111783451A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Method and apparatus for enhancing text samples
CN112183074A (en) * 2020-09-27 2021-01-05 中国建设银行股份有限公司 Data enhancement method, device, equipment and medium
CN112906392B (en) * 2021-03-23 2022-04-01 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN113657093A (en) * 2021-07-12 2021-11-16 广东外语外贸大学 Grammar error correction data enhancement method and device based on real error mode
CN113627149A (en) * 2021-08-10 2021-11-09 华南师范大学 Classroom conversation evaluation method, system and storage medium
CN113779959B (en) * 2021-08-31 2023-06-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN114595327A (en) * 2022-02-22 2022-06-07 平安科技(深圳)有限公司 Data enhancement method and device, electronic equipment and storage medium
CN114912448B (en) * 2022-07-15 2022-12-09 山东海量信息技术研究院 Text extension method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100268725A1 (en) * 2009-04-20 2010-10-21 Microsoft Corporation Acquisition of semantic class lexicons for query tagging
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship
CN109522406A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Text semantic matching process, device, computer equipment and storage medium
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635116B (en) * 2018-12-17 2023-03-24 腾讯科技(深圳)有限公司 Training method of text word vector model, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
CN110222707A (en) 2019-09-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927188

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927188

Country of ref document: EP

Kind code of ref document: A1