CN114398855A - Text extraction method, system and medium based on fusion pre-training - Google Patents

Text extraction method, system and medium based on fusion pre-training

Info

Publication number
CN114398855A
Authority
CN
China
Prior art keywords
training
text
model
fusion
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210038607.3A
Other languages
Chinese (zh)
Inventor
林远平
甘伟超
喻广博
邹鸿岳
周靖宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuaique Information Technology Co ltd
Original Assignee
Beijing Kuaique Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuaique Information Technology Co ltd filed Critical Beijing Kuaique Information Technology Co ltd
Priority to CN202210038607.3A priority Critical patent/CN114398855A/en
Publication of CN114398855A publication Critical patent/CN114398855A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a text extraction method, system and medium based on fusion pre-training. The method comprises the following steps: acquiring a text to be extracted; performing pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors; selecting at least part of the character vectors to perform semantic extraction on adjacent text, and splicing to obtain semantic feature vectors; performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors; and performing flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively. Because the character vectors are encoded within a pre-training model framework, and at least part of the character vectors are fused to extract the semantics of adjacent text and thereby learn textual semantic information, the semantic learning capability is enhanced, the fuzzy-boundary problem in the final word segmentation result is effectively avoided, and the accuracy of text extraction is improved.

Description

Text extraction method, system and medium based on fusion pre-training
Technical Field
The invention relates to the technical field of computers, in particular to a text extraction method, a text extraction system and a text extraction medium based on fusion pre-training.
Background
Text information extraction is a relatively mature algorithmic technique in the field of deep learning and has been successfully applied to various business scenarios. However, in the financial field, and especially in the currency field, existing text extraction methods suffer from boundary problems: given a message such as "1Y 0000013.097540005.29 +0A fund TO B fund", the numeric text "3.0975" may be extracted only as "3.09", or the numeric text "4000" only as "400", so the accuracy of text extraction is not high enough.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, the present invention provides a method, a system and a medium for text extraction based on fusion pre-training, which aims to improve the accuracy of text extraction.
The technical scheme of the invention is as follows:
a text extraction method based on fusion pre-training comprises the following steps:
acquiring a text to be extracted;
performing pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors;
selecting at least part of the character vectors to perform semantic extraction on adjacent text, and splicing to obtain semantic feature vectors;
performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors;
and performing flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively.
In an embodiment, before the pre-training encoding is performed on the text to be extracted through the pre-training model to obtain the corresponding character vectors, the method further includes:
performing adversarial training on the pre-training model.
In one embodiment, the adversarial training of the pre-training model comprises:
constructing an adversarial sample, and adding the adversarial sample into the input embedding layer of the pre-training model as a perturbation;
and performing adversarial training on the pre-training model according to the adversarial sample to update the model parameters, the adversarial training ending when the number of updates reaches a preset number.
In one embodiment, constructing the adversarial sample specifically comprises:
calculating the adversarial sample according to the following formulas:

g_adv = ∇_δ L(f_θ(X + δ_{t-1}), y)

δ_t = Π_{‖δ‖_F ≤ ε} (δ_{t-1} + α · g_adv / ‖g_adv‖_F)

wherein g_adv represents the gradient with respect to the perturbation during adversarial training, X represents the input information, y represents the label information, δ_{t-1} represents the magnitude of the perturbation at step t-1, f_θ represents the output of the pre-training model, L represents the loss function, ∇_δ denotes taking the gradient of the loss function with respect to the perturbation, α represents the learning rate (step size), ‖·‖_F is the Frobenius norm, g_t represents the gradient of the pre-training model at step t, and Π denotes the projection that keeps the perturbation within the ball of radius ε.
In an embodiment, performing adversarial training on the pre-training model according to the adversarial sample to update the model parameters, and ending the adversarial training when the number of updates reaches a preset number, specifically includes:
after the pre-training model is perturbed according to the adversarial sample, accumulating the gradient of the parameters θ according to the formula

g_t = g_{t-1} + (1/K) · E[∇_θ L(f_θ(X + δ_{t-1}), y)]

wherein K represents the number of gradient ascent steps, E represents the mathematical expectation, g_{t-1} is the gradient of the pre-training model at step t-1, and ∇_θ denotes taking the gradient of the loss function with respect to the parameters;
and updating the parameters of the pre-training model according to the accumulated gradient, the adversarial training ending when the number of updates reaches the preset number.
In one embodiment, the selecting at least part of the character vectors to perform semantic extraction on the neighboring texts and obtaining semantic feature vectors by splicing includes:
selecting coding layers at a plurality of preset positions in the pre-training model as target coding layers;
respectively inputting the output results of the target coding layer into text classification models which are connected in a one-to-one correspondence manner to perform semantic extraction of adjacent texts, wherein the number of the text classification models is the same as that of the target coding layer, and the sizes of the kernels of the text classification models are different;
and performing fusion splicing on the extraction result of each text classification model to obtain the semantic feature vector.
In one embodiment, the performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors specifically includes:
performing feature selection on the semantic feature vectors through a full connection layer and fusing them to obtain the effective word feature vectors, wherein the input of the full connection layer is F_input and the output is F_output:

F_input = concat(E_1, E_2, …, E_i, …, E_n),
F_output = softmax(F_input) = softmax(concat(E_1, E_2, …, E_i, …, E_n)),

wherein E_i is the output result of the i-th target coding layer and n is the number of target coding layers.
In one embodiment, the kernel size of the text classification model is 3-7.
A text extraction system based on fusion pre-training, the system comprising at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text extraction method based on fusion pre-training described above.
A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method for text extraction based on fusion pre-training described above.
Advantageous effects: compared with the prior art, in the text extraction method, system and medium based on fusion pre-training provided by the invention, character vectors are encoded within a pre-training model framework, and at least part of the character vectors are fused to perform semantic extraction on adjacent text and thereby learn textual semantic information; the semantic learning capability is thus enhanced, the fuzzy-boundary problem in the final word segmentation result is effectively avoided, and the accuracy of text extraction is improved.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a text extraction method based on fusion pre-training according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model framework of a text extraction method based on fusion pre-training according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of functional modules of a text extraction device based on fusion pre-training according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hardware structure of a text extraction system based on fusion pre-training according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and effects of the present invention clearer, the present invention is described in further detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating a text extraction method based on fusion pre-training according to an embodiment of the present invention. The text extraction method based on fusion pre-training provided by this embodiment is suitable for automatically identifying the counterparty during a transaction. As shown in fig. 1, the method specifically includes the following steps:
and S100, acquiring a text to be extracted.
In this embodiment, the text to be extracted may be the text of a trading conversation in a current bond or securities trading process, for example order information or enquiry information sent between different trading institutions; the text of the trading conversation is acquired as the text to be extracted so as to automate text extraction and improve the efficiency of financial information identification.
S200, performing pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors.
The pre-training model is trained on large-scale corpora and can achieve good results on downstream tasks after task-specific fine-tuning. In this embodiment, therefore, pre-training encoding is performed on the text to be extracted through a pre-training model to obtain the corresponding character vectors. Specifically, this embodiment preferably adopts the BERT pre-training model for character encoding. BERT is a pre-trained language representation model; rather than using a traditional unidirectional language model, or shallowly concatenating two unidirectional language models as in the prior art, it uses a masked language model (MLM) to produce deep bidirectional language representations: for the input text, several characters are randomly masked with a certain probability, and the BERT model is pre-trained to predict the masked characters, yielding a vector encoding for each character. In other embodiments, a pre-training model such as ALBERT or RoBERTa may also be used for pre-training encoding, which is not limited in this embodiment.
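For illustration only (this sketch is not part of the original disclosure), per-character vectors of the kind described above can be obtained from a BERT-family checkpoint through the Hugging Face transformers library; the checkpoint name bert-base-chinese and the request for all hidden layers are assumptions made for the example.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Assumed setup: a Chinese BERT checkpoint; any BERT-family model would do.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

text = "1Y 0000013.097540005.29 +0A fund TO B fund"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

char_vectors = outputs.last_hidden_state   # (1, seq_len, hidden_size) per-token vectors
all_layers = outputs.hidden_states         # tuple: embeddings plus every encoder layer
print(char_vectors.shape, len(all_layers))
```

The hidden_states tuple exposes every encoder layer, which is what the layer-selection step described later would draw on.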
In one embodiment, before step S200, the method further comprises:
and carrying out countermeasure training on the pre-training model.
In this embodiment, before character encoding is performed on a text to be extracted, a pre-training model is combined with a confrontation training learning method to improve the robustness and accuracy of the model as much as possible, and specifically, a confrontation training algorithm such as FreeLB, FGM, PGD, and the like may be selected, which is not limited in this embodiment.
In one embodiment, the adversarial training of the pre-training model comprises:
constructing an adversarial sample, and adding the adversarial sample into the input embedding layer of the pre-training model as a perturbation;
and performing adversarial training on the pre-training model according to the adversarial sample to update the model parameters, the adversarial training ending when the number of updates reaches a preset number.
In this embodiment, adversarial training is an important way to enhance the robustness of the model. During adversarial training, adversarial samples are constructed and added to the input embedding layer of the pre-training model as perturbations, so that the model's inputs are mixed with small disturbances; the perturbed adversarial samples attack the model, and the model is trained to still recognize their true labels. That is, during training the pre-training model is adversarially trained on the adversarial samples so that it adapts to the perturbation and updates its parameters until the adversarial training ends, which improves the robustness of the model against adversarial samples while also improving its performance and generalization ability to a certain extent.
In a specific implementation, FreeLB is adopted for adversarial training, and the perturbation used to attack the weights of the pre-training model is calculated as follows:

g_adv = ∇_δ L(f_θ(X + δ_{t-1}), y)

δ_t = Π_{‖δ‖_F ≤ ε} (δ_{t-1} + α · g_adv / ‖g_adv‖_F)

wherein g_adv represents the gradient with respect to the perturbation during adversarial training, X represents the input information, y represents the label information, δ_{t-1} represents the magnitude of the perturbation at step t-1, f_θ represents the output of the pre-training model, L represents the loss function, ∇_δ denotes taking the gradient of the loss function with respect to the perturbation, α represents the learning rate (step size), ‖·‖_F is the Frobenius norm, g_t represents the gradient of the pre-training model at step t, and Π denotes the projection that keeps the perturbation within the ball of radius ε.
After the pre-training model has been perturbed according to the adversarial sample, the gradient of the parameters θ is accumulated according to the formula

g_t = g_{t-1} + (1/K) · E[∇_θ L(f_θ(X + δ_{t-1}), y)]

wherein K represents the number of gradient ascent steps, E represents the mathematical expectation, g_{t-1} is the gradient of the pre-training model at step t-1, and ∇_θ denotes taking the gradient of the loss function with respect to the parameters.
After the accumulated gradient has been obtained, the parameters of the pre-training model are updated, and the adversarial training ends when the number of updates reaches the preset number. This noise-injecting training scheme, i.e. adversarial training, regularizes the model parameters and thereby improves the robustness and generalization ability of the model.
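To make the FreeLB procedure above concrete, the following is a minimal PyTorch sketch of one adversarial training step, written as one possible reading of the formulas rather than the patented implementation; the hyper-parameter values and the assumption that model maps perturbed embeddings directly to logits are illustrative choices.

```python
import torch

def freelb_step(model, loss_fn, input_embeds, labels,
                K=3, alpha=1e-2, epsilon=1e-1):
    """One FreeLB-style step: run K adversarial ascent steps on the input
    embeddings while accumulating (1/K of) the parameter gradients each time."""
    delta = torch.zeros_like(input_embeds).uniform_(-epsilon, epsilon)
    delta.requires_grad_()
    for _ in range(K):
        logits = model(input_embeds + delta)      # f_theta(X + delta_{t-1})
        loss = loss_fn(logits, labels) / K        # 1/K factor from the accumulation formula
        loss.backward()                           # adds this step's gradient to the parameters
        g_adv = delta.grad.detach()               # gradient with respect to the perturbation
        # ascend on the perturbation, then project it back into the epsilon-ball
        delta = (delta + alpha * g_adv / (g_adv.norm() + 1e-12)).detach()
        if delta.norm() > epsilon:
            delta = delta * (epsilon / delta.norm())
        delta.requires_grad_()
    # the caller then applies optimizer.step() and optimizer.zero_grad()
```

The projection here uses a single global Frobenius norm for simplicity; a per-example projection, as in the original FreeLB paper, is an equally valid reading.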
S300, selecting at least part of the character vectors to carry out semantic extraction on the adjacent texts, and splicing to obtain semantic feature vectors.
In this embodiment, a further step follows the encoding of the character vectors: because BERT-series pre-training models build their embeddings character by character, vocabulary-level semantic information in the Chinese context can be lost, which also causes the boundary problem in text extraction, such as the digital text "3.0975" being extracted only as "3.09". To avoid this boundary problem, in this embodiment at least part of the encoding results of the pre-training model are selected for semantic extraction on adjacent text, i.e. the semantic relationship between adjacent characters is learned to better capture local correlation, thereby avoiding the boundary problem caused by single characters and improving the accuracy of text extraction.
In one embodiment, step S300 includes:
selecting coding layers at a plurality of preset positions in the pre-training model as target coding layers;
respectively inputting the output results of the target coding layer into text classification models which are connected in a one-to-one correspondence manner to perform semantic extraction of adjacent texts, wherein the number of the text classification models is the same as that of the target coding layer, and the sizes of the kernels of the text classification models are different;
and performing fusion splicing on the extraction result of each text classification model to obtain the semantic feature vector.
In this embodiment, the pre-training model usually contains multiple coding layers, i.e. a hidden-layer structure made up of multiple Transformer layers. Because the data features captured by the output hidden vectors become finer as the coding layer gets higher, prior-art BERT-like pre-training models output only the last (highest) coding layer. In this embodiment, in order to learn short-distance semantic features, the coding layers at a plurality of preset positions are selected as target coding layers; specifically, the layers lying in the last 25%-50% of all coding layers are selected. For example, when the number of coding layers (Transformer layers) in the pre-training model is 12, the last 3 to 6 layers are selected (i.e. 3 to 6 layers counted from the last layer), and when the number of coding layers is 18, the last 5 to 9 layers are selected.
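Purely as an illustration of the 25%-50% rule described above, a hypothetical helper (the function name and its fraction parameter are not part of the disclosure) could be:

```python
def select_target_layers(num_layers: int, fraction: float = 0.5) -> list[int]:
    """Indices (1-based) of the trailing encoder layers used as target coding layers.
    With num_layers=12 and fraction=0.5 this returns layers 7-12, i.e. the last 6;
    with num_layers=18 and fraction=0.5 it returns the last 9 layers."""
    count = max(1, round(num_layers * fraction))
    return list(range(num_layers - count + 1, num_layers + 1))
```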
A text classification model is connected behind each selected target coding layer. In this embodiment a convolutional text classification model, TextCNN, is adopted: for example, when 12 Transformer layers are used, the last 6 Transformer layers are selected as target coding layers and a TextCNN module is connected behind each of these 6 layers to perform semantic extraction on adjacent text. In order to better capture local correlation, the kernel sizes of the TextCNNs in this embodiment differ from one another and are preferably set to 3-7. Because a TextCNN module can learn the semantic relationship between characters whose distance is within the kernel size, setting the kernel size amounts to setting the learning range of the TextCNN model: when learning the relationship between characters, the distance cannot exceed the kernel size. By setting kernels of different sizes, the model can therefore learn textual semantic information from multiple angles, which increases its generalization ability and semantic comprehension and improves its boundary recognition ability.
After semantic extraction on adjacent text by the TextCNNs, the extraction results output by each TextCNN are fused and spliced by means of vector fusion to obtain the semantic feature vectors: the n hidden_size-dimensional vectors are converted into one hidden_size-dimensional feature vector, where n is the number of target coding layers. This fusion lets the model retain what the different TextCNNs have learned from different angles, further strengthening the model's ability to learn semantics.
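The following PyTorch sketch shows one plausible arrangement of these TextCNN heads and the splicing step, under stated assumptions: the class names, the kernel-size list (3-7) and the use of 'same' padding are illustrative choices, not the patented design; the reduction back to a single hidden_size vector is left to the full connection layer described in the next step.

```python
import torch
import torch.nn as nn

class TextCNNHead(nn.Module):
    """One TextCNN head: a 1-D convolution over the character axis with a fixed
    kernel size, keeping the hidden_size dimension unchanged."""
    def __init__(self, hidden_size: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size, padding="same")
        self.act = nn.ReLU()

    def forward(self, x):                     # x: (batch, seq_len, hidden_size)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, seq_len)
        return self.act(self.conv(x)).transpose(1, 2)

class AdjacentSemanticFusion(nn.Module):
    """Apply one TextCNN head (each with a different kernel size) per target
    coding layer, then splice the n outputs along the feature dimension."""
    def __init__(self, hidden_size: int, kernel_sizes=(3, 4, 5, 6, 7)):
        super().__init__()
        self.heads = nn.ModuleList(
            [TextCNNHead(hidden_size, k) for k in kernel_sizes])

    def forward(self, target_layer_outputs):  # list of n tensors (batch, seq, hidden)
        feats = [head(h) for head, h in zip(self.heads, target_layer_outputs)]
        return torch.cat(feats, dim=-1)       # spliced semantic features (batch, seq, n*hidden)
```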
S400, performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors.
In this embodiment, after short-distance semantic extraction has been carried out by fusing text classification models onto part of the coding layers, the semantic feature vectors obtained by fusion and splicing are passed through a full connection layer for feature selection, and the effective word feature vectors are obtained by selection and fusion.
In one embodiment, step S400 includes:
performing feature selection on the semantic feature vectors through a full connection layer and fusing them to obtain the effective word feature vectors, wherein the input of the full connection layer is F_input and the output is F_output:

F_input = concat(E_1, E_2, …, E_i, …, E_n),
F_output = softmax(F_input) = softmax(concat(E_1, E_2, …, E_i, …, E_n)),

wherein E_i is the output result of the i-th target coding layer and n is the number of target coding layers.
In this embodiment, the concatenated output concat(E_1, E_2, …, E_n) of the text classification models attached to the target coding layers is passed through the full connection layer for feature selection; specifically, the result is weighted through a softmax function, and the most effective word features are selected.
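A minimal sketch of this full-connection feature-selection step, assuming the spliced tensor from the previous module as input; interpreting the softmax output as soft selection weights over the hidden dimensions is one reading of the formula above, not a statement of the actual implementation.

```python
import torch
import torch.nn as nn

class EffectiveWordSelector(nn.Module):
    """Full connection layer followed by softmax: maps the concatenated semantic
    features back to hidden_size and normalizes them into selection weights."""
    def __init__(self, hidden_size: int, n_layers: int):
        super().__init__()
        self.fc = nn.Linear(hidden_size * n_layers, hidden_size)

    def forward(self, f_input):                          # (batch, seq, n_layers * hidden)
        return torch.softmax(self.fc(f_input), dim=-1)   # (batch, seq, hidden) effective word features
```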
S500, performing flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively.
In this embodiment, the downstream tasks are decoded in separate streams based on the output of the full connection layer, so that text extraction can be performed efficiently while entity recognition is achieved, yielding a word segmentation result and an entity recognition result. Specifically, the effective word feature vectors are respectively fed into a trained entity recognition task layer and a trained word segmentation task layer. For the entity recognition task, long-distance semantic features are re-extracted from the output of the full connection layer through an LSTM (Long Short-Term Memory) network, whose output serves as the input of the decoding layer of the entity recognition task; the decoding layer uses a CRF (conditional random field) to predict entity labels and finally outputs the corresponding entity tags. For the word segmentation task, the output of the full connection layer is decoded by a CRF decoder, which outputs a character tag for each character covered by the effective word feature vectors to give the word segmentation result. The character tags comprise an entity-start tag, an entity-continuation tag and a non-entity tag; for example, for the text "A debt B institution gives C institution", the final tagging result is "B I O B I I I B I I I", where "B" is the entity-start tag, "I" is the entity-continuation tag (i.e. positions inside an entity other than the start), and "O" is the non-entity tag, here assigned to the blank space. Words in a sentence can be segmented well through this B/I/O form, so that the model learns how to split the sentence, achieving accurate text segmentation and extraction.
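For illustration, a compact PyTorch sketch of the two decoding branches described above; it assumes the third-party pytorch-crf package for the CRF layers, and the layer widths and the single-direction LSTM are assumptions rather than the patented configuration.

```python
import torch.nn as nn
from torchcrf import CRF   # third-party package "pytorch-crf" (assumed available)

class SplitStreamDecoder(nn.Module):
    """Two decoding branches over the effective word features:
    LSTM + CRF for entity labels, and a CRF over B/I/O emissions for segmentation."""
    def __init__(self, hidden_size: int, num_entity_tags: int, num_seg_tags: int = 3):
        super().__init__()
        # entity-recognition branch: LSTM re-extracts long-distance features, CRF decodes
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.ner_emissions = nn.Linear(hidden_size, num_entity_tags)
        self.ner_crf = CRF(num_entity_tags, batch_first=True)
        # word-segmentation branch: CRF over B/I/O emissions
        self.seg_emissions = nn.Linear(hidden_size, num_seg_tags)
        self.seg_crf = CRF(num_seg_tags, batch_first=True)

    def forward(self, word_feats):                    # (batch, seq, hidden)
        lstm_out, _ = self.lstm(word_feats)
        ner_tags = self.ner_crf.decode(self.ner_emissions(lstm_out))
        seg_tags = self.seg_crf.decode(self.seg_emissions(word_feats))
        return seg_tags, ner_tags                     # lists of per-sequence tag ids
```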
In order to better understand the implementation process of the text extraction method based on the fusion pre-training provided by the present invention, the following introduces the text extraction process based on the fusion pre-training provided by the present invention with reference to the specific model structure in fig. 2:
As shown in fig. 2, the text to be extracted, "A debt B machine … structure", is obtained. First, the input text is character-vectorized through the BERT pre-training model to obtain character (word) vectors of fixed dimension; in addition, FreeLB adversarial training is applied at the input embedding layer of the pre-training model to perturb the input embeddings and increase the robustness of the model. The BERT pre-training model here uses a 12-layer Transformer structure. In order to learn short-distance semantic features, the semantic feature selection module selects the last 6 Transformer layers and fuses a TextCNN module onto each of them to perform semantic extraction on adjacent text, i.e. a TextCNN with a different kernel size is connected after each of the last 6 layers to extract the key information in the sentence, so that the model learns textual semantic information from multiple angles and its generalization and boundary recognition abilities improve. A vector fusion module then fuses the outputs of the TextCNNs, converting the 6 hidden_size-dimensional vectors into one hidden_size-dimensional feature vector. After the semantic feature selection module, the spliced vectors are passed through a full connection layer for feature selection, and the most effective word features are selected and fused. Flow division decoding is then performed on the output of the full connection layer: in the word segmentation task, the output of the full connection layer is decoded and tagged by a CRF decoder to obtain a B/I/O character tagging that segments words accurately, the tagging of "A debt B machine …" being "B, I, B, I, O"; in the entity recognition task, the output of the full connection layer is decoded through the LSTM and then the CRF to obtain entity tags, the tagging of "A debt B machine …" being "B-BN, I-BN, B-ORG, I-ORG, O", one tag per character, where BN and ORG are different entity labels, BN denoting a bond entity and ORG denoting an organization entity. In this way accurate word segmentation is achieved at the same time as entity recognition and extraction, which improves the extraction accuracy.
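As a usage-level illustration only (not part of the disclosure), a hypothetical helper can fold B/I/O entity tags like those above back into extracted spans:

```python
def tags_to_spans(chars, tags):
    """Collect (entity_type, text) spans from per-character tags such as
    ["B-BN", "I-BN", "B-ORG", "I-ORG", "O"]."""
    spans, current, ent_type = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ent_type, "".join(current)))
            current, ent_type = [ch], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(ch)
        else:                       # "O" or a stray "I-" ends the current span
            if current:
                spans.append((ent_type, "".join(current)))
            current, ent_type = [], None
    if current:
        spans.append((ent_type, "".join(current)))
    return spans

# e.g. tags_to_spans(list("A债B机构"), ["B-BN", "I-BN", "B-ORG", "I-ORG", "I-ORG"])
# -> [("BN", "A债"), ("ORG", "B机构")]
```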
Another embodiment of the present invention provides a text extraction device based on fusion pre-training, as shown in fig. 3, the device includes:
the acquisition module 11 is used for acquiring a text to be extracted;
the pre-training module 12 is used for performing pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors;
the semantic extraction module 13 is used for selecting at least part of the character vectors to carry out semantic extraction on the adjacent texts and splicing to obtain semantic feature vectors;
the fusion module 14 is used for performing feature selection on the semantic feature vectors and fusing the semantic feature vectors to obtain effective word feature vectors;
and the segmentation recognition module 15 is used for performing flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively.
The acquisition module 11, the pre-training module 12, the semantic extraction module 13, the fusion module 14 and the segmentation recognition module 15 are connected in sequence. A module in the present invention refers to a series of computer program instruction segments capable of completing a specific function, which describes the execution process of text extraction based on fusion pre-training better than a program would; for the specific implementation of each module, reference is made to the corresponding method embodiment above, and details are not repeated here.
Another embodiment of the present invention provides a text extraction system based on fusion pre-training, as shown in fig. 4, the system 10 includes:
one or more processors 110 and a memory 120; one processor 110 is illustrated in fig. 4. The processor 110 and the memory 120 may be connected by a bus or other means; fig. 4 takes a bus connection as an example.
Processor 110 is used to implement various control logic for system 10, which may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a single chip, an ARM (Acorn RISC machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. Also, the processor 110 may be any conventional processor, microprocessor, or state machine. Processor 110 may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The memory 120, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions corresponding to the text extraction method based on fusion pre-training in the embodiments of the present invention. By executing the non-volatile software programs, instructions and units stored in the memory 120, the processor 110 executes the various functional applications and data processing of the system 10, i.e. implements the text extraction method based on fusion pre-training in the above method embodiments.
The memory 120 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the system 10, and the like. Further, the memory 120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 120 optionally includes memory located remotely from processor 110, which may be connected to system 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more units are stored in the memory 120, and when executed by the one or more processors 110, perform the text extraction method based on fusion pre-training in any of the above-described method embodiments, e.g., performing the above-described method steps S100 to S500 in fig. 1.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100-S500 of fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SynchLink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The disclosed memory components or memory of the operating environment described herein are intended to comprise one or more of these and/or any other suitable types of memory.
In summary, in the text extraction method, system and medium based on fusion pre-training disclosed by the invention, the method acquires a text to be extracted; performs pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors; selects at least part of the character vectors to perform semantic extraction on adjacent text and splices the results to obtain semantic feature vectors; performs feature selection on the semantic feature vectors and fuses them to obtain effective word feature vectors; and performs flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively. Because the character vectors are encoded within a pre-training model framework, and at least part of the character vectors are fused to extract the semantics of adjacent text and thereby learn textual semantic information, the semantic learning capability is enhanced, the fuzzy-boundary problem in the final word segmentation result is effectively avoided, and the accuracy of text extraction is improved.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing relevant hardware (such as a processor, a controller, etc.) through a computer program, which may be stored in a non-volatile computer-readable storage medium, and the computer program may include the processes of the above method embodiments when executed. The storage medium may be a memory, a magnetic disk, a floppy disk, a flash memory, an optical memory, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A text extraction method based on fusion pre-training is characterized by comprising the following steps:
acquiring a text to be extracted;
performing pre-training encoding on the text to be extracted through a pre-training model to obtain corresponding character vectors;
selecting at least part of the character vectors to perform semantic extraction on adjacent text, and splicing to obtain semantic feature vectors;
performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors;
and performing flow division decoding on the effective word feature vectors to obtain a word segmentation result and an entity recognition result respectively.
2. The text extraction method based on fusion pre-training as claimed in claim 1, wherein before the pre-training encoding is performed on the text to be extracted through the pre-training model to obtain the corresponding character vectors, the method further comprises:
performing adversarial training on the pre-training model.
3. The text extraction method based on fusion pre-training as claimed in claim 2, wherein the performing adversarial training on the pre-training model comprises:
constructing an adversarial sample, and adding the adversarial sample into the input embedding layer of the pre-training model as a perturbation;
and performing adversarial training on the pre-training model according to the adversarial sample to update the model parameters, the adversarial training ending when the number of updates reaches a preset number.
4. The text extraction method based on fusion pre-training as claimed in claim 3, wherein the constructing of the adversarial sample specifically comprises:
calculating the adversarial sample according to the following formulas:

g_adv = ∇_δ L(f_θ(X + δ_{t-1}), y)

δ_t = Π_{‖δ‖_F ≤ ε} (δ_{t-1} + α · g_adv / ‖g_adv‖_F)

wherein g_adv represents the gradient with respect to the perturbation during adversarial training, X represents the input information, y represents the label information, δ_{t-1} represents the magnitude of the perturbation at step t-1, f_θ represents the output of the pre-training model, L represents the loss function, ∇_δ denotes taking the gradient of the loss function with respect to the perturbation, α represents the learning rate (step size), ‖·‖_F is the Frobenius norm, g_t represents the gradient of the pre-training model at step t, and Π denotes the projection that keeps the perturbation within the ball of radius ε.
5. The text extraction method based on fusion pre-training as claimed in claim 4, wherein the performing adversarial training on the pre-training model according to the adversarial sample to update the model parameters, the adversarial training ending when the number of updates reaches a preset number, specifically comprises:
after the pre-training model is perturbed according to the adversarial sample, accumulating the gradient of the parameters θ according to the formula

g_t = g_{t-1} + (1/K) · E[∇_θ L(f_θ(X + δ_{t-1}), y)]

wherein K represents the number of gradient ascent steps, E represents the mathematical expectation, g_{t-1} is the gradient of the pre-training model at step t-1, and ∇_θ denotes taking the gradient of the loss function with respect to the parameters;
and updating the parameters of the pre-training model according to the accumulated gradient, the adversarial training ending when the number of updates reaches the preset number.
6. The method for extracting text based on fusion pre-training as claimed in claim 1, wherein said selecting at least part of said character vectors to perform semantic extraction on neighboring text and concatenating to obtain semantic feature vectors comprises:
selecting coding layers at a plurality of preset positions in the pre-training model as target coding layers;
respectively inputting the output results of the target coding layer into text classification models which are connected in a one-to-one correspondence manner to perform semantic extraction of adjacent texts, wherein the number of the text classification models is the same as that of the target coding layer, and the sizes of the kernels of the text classification models are different;
and performing fusion splicing on the extraction result of each text classification model to obtain the semantic feature vector.
7. The text extraction method based on fusion pre-training as claimed in claim 6, wherein the performing feature selection on the semantic feature vectors and fusing to obtain effective word feature vectors specifically comprises:
performing feature selection on the semantic feature vectors through a full connection layer and fusing them to obtain the effective word feature vectors, wherein the input of the full connection layer is F_input and the output is F_output:

F_input = concat(E_1, E_2, …, E_i, …, E_n),
F_output = softmax(F_input) = softmax(concat(E_1, E_2, …, E_i, …, E_n)),

wherein E_i is the output result of the i-th target coding layer and n is the number of target coding layers.
8. The method of claim 1, wherein the kernel size of the text classification model is 3-7.
9. A text extraction system based on fusion pre-training, the system comprising at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text extraction method based on fusion pre-training of any one of claims 1-8.
10. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method for text extraction based on fusion pretraining of any one of claims 1-8.
CN202210038607.3A 2022-01-13 2022-01-13 Text extraction method, system and medium based on fusion pre-training Pending CN114398855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210038607.3A CN114398855A (en) 2022-01-13 2022-01-13 Text extraction method, system and medium based on fusion pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210038607.3A CN114398855A (en) 2022-01-13 2022-01-13 Text extraction method, system and medium based on fusion pre-training

Publications (1)

Publication Number Publication Date
CN114398855A true CN114398855A (en) 2022-04-26

Family

ID=81230087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210038607.3A Pending CN114398855A (en) 2022-01-13 2022-01-13 Text extraction method, system and medium based on fusion pre-training

Country Status (1)

Country Link
CN (1) CN114398855A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034302A (en) * 2022-06-07 2022-09-09 四川大学 Relation extraction method, device, equipment and medium for optimizing information fusion strategy
CN115034302B (en) * 2022-06-07 2023-04-11 四川大学 Relation extraction method, device, equipment and medium for optimizing information fusion strategy
CN115238670A (en) * 2022-08-09 2022-10-25 平安科技(深圳)有限公司 Information text extraction method, device, equipment and storage medium
CN115238670B (en) * 2022-08-09 2023-07-04 平安科技(深圳)有限公司 Information text extraction method, device, equipment and storage medium
CN115114439A (en) * 2022-08-30 2022-09-27 北京百度网讯科技有限公司 Method and device for multi-task model reasoning and multi-task information processing
CN116150698A (en) * 2022-09-08 2023-05-23 天津大学 Automatic DRG grouping method and system based on semantic information fusion
CN116150698B (en) * 2022-09-08 2023-08-22 天津大学 Automatic DRG grouping method and system based on semantic information fusion
CN116070638A (en) * 2023-01-03 2023-05-05 广东工业大学 Training updating method and system for Chinese sentence feature construction
CN116070638B (en) * 2023-01-03 2023-09-08 广东工业大学 Training updating method and system for Chinese sentence feature construction
CN115796189A (en) * 2023-01-31 2023-03-14 北京面壁智能科技有限责任公司 Semantic determination method, device, electronic equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination