CN112989041A - Text data processing method and device based on BERT - Google Patents

Text data processing method and device based on BERT

Info

Publication number
CN112989041A
Authority
CN
China
Prior art keywords
text data
fusion
processed
bert
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110261106.7A
Other languages
Chinese (zh)
Inventor
张诏泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202110261106.7A
Publication of CN112989041A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text data processing method and a text data processing device based on BERT, wherein the method comprises the following steps: acquiring text data to be processed; inputting an original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result, fused with context information, of each character or word in the text data to be processed; performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector; and inputting the fused feature vector into a machine classification model trained in advance, and outputting a classification result of the text data to be processed. The invention can greatly improve the accuracy of text classification. The text data processing method provided by the embodiment of the invention, applied to an auditing service system, can reduce the amount of invalid alarms and the labor cost.

Description

Text data processing method and device based on BERT
Technical Field
The invention relates to the field of natural language processing, in particular to a text data processing method and device based on BERT.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In the field of natural language processing, context-free models such as word2vec or GloVe generate one word vector representation for each word in the vocabulary. But the meaning of a word may vary greatly in different semantic contexts; e.g., the word "bank" may, depending on the context, mean a financial institution or a river bank. Representing such a word by the same vector in all contexts is clearly unreasonable.
Bidirectional Encoder Representations from Transformers (BERT) is a deep bidirectional, unsupervised language representation model pre-trained using only a plain-text corpus. The model fully considers the context in which words appear, and can thereby avoid the polysemy problem.
In a bank audit business system, a large amount of alarm data may appear. In the prior art, new alarm data is generally matched against historical alarm data, and whether to raise an alarm for the new data is determined according to the matched historical alarm data. Matching new alarm data with historical alarm data involves comparing text data, so accurately identifying the text content can greatly improve the alarm accuracy of the auditing service system and avoid a large number of invalid alarms.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a text data processing method based on BERT, which is used for solving the technical problems of low matching efficiency and more invalid alarms in the existing auditing service system and comprises the following steps: acquiring text data to be processed; inputting the original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result of fusion context information of each character or word in the text data to be processed; performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector; and inputting the fused feature vector into a machine classification model trained in advance, and outputting a classification result of the text data to be processed.
Further, the machine classification model is a binary classification model.
Further, the text data to be processed is alarm message data in an auditing service system; the classification results of the binary classification model comprise: valid alarms and invalid alarms.
Further, the method further comprises: acquiring first sample data; and learning a BERT model according to the first sample data, and training to obtain the BERT language model.
Further, the method further comprises: acquiring second sample data; and learning a neural network model according to the second sample data, and training to obtain the machine classification model.
Further, a multi-modal feature fusion method is adopted to perform feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed, so as to obtain a fused feature vector.
Further, the multi-modal feature fusion method includes any one of: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
Further, the multi-modal feature fusion method includes any one of: data level fusion, decision level fusion and combination fusion.
The embodiment of the invention also provides a text data processing device based on BERT, which is used for solving the technical problems of low matching efficiency and more invalid alarms in the existing auditing service system, and comprises: the text data acquisition module is used for acquiring text data to be processed; the BERT language model prediction module is used for inputting the original word vector of the text data to be processed into a pre-trained BERT language model and outputting a vector representation result of fusion context information of each character or word in the text data to be processed; the feature fusion module is used for performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector; and the machine classification module is used for inputting the fused feature vectors into a machine classification model trained in advance and outputting the classification result of the text data to be processed.
Further, the machine classification model is a binary classification model.
Further, the text data to be processed is alarm message data in an auditing service system; the classification results of the binary classification model comprise: valid alarms and invalid alarms.
Further, the apparatus further comprises: the first sample acquisition module is used for acquiring first sample data; and the BERT language model training module is used for learning the BERT model according to the first sample data and training to obtain the BERT language model.
Further, the apparatus further comprises: the second sample data acquisition module is used for acquiring second sample data; and the machine classification model learning module is used for learning the neural network model according to the second sample data and training to obtain the machine classification model.
Further, the feature fusion module adopts a multi-modal feature fusion method to perform feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed, so as to obtain a fused feature vector.
Further, the multi-modal feature fusion method includes any one of: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
Further, the multi-modal feature fusion method includes any one of: data level fusion, decision level fusion and combination fusion.
The embodiment of the invention also provides electronic equipment for solving the technical problems of low matching efficiency and more invalid alarms in the existing auditing service system.
The embodiment of the invention also provides a computer readable storage medium, which is used for solving the technical problems of low matching efficiency and more invalid alarms in the existing auditing service system.
According to the text data processing method and device based on BERT, the computer equipment and the computer readable storage medium provided by the embodiment of the invention, after the text data to be processed is obtained, the original word vector of the text data to be processed is input into a pre-trained BERT language model, a vector representation result of the fused context information of each character or word in the text data to be processed is output, and then the vector representation result output by the BERT language model is subjected to feature fusion with other structural features except the context information in the text data to be processed to obtain a fused feature vector; finally, the fused feature vector is input into a machine classification model trained in advance, and a classification result of the text data to be processed is output.
By the embodiment of the invention, the accuracy of text classification can be greatly improved. The text data processing method provided by the embodiment of the invention is applied to an auditing service system, and can reduce invalid alarm amount and labor cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
fig. 1 is a flowchart of a text data processing method based on BERT according to an embodiment of the present invention;
FIG. 2 is a flowchart of a BERT language model training process provided in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a machine classification model training process according to an embodiment of the present invention;
fig. 4 is a flow chart of risk determination of context information of a packet according to a field value in an embodiment of the present invention;
fig. 5 is a schematic diagram of Tokenization encoding of text data of a message provided in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a process for constructing a neural network model for text classification according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a problem data set provided in an embodiment of the present invention;
FIG. 8 is a code implementation diagram of selecting a question text from a question dataset as provided in an embodiment of the present invention;
FIG. 9 is a diagram illustrating a text classification interpretation result provided in an embodiment of the present invention;
fig. 10 is a schematic diagram illustrating a visualization of a classification result provided in an embodiment of the present invention;
FIG. 11 is a flow chart of the process of BERT-based alert text data provided in the embodiment of the present invention;
FIG. 12 is a schematic diagram of a hierarchical multi-modal fusion provided in an embodiment of the present invention;
FIG. 13 is a schematic diagram of a method for partitioning multi-modal fusion according to fusion types provided in an embodiment of the present invention;
fig. 14 is a schematic diagram of a BERT-based text data processing apparatus according to an embodiment of the present invention;
fig. 15 is a schematic diagram of an electronic device provided in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Before the embodiments of the present invention are introduced, the implementation principle of BERT is first explained:
To better consider the contextual associations of words, the BERT pre-training process involves two different pre-training tasks: the Masked Language Model (MLM) and the Next Sentence Prediction (NSP) task.
MLM trains a bidirectional language model by randomly masking some words (replacing them with the uniform marker [MASK]) and then predicting those masked words, so that the representation of each word references its context information.
NSP introduces a next-sentence prediction task in order to train a model that understands the relationships between sentences. The corpus for this task can be generated by extracting sentence pairs A and B from the corpus, where with 50% probability B is the actual next sentence of A, and with 50% probability B is a random sentence from the corpus.
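As an illustration only (not part of the patented method), the construction of one MLM/NSP training example can be sketched in a few lines of Python. Real BERT additionally replaces about 10% of the selected positions with random words and leaves about 10% unchanged, which this simplified sketch omits:

```python
import random

def make_mlm_nsp_example(sentences, i, mask_prob=0.15):
    """Build one (masked tokens, original tokens, is_next) training example."""
    a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        b, is_next = sentences[i + 1], 1          # B really follows A (NSP positive)
    else:
        b, is_next = random.choice(sentences), 0  # B is a random sentence (negative)
    tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    specials = {"[CLS]", "[SEP]"}
    # MLM: mask roughly 15% of the non-special tokens.
    masked = [("[MASK]" if t not in specials and random.random() < mask_prob else t)
              for t in tokens]
    return masked, tokens, is_next
```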
In addition, a fine-tuning method is also used. A fine-tuning stage is added after BERT pre-training: language model training is carried out directly on a deep Transformer network, and after convergence the model is fine-tuned on the downstream target task, so there is no need to design a task-specific network for the target task and train it from scratch.
Static word representation: massive amounts of unlabeled text data are used to train low-dimensional word representation vectors, i.e., word embeddings. However, such embeddings are static: after training, they do not change with new contexts, so it is difficult to handle polysemy, since the meaning of a word depends on its context.
Dynamic word representation: by considering both the left-to-right and the right-to-left order, pre-trained and fine-tuned BERT refreshed the records of 11 NLP task leaderboards, including GLUE, SQuAD and SWAG.
For example, the following two sentences:
Sentence 1: Apple sells cell phones.
Sentence 2: i eat an apple.
Static word embeddings cannot distinguish the semantic difference between the two occurrences of "apple", while dynamic word embeddings give different vector representations depending on the context.
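This difference can be checked directly. The following minimal sketch assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is named by the patent; the contextual vector of "apple" differs between the two sentences, which a single static embedding cannot express:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    # Return the contextual vector of the first occurrence of `word`.
    enc = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index(word)
    with torch.no_grad():
        out = model(**enc).last_hidden_state  # shape (1, seq_len, 768)
    return out[0, idx]

v1 = word_vector("apple sells cell phones.", "apple")
v2 = word_vector("i eat an apple.", "apple")
sim = torch.cosine_similarity(v1, v2, dim=0)
print(f"cosine similarity of the two 'apple' vectors: {sim.item():.3f}")
```

A static embedding would return exactly the same vector in both sentences (similarity 1.0), while the contextual vectors differ.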
Classification algorithm SVM: a decision plane is defined as the decision boundary that separates sets of objects belonging to different classes. With the help of support vectors, the SVM classifies at this hyperplane and maximizes the margin between the two classes. Hyperplane learning in an SVM is accomplished by translating the problem into a linear algebra problem.
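A toy illustration of margin maximization with scikit-learn on synthetic data (illustrative only; the patent does not prescribe an SVM implementation):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two synthetic classes; the linear-kernel SVM learns the hyperplane
# that maximizes the margin between them.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)
print("predictions:", clf.predict(X[:5]))
```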
The embodiment of the present invention provides a text data processing method based on BERT, fig. 1 is a flowchart of the text data processing method based on BERT provided in the embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
s101, text data to be processed is obtained.
It should be noted that the text data to be processed acquired in S101 may be any text data to be classified, and may be, but is not limited to, alarm message data in an audit service system.
S102, inputting the original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result of each character or word in the text data to be processed fused with context information.
It should be noted that the input data of the BERT language model is an original vector representation of each character or word in the text data, and the output data is a vector representation result of the fusion context information of each character or word in the text data.
Because the BERT language model considers both the left-to-right and the right-to-left order, the context in which each word appears is fully considered, and the polysemy problem can be avoided. In the prior art, the unsupervised corpora used to train BERT language models are mostly derived from general Wikipedia text. In the embodiment of the invention, targeting the financial field, a BERT model more vertical to the financial field is trained with Chinese financial news data from past years, which improves the dynamic representation of proper nouns in financial scenarios and thus the accuracy of natural language processing in the financial field.
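A minimal sketch of such domain-adaptive pre-training, assuming the Hugging Face transformers API; the corpus file name, checkpoint and hyperparameters below are illustrative assumptions, not the patent's actual setup:

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# One sentence of financial-news text per line (hypothetical corpus file).
dataset = LineByLineTextDataset(tokenizer=tok,
                                file_path="financial_news.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-finance", num_train_epochs=1,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
```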
And S103, performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector.
In a specific implementation, in S103, a multi-modal feature fusion method may be adopted to perform feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed, so as to obtain a fused feature vector.
Optionally, the multi-modal feature fusion method adopted in the embodiment of the present invention includes any one of the following: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
Optionally, the multi-modal feature fusion method adopted in the embodiment of the present invention includes any one of the following: data level fusion, decision level fusion and combination fusion.
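As one concrete possibility (the patent leaves the choice among the fusion variants listed above open), feature-level fusion by simple concatenation might look like the following sketch; the dimensionality and the structured features are made-up stand-ins:

```python
import numpy as np

def fuse_features(bert_vec: np.ndarray, structured: list) -> np.ndarray:
    """Feature-level (early) fusion: concatenate the BERT text vector with
    normalized structured features into a single fused feature vector."""
    s = np.asarray(structured, dtype=np.float32)
    s = (s - s.mean()) / (s.std() + 1e-8)  # crude normalization
    return np.concatenate([bert_vec, s])

bert_vec = np.random.rand(768).astype(np.float32)  # stand-in for a BERT output
structured = [3.0, 1.0, 0.0, 7.5]                  # e.g. field counts, lengths
fused = fuse_features(bert_vec, structured)
print(fused.shape)  # (772,) -- ready for the downstream classifier
```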
And S104, inputting the fused feature vectors into a machine classification model trained in advance, and outputting a classification result of the text data to be processed.
It should be noted that, the machine classification model in the embodiment of the present invention may be a model for classifying text data obtained through machine learning training in advance, and different classification models may be obtained when training data is different. In one embodiment, the machine classification model employed in embodiments of the present invention may be a two-class model.
When the text data to be processed is alarm message data in the auditing service system, the classification results of the binary classification model may include: valid alarms and invalid alarms. By classifying the alarm text data and screening out the alarm text data classified as invalid alarms, invalid alarms can be greatly reduced.
In one embodiment, as shown in fig. 2, the BERT based text data processing method provided in the embodiment of the present invention is further configured to train a BERT language model by:
s201, acquiring first sample data;
and S202, learning and training the BERT model according to the first sample data to obtain the BERT language model.
In one embodiment, as shown in fig. 3, the BERT based text data processing method provided in the embodiment of the present invention is further used for training a machine classification model by:
s301, acquiring second sample data;
and S302, learning the neural network model according to the second sample data, and training to obtain a machine classification model.
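A minimal PyTorch sketch of S301-S302, training a small binary classifier on fused feature vectors; the input dimension, random data and labels are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AlarmClassifier(nn.Module):
    """Small feed-forward binary classifier over fused feature vectors."""
    def __init__(self, in_dim: int = 772):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # logits

# Second sample data: fused vectors with 0/1 labels (1 = valid alarm).
X = torch.randn(256, 772)
y = torch.randint(0, 2, (256,)).float()

model = AlarmClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print("final training loss:", loss.item())
```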
As can be seen from the above, in the text data processing method based on BERT provided in the embodiment of the present invention, after obtaining the text data to be processed, the original word vector of the text data to be processed is input into a BERT language model trained in advance, a vector representation result of the fused context information of each character or word in the text data to be processed is output, and then the vector representation result output by the BERT language model is feature-fused with other structural features except the context information in the text data to be processed, so as to obtain a fused feature vector; finally, the fused feature vector is input into a machine classification model trained in advance, and a classification result of the text data to be processed is output.
The text data processing method based on BERT provided by the embodiment of the invention can greatly improve the accuracy of text classification. The text data processing method provided by the embodiment of the invention is applied to an auditing service system, and can reduce invalid alarm amount and labor cost.
The existing auditing business system has the technical problems of low matching efficiency and many invalid alarms. To further improve the accuracy of the auditing system and reduce invalid alarms by advanced technical means, the embodiment of the invention uses BERT to model and learn the alarm audit data accumulated during service, converts the audit data into a simple 0/1 classification with a classification algorithm to judge whether to alarm, and then performs risk judgment on the alarm details generated by matching real-time service requests, in combination with field values and field context information.
Fig. 4 is a flow chart of risk determination for context information of a packet according to a field value provided in an embodiment of the present invention, as shown in fig. 4, specifically including:
1) Tokenization: encode the message text data, as shown in fig. 5 (see the tokenizer sketch after this list).
2) Allocate GPU resources.
3) Construct the text classification neural network model; the construction process is shown in fig. 6.
4) Train on the data.
5) Analyze the results.
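A hedged sketch of step 1), using a Hugging Face tokenizer; the patent does not specify the tokenizer implementation, and the checkpoint and message text below are invented:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
enc = tok("账户余额异常，触发审计告警",  # hypothetical alarm message text
          padding="max_length", truncation=True, max_length=32)
print(enc["input_ids"][:12])       # token ids, starting with [CLS]
print(enc["token_type_ids"][:12])  # segment ids
print(enc["attention_mask"][:12])  # 1 for real tokens, 0 for padding
```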
FIG. 7 shows a Stackoverflow question dataset that includes the question text and the technology type (sql, python, php, html, etc.) to which each question belongs.
The code shown in fig. 8 is used to select a question text from the question dataset (here the 1877th); the model in the embodiment of the present invention predicts that the question belongs to the sql type, and the actual label is also the sql type.
Further, viewing the model's interpretation for classifying this question text as the sql type; the interpretation result is shown in fig. 9.
Looking at the analysis, the model's top two category scores for the question text are sql and python. For this question text, the word "sql" contributes the most to classifying the text as the sql type: the model predicts the sql type with 100% probability, and if the word "sql" were removed from the text, the model would classify it as the sql type with a probability of only 100% - 65% = 35%. The word "sql" contributes negatively to the python type, while the word "range" contributes positively to classification as the python type.
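The per-word contribution analysis described above matches what local explanation tools such as LIME produce; the patent does not name the tool, so the following sketch is an assumption. Here question_text stands for the selected question and classifier_fn for a wrapper that maps a list of raw texts to an array of class probabilities:

```python
from lime.lime_text import LimeTextExplainer

class_names = ["sql", "python", "php", "html"]  # assumed label order
explainer = LimeTextExplainer(class_names=class_names)
exp = explainer.explain_instance(question_text, classifier_fn,
                                 num_features=10, labels=(0, 1))
print(exp.as_list(label=0))  # (word, weight) contributions for the sql class
```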
Fig. 10 is a schematic diagram of a visual presentation of a classification result provided in an embodiment of the present invention, and as shown in fig. 10, for the visual presentation of interpretability of the classification result, a position of a contributing word in a text and context information can be viewed.
Fig. 11 is a flow chart of processing alert text data based on BERT according to an embodiment of the present invention, and as shown in fig. 11, the flow chart specifically includes:
1) Training the language model: message data corresponding to non-real hits are used as samples to train the language model. The model takes the original word vector of each character/word in the text as input, and outputs a vector representation of each character/word after fusing full-text semantic information. During training, the model learns the context information and order of the characters/words in the text, so that it can finally predict and correct words according to the context information.
2) Feature fusion: and fusing the prediction result of the language model and other features except the context information in the alarm task data into a single feature vector by adopting a multi-mode feature fusion method, and inputting the single feature vector into a machine learning classifier. The samples adopted by the training of the machine classifier comprise historical alarm data and corresponding message data of about three to six months.
3) And (3) a classification algorithm: and establishing a classification model by adopting a machine learning two-classification algorithm, and classifying and judging the fused features.
It should be noted that common machine learning methods can be applied to multi-modal fusion. Multimodal fusion in embodiments of the present invention refers to integrating information from multiple modalities to accomplish classification or regression tasks. Each source or form of information may be referred to as a modality, such as image, video, audio, semantic, etc., and broadly, two different sets of data acquired under different conditions may also be referred to as two modalities.
As shown in fig. 12, according to the fusion level, multi-modal fusion can be divided into three categories: pixel level (fusion of the raw data), feature level (early fusion of abstract features), and decision level (late fusion of decision results).
As shown in fig. 13, according to the type of fusion, the multi-modal fusion can be further divided into: (a) fusing data levels; (b) judging level fusion; (c) and (4) combining and fusing.
Based on the same inventive concept, the embodiment of the present invention further provides a text data processing apparatus based on BERT, as described in the following embodiments. Because the principle of the device for solving the problems is similar to the BERT-based text data processing method, the implementation of the device can refer to the implementation of the BERT-based text data processing method, and repeated parts are not described again.
Fig. 14 is a schematic diagram of a BERT-based text data processing apparatus according to an embodiment of the present invention, and as shown in fig. 14, the apparatus includes: a text data acquisition module 141, a BERT language model prediction module 142, a feature fusion module 143, and a machine classification module 144.
The text data obtaining module 141 is configured to obtain text data to be processed; the BERT language model prediction module 142 is configured to input an original word vector of the to-be-processed text data into a BERT language model trained in advance, and output a vector representation result of the fused context information of each character or word in the to-be-processed text data; the feature fusion module 143 is configured to perform feature fusion on the vector representation result output by the BERT language model and other structural features in the text data to be processed, except for context information, to obtain a fused feature vector; and the machine classification module 144 is configured to input the fused feature vector into a machine classification model trained in advance, and output a classification result of the text data to be processed.
It should be noted here that the text data obtaining module 141, the BERT language model predicting module 142, the feature fusion module 143, and the machine classification module 144 correspond to S101 to S104 in the method embodiment, and the modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the method embodiment. It should be noted that the modules described above as part of an apparatus may be implemented in a computer system such as a set of computer-executable instructions.
In one embodiment, in the BERT-based text data processing apparatus provided in the embodiment of the present invention, the machine classification model adopted by the machine classification module 144 is a binary classification model.
In an embodiment, in the BERT-based text data processing apparatus provided in the embodiment of the present invention, the to-be-processed text data is alarm message data in an audit service system; the classification results of the binary classification model comprise: valid alarms and invalid alarms.
In one embodiment, the BERT-based text data processing apparatus provided in the embodiment of the present invention further includes: a first sample acquisition module 145 and a BERT language model training module 146.
The first sample obtaining module 145 is configured to obtain first sample data; and the BERT language model training module 146 is used for learning and training the BERT model according to the first sample data to obtain the BERT language model.
It should be noted here that the first sample obtaining module 145 and the BERT language model training module 146 correspond to S201 to S202 in the method embodiment, and the modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the method embodiment. It should be noted that the modules described above as part of an apparatus may be implemented in a computer system such as a set of computer-executable instructions.
In one embodiment, the BERT-based text data processing apparatus provided in the embodiment of the present invention further includes: a second sample data acquisition module 147 and a machine classification model learning module 148.
The second sample data obtaining module 147 is configured to obtain second sample data; and the machine classification model learning module 148 is used for learning the neural network model according to the second sample data and training to obtain a machine classification model.
It should be noted here that the second sample data obtaining module 147 and the machine classification model learning module 148 correspond to S301 to S302 in the method embodiment, and the modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the method embodiment. It should be noted that the modules described above as part of an apparatus may be implemented in a computer system such as a set of computer-executable instructions.
In an embodiment, in the text data processing apparatus based on BERT provided in the embodiment of the present invention, the feature fusion module 143 is further configured to perform feature fusion on a vector representation result output by the BERT language model and other structural features except context information in the text data to be processed by using a multi-modal feature fusion apparatus, so as to obtain a feature vector after fusion.
Optionally, the multi-modal feature fusion method adopted by the feature fusion module 143 includes any one of the following: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
Optionally, the multi-modal feature fusion method adopted by the feature fusion module 143 includes any one of the following: data level fusion, decision level fusion and combination fusion.
As can be seen from the above, in the text data processing device based on BERT provided in the embodiment of the present invention, after obtaining the text data to be processed, the original word vector of the text data to be processed is input into a BERT language model trained in advance, a vector representation result of the fused context information of each character or word in the text data to be processed is output, and then the vector representation result output by the BERT language model is subjected to feature fusion with other structural features except the context information in the text data to be processed, so as to obtain a fused feature vector; finally, the fused feature vector is input into a machine classification model trained in advance, and a classification result of the text data to be processed is output.
The text data processing device based on BERT provided by the embodiment of the invention can greatly improve the accuracy of text classification. The text data processing method provided by the embodiment of the invention is applied to an auditing service system, and can reduce invalid alarm amount and labor cost.
Based on the same inventive concept, the embodiment of the present invention further provides an embodiment of an electronic device for implementing all or part of the content in the text data processing method based on BERT. The electronic device specifically comprises the following contents:
a processor (processor), a memory (memory), a communication Interface (Communications Interface), and a bus; the processor, the memory and the communication interface complete mutual communication through the bus; the communication interface is used for realizing information transmission between related devices; the electronic device may be a desktop computer, a tablet computer, a mobile terminal, and the like, but the embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiment of the method for implementing the text data processing based on BERT and the embodiment of the apparatus for implementing the text data processing based on BERT in the embodiments, and the contents thereof are incorporated herein, and repeated details are not repeated herein.
Fig. 15 is a schematic diagram of a system configuration structure of an electronic device according to an embodiment of the present invention. As shown in fig. 15, the electronic device 150 may include a processor 1501 and a memory 1502; a memory 1502 is coupled to the processor 1501. Notably, this fig. 15 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the functionality implemented by the BERT based text data processing method may be integrated into processor 1501. Wherein the processor 1501 may be configured to control as follows: acquiring text data to be processed; inputting an original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result of fusion context information of each character or word in the text data to be processed; performing feature fusion on a vector representation result output by the BERT language model and other structural features except context information in text data to be processed to obtain a fused feature vector; and inputting the fused feature vectors into a machine classification model trained in advance, and outputting a classification result of the text data to be processed. The invention can greatly improve the accuracy of text classification.
As can be seen from the above, after the text data to be processed is obtained, the electronic device provided in the embodiment of the present invention inputs the original word vector of the text data to be processed into the pre-trained BERT language model, outputs a vector representation result of the fused context information of each character or word in the text data to be processed, and further performs feature fusion on the vector representation result output by the BERT language model and other structural features except the context information in the text data to be processed, so as to obtain a fused feature vector; finally, the fused feature vector is input into a machine classification model trained in advance, and a classification result of the text data to be processed is output.
By the electronic equipment provided by the embodiment of the invention, the accuracy of text classification can be greatly improved. The text data processing method provided by the embodiment of the invention is applied to an auditing service system, and can reduce invalid alarm amount and labor cost.
In another embodiment, the BERT based text data processing means may be configured separately from the processor 1501, for example, the BERT based text data processing means may be configured as a chip connected to the processor 1501, and the function of the BERT based text data processing method is realized by the control of the processor.
As shown in fig. 15, the electronic device 150 may further include: a communication module 1503, an input unit 1504, an audio processing unit 1505, a display 1506, and a power supply 1507. It is noted that the electronic device 150 does not necessarily include all of the components shown in fig. 15; furthermore, the electronic device 150 may also include components not shown in fig. 15, which may be referred to in the prior art.
As shown in fig. 15, a processor 1501, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which processor 1501 receives input and controls the operation of various components of the electronic device 150.
The memory 1502 may be, for example, one or more of a buffer, a flash memory, a hard drive, removable media, a volatile memory, a non-volatile memory, or another suitable device. It may store information relating to failures as well as the programs that process such information, and the processor 1501 may execute the programs stored in the memory 1502 to realize information storage or processing, or the like.
An input unit 1504 provides input to the processor 1501. The input unit 1504 is, for example, a key or a touch input device. The power supply 1507 is used to supply power to the electronic device 150. The display 1506 is used to display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 1502 may be a solid-state memory such as read-only memory (ROM), random access memory (RAM), a SIM card, or the like. It may also be a memory that holds information even when power is off, and that can be selectively erased and provided with new data; an example of such memory is sometimes called an EPROM or the like. The memory 1502 may also be some other type of device. The memory 1502 includes a buffer memory 15021 (sometimes referred to as a buffer). The memory 1502 may include an application/function storage 15022 for storing application programs and function programs, or a flow for executing the operations of the electronic device 150 through the processor 1501.
The memory 1502 may also include a data store 15023, the data store 15023 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage 15024 of the memory 1502 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 1503 is a transmitter/receiver that transmits and receives signals via the antenna 1508. A communication module (transmitter/receiver) 1503 is coupled to the processor 1501 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 1503, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 1503 is also coupled to a speaker 1509 and a microphone 1510 via an audio processing unit 1505 to provide audio output via the speaker 1509 and receive audio input from the microphone 1510, thereby implementing general telecommunication functions. The audio processing unit 1505 may include any suitable buffers, decoders, amplifiers and so forth. In addition, an audio processing unit 1505 is also coupled to the processor 1501, so that sound can be recorded locally through the microphone 1510, and locally stored sound can be played through the speaker 1509.
An embodiment of the present invention further provides a computer-readable storage medium for implementing all the steps in the BERT-based text data processing method in the above-described embodiment, wherein the computer-readable storage medium stores thereon a computer program that implements all the steps of the BERT-based text data processing method in the above-described embodiment when executed by a processor, for example, the processor implements the following steps when executing the computer program: acquiring text data to be processed; inputting an original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result of fusion context information of each character or word in the text data to be processed; performing feature fusion on a vector representation result output by the BERT language model and other structural features except context information in text data to be processed to obtain a fused feature vector; and inputting the fused feature vectors into a machine classification model trained in advance, and outputting a classification result of the text data to be processed. The invention can greatly improve the accuracy of text classification.
As can be seen from the above, in the computer-readable storage medium provided in the embodiment of the present invention, after the text data to be processed is obtained, the original word vector of the text data to be processed is input into the pre-trained BERT language model, a vector representation result of the fused context information of each character or word in the text data to be processed is output, and then the vector representation result output by the BERT language model is feature-fused with other structural features except the context information in the text data to be processed, so as to obtain a fused feature vector; finally, the fused feature vector is input into a machine classification model trained in advance, and a classification result of the text data to be processed is output.
By the computer-readable storage medium provided by the embodiment of the invention, the accuracy of text classification can be greatly improved. The text data processing method provided by the embodiment of the invention is applied to an auditing service system, and can reduce invalid alarm amount and labor cost.
Although the present invention provides method steps as described in the examples or flowcharts, more or fewer steps may be included based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, apparatus (system) or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is brief, and for the relevant points reference may be made to the partial description of the method embodiment.
In this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The terms "upper," "lower," and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings; they are only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the referenced devices or elements must have a specific orientation or be constructed and operated in a specific orientation, so they should not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and the like are intended to be inclusive: connections may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through intervening media, or internal between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
It should be noted that the embodiments and the features of the embodiments may be combined with each other without conflict. The present invention is not limited to any single aspect, any single embodiment, or any combination and/or permutation of these aspects and/or embodiments. Each aspect and/or embodiment of the invention can be used alone or in combination with one or more other aspects and/or embodiments.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (18)

1. A text data processing method based on BERT is characterized by comprising the following steps:
acquiring text data to be processed;
inputting the original word vector of the text data to be processed into a pre-trained BERT language model, and outputting a vector representation result of fusion context information of each character or word in the text data to be processed;
performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector;
and inputting the fused feature vector into a machine classification model trained in advance, and outputting a classification result of the text data to be processed.
2. The method of claim 1, wherein the machine classification model is a binary classification model.
3. The method according to claim 2, wherein the text data to be processed is alarm message data in an audit service system; the classification results of the binary classification model comprise: valid alarms and invalid alarms.
4. The method of claim 1, wherein the method further comprises:
acquiring first sample data;
and learning a BERT model according to the first sample data, and training to obtain the BERT language model.
5. The method of claim 1, wherein the method further comprises:
acquiring second sample data;
and learning a neural network model according to the second sample data, and training to obtain the machine classification model.
6. The method according to claim 1, wherein a multi-modal feature fusion method is adopted to perform feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector.
7. The method of claim 6, wherein the multi-modal feature fusion method comprises any one of: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
8. The method of claim 6, wherein the multi-modal feature fusion method comprises any one of: data level fusion, decision level fusion and combination fusion.
9. A BERT-based text data processing apparatus, comprising:
the text data acquisition module is used for acquiring text data to be processed;
the BERT language model prediction module is used for inputting the original word vector of the text data to be processed into a pre-trained BERT language model and outputting a vector representation result of fusion context information of each character or word in the text data to be processed;
the feature fusion module is used for performing feature fusion on the vector representation result output by the BERT language model and other structural features except context information in the text data to be processed to obtain a fused feature vector;
and the machine classification module is used for inputting the fused feature vectors into a machine classification model trained in advance and outputting the classification result of the text data to be processed.
10. The apparatus of claim 9, in which the machine classification model is a binary classification model.
11. The apparatus of claim 10, wherein the text data to be processed is alarm message data in an audit service system; the classification results of the binary classification model comprise: valid alarms and invalid alarms.
12. The apparatus of claim 9, wherein the apparatus further comprises:
the first sample acquisition module is used for acquiring first sample data;
and the BERT language model training module is used for learning the BERT model according to the first sample data and training to obtain the BERT language model.
13. The apparatus of claim 9, wherein the apparatus further comprises:
the second sample data acquisition module is used for acquiring second sample data;
and the machine classification model learning module is used for learning the neural network model according to the second sample data and training to obtain the machine classification model.
14. The apparatus according to claim 9, wherein a multi-modal feature fusion apparatus is adopted to perform feature fusion on the vector representation result output by the BERT language model and other structured features except context information in the text data to be processed to obtain a fused feature vector.
15. The apparatus of claim 14, wherein the multi-modal feature fusion method comprises any one of: raw data-based fusion, abstract feature-based fusion, and decision-result-based fusion.
16. The apparatus of claim 15, wherein the multi-modal feature fusion method comprises any one of: data level fusion, decision level fusion and combination fusion.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the BERT based text data processing method according to any one of claims 1 to 8 when executing the computer program.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the BERT-based text data processing method according to any one of claims 1 to 8.
CN202110261106.7A 2021-03-10 2021-03-10 Text data processing method and device based on BERT Pending CN112989041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110261106.7A CN112989041A (en) 2021-03-10 2021-03-10 Text data processing method and device based on BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110261106.7A CN112989041A (en) 2021-03-10 2021-03-10 Text data processing method and device based on BERT

Publications (1)

Publication Number Publication Date
CN112989041A true CN112989041A (en) 2021-06-18

Family

ID=76334799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110261106.7A Pending CN112989041A (en) 2021-03-10 2021-03-10 Text data processing method and device based on BERT

Country Status (1)

Country Link
CN (1) CN112989041A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674866A (en) * 2021-06-23 2021-11-19 江苏天瑞精准医疗科技有限公司 Medical text oriented pre-training method
CN113674866B (en) * 2021-06-23 2024-06-14 江苏天瑞精准医疗科技有限公司 Pre-training method for medical text
CN113449808A (en) * 2021-07-13 2021-09-28 广州华多网络科技有限公司 Multi-source image-text information classification method and corresponding device, equipment and medium
CN114926150A (en) * 2022-06-18 2022-08-19 国网辽宁省电力有限公司电力科学研究院 Digital intelligent auditing method and device for transformer technology conformance assessment
CN114926150B (en) * 2022-06-18 2024-05-14 国网辽宁省电力有限公司电力科学研究院 Digital intelligent auditing method and device for transformer technology compliance assessment
CN115473856A (en) * 2022-09-07 2022-12-13 中国银行股份有限公司 Message checking method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination