CN113449081A - Text feature extraction method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN113449081A CN113449081A CN202110775005.1A CN202110775005A CN113449081A CN 113449081 A CN113449081 A CN 113449081A CN 202110775005 A CN202110775005 A CN 202110775005A CN 113449081 A CN113449081 A CN 113449081A
- Authority
- CN
- China
- Prior art keywords
- text
- vectors
- vector
- file
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The application relates to the technical field of data analysis and discloses a text feature extraction method and device, computer equipment, and a storage medium. The method comprises the following steps: acquiring a text file and segmenting its text content with a word segmenter to obtain a character sequence; extracting features from the character sequence through each of N feature extraction layers of a pre-trained BERT model to obtain N feature vectors, wherein N is an integer greater than 1; and fusing the N feature vectors based on an attention mechanism to obtain a fusion vector, the fusion vector being used to describe the text features of the text file. Because the N feature vectors each describe the text file from a different angle, fusing them yields a fusion vector that represents the text features of the text file more comprehensively, improving the accuracy of the BERT model on classification tasks.
Description
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a method and an apparatus for extracting text features, a computer device, and a storage medium.
Background
At present, natural language processing (NLP) is applied in a wide range of fields. One of the main reasons for its popularity is the use of pre-trained models: a pre-trained model removes the need to train an NLP model extensively before use, and it can be adapted to different data sets to perform different NLP tasks without building a model from scratch.
Among common pre-trained models, the BERT model, which can be pre-trained on unlabeled sample files, markedly improved the accuracy of classification tasks in NLP. In existing solutions, however, during text feature extraction the BERT model extracts features of different dimensions for each character according to the weight of each character in the segmented character sequence. This per-character extraction ignores the integrity of the text content, so the classification results derived from the output text features are inaccurate, which reduces the accuracy of the BERT model on classification tasks.
Disclosure of Invention
The application provides a text feature extraction method and device, computer equipment, and a storage medium, addressing the technical problem that, during text feature extraction with a BERT model, multidimensional features are extracted for each character separately, the integrity of the text content is ignored, and the accuracy of the classification results derived from the output text features is therefore reduced.
A text feature extraction method comprises the following steps:
acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
fusing the N characteristic vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
An extraction apparatus of text features, comprising:
the segmentation module is used for acquiring a text file and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module is used for respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
the fusion module is used for fusing the N characteristic vectors based on an attention mechanism to obtain fusion vectors; the fusion vector is used for describing text features of the text file.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above text feature extraction method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned text feature extraction method.
According to the text feature extraction method and device, computer equipment, and storage medium, a text file is acquired and its text content is segmented by a word segmenter to obtain a character sequence; the character sequence is input into a pre-trained BERT model, whose N feature extraction layers each perform feature extraction to obtain N feature vectors; and the N feature vectors are fused based on an attention mechanism to obtain a fusion vector, which is used to describe the text features of the text file. Because the N feature extraction layers extract different specific feature content from the character sequence, the N feature vectors describe the text file from different angles; fusing them yields a fusion vector that represents the text features of the text file more comprehensively and improves the accuracy of the BERT model on classification tasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic application environment diagram of a text feature extraction method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating an implementation of a text feature extraction method according to an embodiment of the present application;
fig. 3 is a flowchart of step S10 in the text feature extraction method according to an embodiment of the present application;
fig. 4 is a flowchart of step S30 in the text feature extraction method according to an embodiment of the present application;
fig. 5 is a flowchart of step S302 in the text feature extraction method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for extracting text features according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text feature extraction method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. As shown in fig. 1, a client (computer device) communicates with a server through a network. The client (computer device) includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers. The text feature extraction method provided in this embodiment may be executed by the server: for example, a user sends a text file to be processed to the server through the client, the server executes the text feature extraction method provided in this embodiment on that text file to obtain the text features of the target text, and finally the text features may also be sent back to the client.
In some scenarios other than fig. 1, the client may also execute the text feature extraction method: it directly obtains the text features of the determined target text by running the method provided in this embodiment, and then sends the text features of the target text to the server for storage.
It is understood that, to save computation resources and the time cost of natural language processing (NLP), pre-trained models are used. Many types of pre-trained model exist; a common one is the BERT model (Bidirectional Encoder Representations from Transformers), a Transformer-based bidirectional encoding representation model.
The method for extracting text features provided in this embodiment is a further improvement of a text feature extraction process based on a BERT model in a pre-training model for natural language processing.
Fig. 2 shows a flowchart of an implementation of a text feature extraction method according to an embodiment of the present application. As shown in fig. 2, a method for extracting text features is provided, which mainly comprises the following steps S10-S30:
and S10, acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence.
In step S10, the text file is a file in which text content is recorded, for example, a txt file, a Word file, or the like. The word segmentation device is configured to perform character segmentation on the text content of the text file. Here, after the text content of the text file is segmented by the word segmenter, the obtained character sequence corresponds to the text file.
In implementation, if there are a plurality of text files, serial numbers may be configured for different text files, and a character sequence obtained by segmenting text content based on the text file is also configured with a serial number the same as that of the text file, so as to implement a correspondence between the text file and the character sequence.
It should be noted that, because a natural language processing model cannot process text characters directly, the model can handle the input text only after the text file has been segmented and converted by the word segmenter, and the characters or words in the text have been encoded, according to a dictionary, into the corresponding character vectors.
Here, the word segmenter may convert words into character vectors through the word2vec algorithm, or may convert text into character vectors through the BasicTokenizer algorithm in the BERT model. The BasicTokenizer algorithm performs code conversion, punctuation splitting, lowercasing, Chinese character splitting, accent-mark removal, and similar operations on the text, and finally returns an array of words.
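As an illustration of the kind of preprocessing just described (lowercasing, accent stripping, punctuation splitting, splitting each Chinese character into its own token), the following is a minimal pure-Python sketch; the helper name and the restriction to the basic CJK Unified Ideographs range are simplifications for illustration, not the actual BERT implementation:

```python
import unicodedata

def basic_tokenize(text: str) -> list[str]:
    """Sketch of BasicTokenizer-style preprocessing: lowercase, strip
    accents, split punctuation, and isolate each CJK character."""
    # Lowercase and decompose accented characters (NFD), then drop
    # the combining marks (Unicode category "Mn").
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:
            # CJK Unified Ideograph: each character becomes its own token.
            out.append(" " + ch + " ")
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation is split off as a separate token.
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out).split()
```

For example, `basic_tokenize("Héllo, 今天天气好")` lowercases and de-accents the English word, splits off the comma, and splits each Chinese character individually.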
Fig. 3 shows a flowchart for obtaining a text file and segmenting text contents of the text file by a word segmenter to obtain a character sequence according to an embodiment of the present application. As shown in fig. 3, as one embodiment, step S10 includes:
s101, obtaining a text file, segmenting the text content of the text file through a word segmentation device, and inserting separators.
S102, recognizing each text divided by the separator through text coding in a query dictionary, and taking the obtained character vector, segment vector and position vector as a character sequence.
In step S101, the BERT model may take one or two text files as input. The text content of each file is segmented by the word segmenter, and separators are inserted into the segmented text to mark it according to different requirements. The word segmenter recognizes and splits the text according to the text codes preset in its dictionary. The inserted separators include [CLS], [SEP], and [PAD]: [CLS] represents the whole character sequence and marks the front of the text; [SEP] separates the text of different text files; [PAD] serves as a placeholder to fill unused positions. For example, when there is only one text file, with the content "the weather is good today", the text after segmentation and separator insertion is "[CLS] the weather is good today"; when there are two text files with context, with the contents "the weather is good today" and "we can go fishing", the text after segmentation and separator insertion is "[CLS] the weather is good today [SEP] we can go fishing [SEP]". In these examples, spaces indicate the segmentation relationship between characters.
In one embodiment, the language type of the text content in the text file is obtained as required, e.g., Chinese text or English text. When the text of the text file is Chinese, the word segmenter segments each character in the Chinese text; when the text is English, the word segmenter segments each English word. After segmenting the English words, the word segmenter further applies the WordPieceTokenizer algorithm, a subword segmentation algorithm that splits English words into subwords, so that in the word segmenter's preset dictionary, new word meanings can be formed by combining roots and affixes, simplifying the English vocabulary dictionary.
In step S102, because a natural language processing model cannot process text characters directly, each text unit split by the separators is recognized as a character vector according to the text codes in the query dictionary and used as input. The input to the BERT model requires vectors of several dimensions: in addition to the character vector (token embedding), a segment vector (segment embedding) and a position vector (position embedding) are added, and the vectors of these three dimensions are input together as the character sequence. The character vector converts a Chinese character or word via the text codes in the query dictionary; the value of the segment vector is learned automatically after input into the model, depicts the global semantic information of the text, and is fused with the semantic information of the individual characters/words; the position vector is added by the BERT model to characters/words at different positions to distinguish them, because the semantic information carried by a character/word differs with its position in the text (for example, "I love you" versus "you love me").
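The combination of the three per-position vectors can be illustrated with a toy sketch in which each embedding is a plain list of per-token vectors and the three are summed element-wise; this is a deliberate simplification — in the real model the embeddings come from learned lookup tables:

```python
def bert_input_embedding(token_emb: list[list[float]],
                         segment_emb: list[list[float]],
                         position_emb: list[list[float]]) -> list[list[float]]:
    """Sum token, segment, and position embeddings element-wise per
    position to form the model's input representation."""
    return [
        [t + s + p for t, s, p in zip(tok, seg, pos)]
        for tok, seg, pos in zip(token_emb, segment_emb, position_emb)
    ]
```

For a one-token sequence with toy 2-dimensional vectors, `bert_input_embedding([[1.0, 2.0]], [[0.0, 1.0]], [[1.0, 0.0]])` gives `[[2.0, 3.0]]`.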
In one embodiment, the character length of the character sequence is a preset fixed length. When the segmented text of a single text file is too long, the excess at the end is cut off; when the combined segmented text of two text files is too long, tokens are first deleted from the tail of the longer text, and if the two texts are of equal length, tokens are deleted from the tails of the two texts in turn until the total length meets the requirement; when the segmented text is too short, the separator [PAD] is appended to the end as padding until the required length is reached.
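The truncation and padding rules just described can be sketched as two hypothetical helpers; the pair case mirrors the familiar alternating-truncation logic (drop from the tail of the currently longer segment, alternating when equal):

```python
def pad_or_truncate(tokens: list[str], max_len: int,
                    pad_token: str = "[PAD]") -> list[str]:
    """Single-text case: cut the tail if too long, pad with [PAD] if short."""
    if len(tokens) > max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))

def truncate_pair(tokens_a: list[str], tokens_b: list[str],
                  max_len: int) -> tuple[list[str], list[str]]:
    """Pair case: repeatedly drop the last token of the longer segment
    (segment a first when they are equal) until the total length fits."""
    while len(tokens_a) + len(tokens_b) > max_len:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()
    return tokens_a, tokens_b
```

For example, `truncate_pair(["x", "y", "z"], ["u"], 3)` drops one token from the longer first segment, giving `(["x", "y"], ["u"])`.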
In one embodiment, the word segmenter converts a Chinese character or word into a character vector by looking it up in the dictionary vocab.
S20, respectively extracting the features of the character sequence through N feature extraction layers of a pre-trained BERT model to obtain N feature vectors; wherein N is an integer greater than 1.
In step S20, the plurality of feature extraction layers of the BERT model trained in advance perform feature extraction on the character sequence, respectively. The feature vectors are used for describing feature contents of the text file, and because the character sequences are obtained by segmenting characters of the text file, the character sequences are subjected to feature extraction by using a pre-trained BERT model to obtain a plurality of feature vectors which are used for embodying features of the text content.
In one embodiment, a character sequence is input into a pre-trained BERT model, and feature extraction is respectively carried out on the character sequence through N feature extraction layers of the pre-trained BERT model to obtain N feature vectors; wherein N is an integer greater than 1.
In one embodiment, the BERT model is a bidirectional encoding representation based on the Transformer, a model in the NLP field that uses an attention mechanism to improve training speed; the BERT model builds a multi-layer bidirectional encoder network from Transformer structures. The BERT model consists of the encoder parts of multiple Transformer structures; one Transformer encoder unit is formed by stacking multi-head attention and layer normalization. Multi-head attention is composed of multiple self-attention modules; layer normalization standardizes the nodes of a network layer to zero mean and unit variance. Using the Transformer structure, masked tokens are predicted from their context, thereby capturing the bidirectional relations of the character vectors.
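The self-attention at the heart of each encoder unit can be sketched as follows — a minimal pure-Python illustration of scaled dot-product attention for a single query vector, not the model's actual batched multi-head implementation:

```python
import math

def scaled_dot_product_attention(q: list[float],
                                 keys: list[list[float]],
                                 values: list[list[float]]):
    """Attend a single query over a sequence: weights are the softmax of
    q·k_i / sqrt(d); the output is the weight-averaged value vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    # Numerically stable softmax over the similarity scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights
```

With `q = [1.0, 0.0]` and keys/values `[[1.0, 0.0], [0.0, 1.0]]`, the first position matches the query and receives the larger weight; the weights always sum to 1.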
It should be understood that the BERT model is composed of multiple feature extraction layers, each containing one encoder unit. In the larger BERT model there are 24 feature extraction layers, each with 16 attention heads, and the feature vector dimension is 1024; in the smaller BERT model there are 12 feature extraction layers, each with 12 attention heads, and the feature vector dimension is 768.
Each feature extraction layer extracts different features of the character sequence, yielding multiple feature vectors, including feature vectors of lexical features, syntactic features, and semantic features. Because different feature extraction layers extract different features from the character sequence, feature extraction with the pre-trained BERT model produces multiple feature vectors with different characteristics. Taking a BERT model with 12 feature extraction layers as an example: Layer_1 to Layer_4 are the low layers and learn lexical features, e.g., whether a word is a verb or an adjective, and which characters compose it; Layer_5 to Layer_8 are the middle layers and learn syntactic features, e.g., the number of words in a sentence and the dependencies between the words; Layer_9 to Layer_12 are the high layers and learn semantic features, e.g., what the sentence expresses and which words in the sentence are keywords.
As an example, the feature vectors extracted in the feature extraction layers are computed as required; the computation includes linear transformation, activation functions, multi-head self-attention, skip (residual) connections, layer normalization, and dropout.
S30, fusing the N feature vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
In step S30, each of the N feature vectors can be used to characterize a feature of the text file in a certain dimension, the characterized features including lexical, syntactic and semantic features. Here, the fusion vector obtained by fusing the N feature vectors can be directly used to describe the overall features of the text file.
It can be understood that in the BERT model, not only are the features extracted by each feature extraction layer different, but the weights with which the extracted feature vectors are used also differ. Although the feature extraction layers of the BERT model are connected in series and all layers participate when feature vectors are extracted, the feature vectors differ and the layers do not share parameters, so no single feature extraction layer alone can fully capture the features of the character sequence. Therefore, the feature vectors of the layers are fused after dimension reduction, and the model then learns the weight of each feature extraction layer to obtain the fusion vector. The fusion vector is used to describe the text features of the text file for use by downstream tasks.
Fig. 4 shows a flowchart, provided by an embodiment of the present application, of fusing the N feature vectors based on the attention mechanism to obtain the fusion vector, the fusion vector being used to describe the text features of the text file. As shown in fig. 4, as one embodiment, step S30 includes:
s301, respectively carrying out vector reduction operation on each feature vector in the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one by one;
s302, adopting the attention mechanism to measure and calculate the weight of each simplified vector;
and S303, based on the weight of each simplified vector, carrying out fusion processing on the N simplified vectors to obtain a fusion vector.
In step S30, feature extraction is performed on the character sequence by the N feature extraction layers of the pre-trained BERT model to obtain N feature vectors, where N is an integer greater than 1; the weight of each reduced vector is then computed with the attention mechanism, and the N reduced vectors are fused based on these weights to obtain the fusion vector. Because the N feature extraction layers extract different specific feature content, the obtained N feature vectors describe the text file from different angles. Using the attention mechanism to weight each reduced vector during fusion increases the contribution of the reduced vectors with higher relevance, so the resulting fusion vector represents the text features of the text file more comprehensively and improves the accuracy of the BERT model on classification tasks.
S301, respectively carrying out vector reduction operation on each feature vector in the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one by one;
in step S301, the character sequence of the BERT model is input as having a plurality of separators and corresponding character vectors, and a plurality of N feature vectors extracted by the N feature extraction layers also include a plurality of separators and corresponding character vectors, and are respectively subjected to vector reduction operation on each feature vector in the N feature vectors, so as to obtain N reduced vectors corresponding to the N feature vectors one to one. Before the feature vectors are fused, the feature vectors are simplified to obtain the simplified vectors, so that the interference of character vectors of stop words such as ' the ' word ' and the corresponding feature vectors can be reduced as much as possible, the semantic features of the high layer in the feature extraction layer are ensured,
in one embodiment, the BERT model reduces the feature vector in the process, and reserves the feature vector part of the separator [ CLS ] as a reduced vector. By utilizing the characteristic vector part of the first separator [ CLS ] in each characteristic extraction layer, the accuracy can be ensured and a certain amount of calculation can be reduced to a certain extent.
S302, the attention mechanism is adopted to measure and calculate the weight of each simplified vector.
In step S302, since the reduced vector is obtained by reducing the feature vector, the reduced vector still retains the features of the corresponding feature vector, and the utilization of the reduced vector with a higher relevance weight is facilitated by using the attention mechanism, so that the obtained fusion vector can more comprehensively represent the text features of the text file.
Fig. 5 shows a flowchart, provided by an embodiment of the present application, of computing the weight of each reduced vector with the attention mechanism. As shown in fig. 5, as one embodiment, step S302 includes:
s3021, combining the multiple simplified vectors, and processing the combined simplified vectors by using linear transformation to obtain corresponding query vectors, key vectors and value vectors.
S3022, calculating the similarity between the query vector and the key vector, and normalizing the similarity results with a softmax function to obtain the weight of each reduced vector.
In step S3021, the reduced vectors from the multiple layers are merged; the merged reduced vectors are then reduced in dimension through a linear transformation (a fully connected layer, FC), producing the query, key, and value vectors corresponding to the merged reduced vectors before the weights are calculated.
In step S3022, the similarity between the query vector and the key vector is calculated, and a softmax function normalizes the similarity results to obtain the weight of each reduced vector. The similarity function may be a dot product, a concatenation, or a perceptron. Softmax maps the similarity values of the query and key vectors into the range [0, 1] so that the weights of the reduced vectors form a probability distribution. This normalization prevents vanishing or exploding gradients in the attention mechanism and accelerates convergence.
In an embodiment, suppose each reduced vector is a, the dimension of the original feature vector is 768, and there are N feature extraction layers. Merging the N per-layer vectors a of dimension 768 yields a tensor of shape N × a × 768; a linear transformation (FC, fully connected layer) reduces the dimension to N × a × 1 and produces the query, key, and value vectors corresponding to the a vectors. The similarity between each query vector and key vector is calculated, and softmax maps the similarity values into [0, 1], giving the weight of each a vector.
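A minimal sketch of the projection and softmax weighting of steps S3021 and S3022 is given below. The projection matrices are random stand-ins for learned parameters, and scoring a single shared query (the mean of the per-layer queries) against every key is an illustrative simplification that yields one weight per reduced vector, as the description requires.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 12, 768   # N reduced vectors (one per layer), original dimension 768

merged = rng.standard_normal((N, d))   # merged reduced vectors

# Hypothetical learned projection matrices (trained in the real model):
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
q, k, v = merged @ W_q, merged @ W_k, merged @ W_v

# One shared query scored against every key gives one similarity per
# reduced vector; scaling by sqrt(d) stabilizes the dot products.
scores = (q.mean(axis=0) @ k.T) / np.sqrt(d)   # shape (N,)

# Softmax maps the similarities into [0, 1] and makes them sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(weights.shape)  # (12,)
```

The dot-product similarity shown here is only one of the options the description names; concatenation or a small perceptron could be substituted at the `scores` line.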
S303, fusing the N reduced vectors based on the weight of each reduced vector to obtain a fusion vector.
In step S303, since each reduced vector retains the features of its corresponding feature vector, the weights of the reduced vectors are used to compute a weighted average with the value vectors, and the weighted average is taken as the fusion vector. With the attention mechanism involved, the model learns finer-grained feature associations across feature extraction layers than the original BERT model. The weighted average can be computed by element-wise multiplication (Multiply) followed by averaging.
In an embodiment, the weights of the reduced vectors and the value vectors are combined by element-wise multiplication (Multiply) and weighted averaging to obtain a fusion vector of shape a × 768.
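The fusion step of S303 can be sketched as follows; the value vectors and weights are random stand-ins for the outputs of the preceding steps, with the same assumed dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 12, 768

values = rng.standard_normal((N, d))   # value vectors, one per layer
weights = rng.random(N)
weights /= weights.sum()               # normalized attention weights

# Weighted average: element-wise multiply (Multiply) each value vector by
# its weight, then sum over the N layers to obtain the fusion vector.
fused = (weights[:, None] * values).sum(axis=0)   # shape (d,)

print(fused.shape)  # (768,)
```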
It is understood that the extracted fusion vector describes the text features of the text file; for the downstream task, a linear transformation (FC, fully connected layer) of the fusion vector maps the text features to the output required by that task.
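A sketch of that final fully connected mapping, assuming a two-class downstream task (the label count and the random parameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, num_labels = 768, 2   # fusion dimension; num_labels is a task-specific assumption

fused = rng.standard_normal(d)                    # stand-in for the fusion vector
W = rng.standard_normal((d, num_labels)) * 0.02   # hypothetical learned FC weights
b = np.zeros(num_labels)

# Fully connected (FC) layer mapping the fusion vector to the output
# space of the downstream task, e.g. two-class logits.
logits = fused @ W + b

print(logits.shape)  # (2,)
```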
In an embodiment, downstream NLP task models use the text features output by the BERT model either with or without fine-tuning. If fine-tuning is not used, the BERT model serves purely as a text feature extractor, and only the text features it outputs are needed. If fine-tuning is used, the weights the BERT model assigns to different character vectors are adjusted during training to adapt to the current task.
S40, pre-training the BERT model before the step of extracting features of the character sequence through the N feature extraction layers of the pre-trained BERT model to obtain N feature vectors.
It can be understood that the BERT model is a natural language processing model, so it must be trained in advance on a large number of sample files to ensure accurate processing of text files.
S401, data enhancement is carried out on the preset sample file by adopting a data enhancement algorithm to obtain an addendum file.
It can be understood that the BERT model must be trained in advance before processing text files: unlabeled text files serve as sample files and are input into the BERT model for training. Pre-training, however, requires a large number of sample files to guarantee the model's robustness and learning performance. The sample files therefore comprise a number of preset sample files together with addendum files obtained by applying a data enhancement algorithm to those preset sample files.
Data enhancement methods include, but are not limited to, vocabulary replacement, back-translation, and the Mixup data enhancement algorithm. Vocabulary replacement substitutes synonyms for words in the text, producing a new expression while keeping the semantics as unchanged as possible. Back-translation translates the original document into another language and then back again, yielding an addendum file in the original language. The Mixup data enhancement algorithm originates from image augmentation in computer vision, where the pixels of two images from different classes are combined to generate a synthetic training example; Mixup has rarely been applied to natural language processing.
In an embodiment, text mixing based on the Mixup data enhancement algorithm is applied to natural language processing to produce addendum files, using three variants: wordMixup, senMixup, and tokenMixup. wordMixup takes two random sentences, zero-pads them to the same length, and interpolates their input word vectors in a given ratio. senMixup takes two sentences, zero-pads them to the same length, encodes them with the BERT encoder, and interpolates the resulting sentence vectors in a given ratio. Unlike wordMixup, which mixes at the word level, tokenMixup fuses embedding vectors at the character (token) level; for example, the English word "playing" is split into "play ##ing", which better handles out-of-vocabulary (OOV) words.
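The interpolation at the heart of all three Mixup variants can be sketched as below; the random vectors stand in for word or sentence embeddings (in senMixup they would come from the BERT encoder), and the fixed mixing ratio is an illustrative assumption (it is typically drawn from a Beta distribution).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 768

# Stand-in embedding vectors for two zero-padded sentences A and B.
vec_a = rng.standard_normal(d)
vec_b = rng.standard_normal(d)

lam = 0.7   # mixing ratio (assumed fixed here for illustration)
mixed = lam * vec_a + (1 - lam) * vec_b   # interpolated synthetic sample

# The labels are mixed with the same ratio, giving a soft training target.
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_label = lam * label_a + (1 - lam) * label_b
print(mixed_label)  # [0.7 0.3]
```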
As a concrete example of the Mixup data enhancement algorithm, two text files are randomly selected from a plurality of text files, with the following contents:
1. [CLS] They went to the restaurant and found their seats. [SEP] Next to them was a long table seating twenty people, quite large and with considerable capacity. [SEP]
2. [CLS] They have not come back yet, so what should we do? [SEP] A heavy rain is falling in Beijing. [SEP]
(For ease of understanding, each sample file is delimited by separators, and the two text sentences in each sample file are semantically connected.)
The text parts of sample file 1 and sample file 2 (text A and text B, respectively) are then converted into character sequences and input into the BERT model for processing. Feature extraction by the BERT model yields two feature vectors, A and B; the two feature vectors are assigned weights and mixed according to those weights to obtain a new feature vector C. Finally, feature vector C, together with the separator [SEP], is used as a new sample file for training. Because the fused vectors are built from features extracted by the BERT model, interference from the character vectors of stop words such as "the" is reduced as far as possible.
S402, training an initial BERT model by using a sample file formed by the preset sample file and the addendum file to obtain the pre-trained BERT model.
In step S402, an initial BERT model is trained using the sample files composed of the preset sample files and the addendum files, so as to obtain the pre-trained BERT model. The pre-training tasks performed on the BERT model using the sample files include NSP (Next Sentence Prediction, a text-pair classification task) and MLM (Masked Language Model, which randomly masks tokens to train bidirectional features).
In one embodiment, a text classification task is performed on the text features output for a sample file. For single-text classification, a symbol without explicit semantic information of its own can fuse the semantic information of every character in the text more fairly than the characters already present in the text. The paired-text (related sentences) classification task has practical application scenarios including question answering (judging whether a question matches an answer) and sentence matching (whether two sentences express the same meaning); two different segment vectors are added to the two sentences to distinguish them.
In one embodiment, to train the bidirectional features of the character sequence, the MLM pre-training method randomly masks some characters (tokens) in the character sequence, and only those masked character vectors are predicted. 15% of the character vectors in the corpus are randomly masked, and the feature vectors output at the masked positions are fed through normalization to predict the masked characters. Because replacing the selected character vectors with the [MASK] token throughout would bias the model, the following strategy is adopted for the randomly selected positions: 80% are replaced with the [MASK] token, so "my dog is hairy" becomes "my dog is [MASK]"; 10% are replaced with an arbitrary word, so "my dog is hairy" becomes "my dog is apple"; and 10% are left unchanged, so "my dog is hairy" remains "my dog is hairy".
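The 80/10/10 masking strategy described above can be sketched as follows; the vocabulary, mask rate, and seed are illustrative assumptions, not values from the source.

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, seed=42):
    """Randomly select ~15% of positions; of those, 80% become [MASK],
    10% become a random word, and 10% are left unchanged."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok            # the model must predict the original
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"       # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the original token
    return out, targets

masked, targets = mlm_mask("my dog is hairy".split(), vocab=["apple", "cat"])
print(masked, targets)
```

The prediction targets record only the masked positions, matching the description that only the masked character vectors are predicted.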
In an embodiment, a text feature extraction device is provided, corresponding one-to-one to the text feature extraction method in the above embodiments. As shown in fig. 6, the text feature extraction device includes a segmentation module 11, an extraction module 12, and a fusion module 13. Each functional module is described in detail as follows:
the segmentation module 11 is used for acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module 12 is used for respectively extracting the characteristics of the character sequences through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
a fusion module 13 for fusing the N feature vectors based on an attention mechanism to obtain a fusion vector; the fusion vector is used for describing text features of the text file;
for the specific definition of the text feature extraction device, reference may be made to the above definition of the text feature extraction method, which is not described herein again. The modules in the text feature extraction device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a client or a server, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the readable storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of extracting text features.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the extraction method of text features in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method for extracting text features in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (10)
1. A method for extracting text features is characterized by comprising the following steps:
acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
fusing the N characteristic vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
2. The method for extracting text features of claim 1, wherein the obtaining a text file, segmenting text contents of the text file through a word segmenter to obtain a character sequence, comprises:
acquiring a text file, segmenting the text content of the text file through a word segmentation device, and inserting separators;
and querying a dictionary to obtain text codes identifying each text segment divided by the separators, and taking the obtained character vectors, segment vectors, and position vectors as the character sequence.
3. The method for extracting text features according to claim 1, wherein before the step of extracting features of the character sequence through N feature extraction layers of a pre-trained BERT model to obtain N feature vectors, the method further comprises:
performing data enhancement on a preset sample file by adopting a data enhancement algorithm to obtain an addendum file;
and training an initial BERT model by using a sample file consisting of the preset sample file and the addendum file to obtain the pre-trained BERT model.
4. The method for extracting text features according to claim 1, wherein the fusing the N feature vectors based on the attention mechanism to obtain a fused vector comprises:
respectively carrying out vector simplification operation on each feature vector in the N feature vectors to obtain N simplified vectors corresponding to the N feature vectors one by one;
measuring and calculating the weight of each simplified vector by adopting the attention mechanism;
and based on the weight of each simplified vector, carrying out fusion processing on the N simplified vectors to obtain a fusion vector.
5. The method for extracting text features according to claim 4, wherein the performing a vector reduction operation on each feature vector of the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one to one includes:
and screening the character vectors in each feature vector and retaining the key character vectors, to obtain N simplified vectors corresponding to the N feature vectors one by one.
6. The method for extracting text features according to claim 4, wherein the calculating the weight of each reduced vector by using the attention mechanism comprises:
merging a plurality of simplified vectors, and processing the merged simplified vectors by utilizing linear transformation to obtain corresponding query vectors, key vectors and value vectors;
and performing similarity calculation on the query vector and the key vector, and performing normalization processing on a similarity calculation result by adopting a softmax function to obtain the weight of the simplified vector.
7. The method for extracting text features as claimed in claim 4 or claim 6, wherein the fusing the N reduced vectors based on the weight of each reduced vector to obtain a fused vector comprises:
and carrying out weighted average on the weight and the value vector of the simplified vector to obtain a weighted average value as a fusion vector.
8. An apparatus for extracting text features, comprising:
the segmentation module is used for acquiring a text file and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module is used for respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
the fusion module is used for fusing the N characteristic vectors based on an attention mechanism to obtain fusion vectors; the fusion vector is used for describing text features of the text file.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of extracting text features according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements a method of extracting text features according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110775005.1A CN113449081A (en) | 2021-07-08 | 2021-07-08 | Text feature extraction method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449081A true CN113449081A (en) | 2021-09-28 |
Family
ID=77815554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110775005.1A Pending CN113449081A (en) | 2021-07-08 | 2021-07-08 | Text feature extraction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449081A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160180163A1 (en) * | 2014-12-19 | 2016-06-23 | Konica Minolta Laboratory U.S.A., Inc. | Method for segmenting text words in document images using vertical projections of center zones of characters |
CN110489424A (en) * | 2019-08-26 | 2019-11-22 | 北京香侬慧语科技有限责任公司 | A kind of method, apparatus, storage medium and the electronic equipment of tabular information extraction |
CN110674892A (en) * | 2019-10-24 | 2020-01-10 | 北京航空航天大学 | Fault feature screening method based on weighted multi-feature fusion and SVM classification |
CN110781312A (en) * | 2019-09-19 | 2020-02-11 | 平安科技(深圳)有限公司 | Text classification method and device based on semantic representation model and computer equipment |
CN112733866A (en) * | 2021-01-27 | 2021-04-30 | 西安理工大学 | Network construction method for improving text description correctness of controllable image |
CN113051371A (en) * | 2021-04-12 | 2021-06-29 | 平安国际智慧城市科技股份有限公司 | Chinese machine reading understanding method and device, electronic equipment and storage medium |
2021-07-08: CN application CN202110775005.1A filed (published as CN113449081A, status: Pending)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115187996A (en) * | 2022-09-09 | 2022-10-14 | 中电科新型智慧城市研究院有限公司 | Semantic recognition method and device, terminal equipment and storage medium |
CN116842932A (en) * | 2023-08-30 | 2023-10-03 | 腾讯科技(深圳)有限公司 | Text feature decoding method and device, storage medium and electronic equipment |
CN116842932B (en) * | 2023-08-30 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Text feature decoding method and device, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111460807B (en) | Sequence labeling method, device, computer equipment and storage medium | |
CN111291195B (en) | Data processing method, device, terminal and readable storage medium | |
CN111931517B (en) | Text translation method, device, electronic equipment and storage medium | |
CN110162624B (en) | Text processing method and device and related equipment | |
CN113449081A (en) | Text feature extraction method and device, computer equipment and storage medium | |
CN111597807B (en) | Word segmentation data set generation method, device, equipment and storage medium thereof | |
CN113705315A (en) | Video processing method, device, equipment and storage medium | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN116050352A (en) | Text encoding method and device, computer equipment and storage medium | |
CN113221553A (en) | Text processing method, device and equipment and readable storage medium | |
CN114492661A (en) | Text data classification method and device, computer equipment and storage medium | |
CN114139551A (en) | Method and device for training intention recognition model and method and device for recognizing intention | |
CN110717316B (en) | Topic segmentation method and device for subtitle dialog flow | |
CN116595023A (en) | Address information updating method and device, electronic equipment and storage medium | |
CN115115432B (en) | Product information recommendation method and device based on artificial intelligence | |
CN116432705A (en) | Text generation model construction method, text generation device, equipment and medium | |
CN114611529B (en) | Intention recognition method and device, electronic equipment and storage medium | |
CN114398903B (en) | Intention recognition method, device, electronic equipment and storage medium | |
CN115620726A (en) | Voice text generation method, and training method and device of voice text generation model | |
CN115687607A (en) | Text label identification method and system | |
CN114298032A (en) | Text punctuation detection method, computer device and storage medium | |
CN114067362A (en) | Sign language recognition method, device, equipment and medium based on neural network model | |
CN113591493A (en) | Translation model training method and translation model device | |
CN111967253A (en) | Entity disambiguation method and device, computer equipment and storage medium | |
CN111814496A (en) | Text processing method, device, equipment and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210928 |