CN113449081A - Text feature extraction method and device, computer equipment and storage medium

Text feature extraction method and device, computer equipment and storage medium

Info

Publication number
CN113449081A
CN113449081A
Authority
CN
China
Prior art keywords
text
vectors
vector
file
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110775005.1A
Other languages
Chinese (zh)
Inventor
吴晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110775005.1A
Publication of CN113449081A
Legal status: Pending

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/253 Fusion techniques of extracted features
    • G06F40/126 Character encoding
    • G06F40/279 Recognition of textual entities
    • G06F40/30 Semantic analysis

Abstract

The application relates to the technical field of data analysis and discloses a text feature extraction method and device, a computer device, and a storage medium. The method comprises the following steps: acquiring a text file and segmenting the text content of the text file with a word segmenter to obtain a character sequence; performing feature extraction on the character sequence through N feature extraction layers of a pre-trained BERT model to obtain N feature vectors, where N is an integer greater than 1; and fusing the N feature vectors based on an attention mechanism to obtain a fusion vector, which is used to describe the text features of the text file. Because the N feature vectors each describe the text file from a different angle, fusing them yields a fusion vector that represents the text features of the text file more comprehensively, which improves the accuracy of the BERT model on classification tasks.

Description

Text feature extraction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data analysis technologies, and in particular, to a method and an apparatus for extracting text features, a computer device, and a storage medium.
Background
At present, Natural Language Processing (NLP) is applied in many fields. One of the main reasons for its popularity is the use of pre-trained models: a pre-trained model removes the need to train an NLP model extensively before use and can be adapted to different data sets to perform different NLP tasks without building a model from scratch.
Among the common pre-trained models, the BERT model, which can be obtained by pre-training on unlabeled sample files, significantly improved the accuracy of classification tasks in NLP. In the existing technical solutions, however, the BERT model extracts features of different dimensions for each character according to the weight of that character in the segmented character sequence. This feature extraction manner ignores the integrity of the text content, so the classification results produced from the output text features are inaccurate, which reduces the accuracy of the BERT model on classification tasks.
Disclosure of Invention
The application provides a text feature extraction method and device, a computer device, and a storage medium, and solves the technical problem that, in the text feature extraction process of a BERT model, multi-dimensional features are extracted for each character separately, the integrity of the text content is ignored, and the accuracy of classification results based on the output text features is therefore reduced.
A text feature extraction method comprises the following steps:
acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
fusing the N characteristic vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
An extraction apparatus of text features, comprising:
the segmentation module is used for acquiring a text file and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module is used for respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
the fusion module is used for fusing the N characteristic vectors based on an attention mechanism to obtain fusion vectors; the fusion vector is used for describing text features of the text file.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above text feature extraction method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned text feature extraction method.
According to the text feature extraction method and device, the computer device, and the storage medium, a text file is acquired and its text content is segmented by a word segmenter to obtain a character sequence; the character sequence is fed into a pre-trained BERT model, whose N feature extraction layers each extract features to obtain N feature vectors; and the N feature vectors are fused based on an attention mechanism to obtain a fusion vector, which is used to describe the text features of the text file. Because the N feature extraction layers extract different specific feature content from the character sequence, the resulting N feature vectors describe the text file from different angles. Fusing them therefore produces a fusion vector that represents the text features of the text file more comprehensively, which improves the accuracy of the BERT model on classification tasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic application environment diagram of a text feature extraction method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating an implementation of a text feature extraction method according to an embodiment of the present application;
fig. 3 is a flowchart of step S10 in the text feature extraction method according to an embodiment of the present application;
fig. 4 is a flowchart of step S30 in the text feature extraction method according to an embodiment of the present application;
fig. 5 is a flowchart of step S302 in the text feature extraction method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for extracting text features according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text feature extraction method provided in the embodiments of the present application can be applied in the application environment shown in fig. 1. As shown in fig. 1, a client (computer device) communicates with a server through a network. The client (computer device) includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of multiple servers. The text feature extraction method provided in this embodiment may be executed by the server: for example, a user sends a text file to be processed to the server through the client, the server executes the text feature extraction method provided in this embodiment on the text file to obtain the text features of the target text, and finally the server may send the text features back to the client.
In some scenarios other than fig. 1, the client may also execute the text feature extraction method: it directly obtains the text features of the determined target text by executing the method provided in this embodiment, and then sends the text features of the target text to the server for storage.
It can be understood that, in order to save computation resources and the time cost of Natural Language Processing (NLP), natural language processing uses pre-trained models. There are many types of pre-trained models; a common one is the BERT model, a Transformer-based bidirectional encoding representation model.
The method for extracting text features provided in this embodiment is a further improvement of a text feature extraction process based on a BERT model in a pre-training model for natural language processing.
Fig. 2 shows a flowchart of an implementation of a text feature extraction method according to an embodiment of the present application. As shown in fig. 2, a method for extracting text features is provided, which mainly comprises the following steps S10-S30:
and S10, acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence.
In step S10, the text file is a file in which text content is recorded, for example, a txt file, a Word file, or the like. The word segmentation device is configured to perform character segmentation on the text content of the text file. Here, after the text content of the text file is segmented by the word segmenter, the obtained character sequence corresponds to the text file.
In implementation, if there are a plurality of text files, serial numbers may be configured for different text files, and a character sequence obtained by segmenting text content based on the text file is also configured with a serial number the same as that of the text file, so as to implement a correspondence between the text file and the character sequence.
It should be noted that a natural language processing model cannot process text characters directly. The acquired text file must first be segmented by the word segmenter, and the characters or words in the text must be encoded, according to a dictionary, into the corresponding character vectors in that dictionary; only then can the model process the input text.
Here, the word segmenter may convert words into character vectors through the word2vec algorithm, or may convert text into character vectors through the BasicTokenizer algorithm in the BERT model. The BasicTokenizer algorithm performs operations such as code conversion, punctuation segmentation, lowercase conversion, Chinese character segmentation, and accent removal on the text, and finally returns an array of the resulting words.
Fig. 3 shows a flowchart for obtaining a text file and segmenting text contents of the text file by a word segmenter to obtain a character sequence according to an embodiment of the present application. As shown in fig. 3, as one embodiment, step S10 includes:
s101, obtaining a text file, segmenting the text content of the text file through a word segmentation device, and inserting separators.
S102, recognizing each text divided by the separator through text coding in a query dictionary, and taking the obtained character vector, segment vector and position vector as a character sequence.
In step S101, the text file required as input by the BERT model may be one text file or two text files. The text content of the text file is segmented by the word segmenter, separators are inserted into the segmented text, and the segmented text is marked according to different requirements. The word segmenter recognizes and divides the text according to the text codes preset in its dictionary. The inserted separators include [CLS], [SEP], and [PAD]: [CLS] represents the whole character sequence and marks the front of the text; [SEP] separates the texts of different text files; [PAD] is a placeholder used for filling. For example, when there is only one text file whose content is "the weather is good today", the text after segmentation and separator insertion is "[CLS] the weather is good today"; when there are two text files with related context, whose contents are "the weather is good today" and "we can go fishing", the text after segmentation and separator insertion is "[CLS] the weather is good today [SEP] we can go fishing [SEP]". In the examples above, spaces represent the segmentation relationship between characters.
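As an illustrative aid (not part of the claimed method), the following minimal sketch shows this separator insertion using the open-source HuggingFace transformers tokenizer; the library, the bert-base-chinese checkpoint, and the example sentences are assumptions made here for demonstration only.

    # Sketch only: assumes the HuggingFace `transformers` package and the
    # publicly available `bert-base-chinese` vocabulary.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

    # One text file: [CLS] is inserted at the front (the library also appends a trailing [SEP]).
    single = tokenizer("今天天气很好")
    print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
    # ['[CLS]', '今', '天', '天', '气', '很', '好', '[SEP]']

    # Two related text files: [SEP] separates the two texts.
    pair = tokenizer("今天天气很好", "可以去钓鱼")
    print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
    # ['[CLS]', '今', '天', '天', '气', '很', '好', '[SEP]', '可', '以', '去', '钓', '鱼', '[SEP]']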
In an embodiment, the language of the text content in the text file is obtained according to the requirement, for example Chinese text or English text. When the text of the text file is Chinese, the word segmenter segments each character in the Chinese text; when the text is English, the word segmenter segments each English word in the English text. After the English words are segmented, the word segmenter further applies the WordPieceTokenizer algorithm, a subword segmentation algorithm that splits English words into subwords. In the dictionary preset by the word segmenter, the English vocabulary can thus express new word meanings through combinations of roots and affixes, which simplifies the English dictionary.
In step S102, because a natural language processing model cannot process text characters directly, each text segment divided by the separators is usually recognized as a character vector by querying the text codes in a dictionary and used as input. In addition to the character vector (token embedding), the input of the BERT model requires vectors of further kinds: a segment vector (segment embedding) and a position vector (position embedding) are added for each character vector, and these three kinds of vectors together are input as the character sequence. The character vector converts a Chinese character or word through the text codes in the dictionary; the value of the segment vector is learned automatically after being input into the model, describes the global semantic information of the text, and is fused with the semantic information of the individual character or word; and the position vector is added by the BERT model to characters or words at different positions to distinguish them, because the semantic information carried by the same character or word differs depending on where it appears in the text (for example, the same words in a different order express a different meaning).
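A minimal sketch of how the three kinds of input vectors could be combined is given below; the vocabulary size, dimensions, and example token IDs are illustrative assumptions, not values prescribed by this application.

    import torch
    import torch.nn as nn

    vocab_size, max_len, hidden = 21128, 512, 768          # assumed sizes

    token_emb = nn.Embedding(vocab_size, hidden)           # character vector (token embedding)
    segment_emb = nn.Embedding(2, hidden)                  # segment vector (which text a character belongs to)
    position_emb = nn.Embedding(max_len, hidden)           # position vector (order within the sequence)

    input_ids = torch.tensor([[101, 791, 1921, 102]])      # e.g. [CLS] 今 天 [SEP] (illustrative IDs)
    segment_ids = torch.zeros_like(input_ids)              # all characters come from the first text
    position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

    # The three vectors are summed to form the input character sequence.
    inputs = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
    print(inputs.shape)                                     # torch.Size([1, 4, 768])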
In one embodiment, the character length of the character sequence is a fixed, preset length. When the segmented text of a single text file is too long, the excess at the end is cut off. When the combined length of the segmented text of two text files is too long, characters at the tail of the longer text are deleted first; if the two texts are of equal length, characters are deleted from the tails of the two texts in turn until the total length meets the requirement. When the segmented text is too short, the separator [PAD] is appended at the end as filling so that the length reaches the requirement. A minimal sketch of these rules is given below.
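The following sketch illustrates these length-handling rules, assuming tokens_a and tokens_b are the already-segmented character lists of one or two text files; the function name and the max_len value are hypothetical, and handling of the separator positions themselves is omitted for brevity.

    def fit_to_length(tokens_a, tokens_b=None, max_len=128):
        """Truncate or pad segmented text to the preset fixed length."""
        if tokens_b is None:
            tokens = tokens_a[:max_len]                     # cut off the excess at the end
        else:
            trim_a = True
            while len(tokens_a) + len(tokens_b) > max_len:
                if len(tokens_a) > len(tokens_b):
                    tokens_a.pop()                          # trim the tail of the longer text first
                elif len(tokens_b) > len(tokens_a):
                    tokens_b.pop()
                else:                                       # equal lengths: trim the two tails in turn
                    (tokens_a if trim_a else tokens_b).pop()
                    trim_a = not trim_a
            tokens = tokens_a + tokens_b
        tokens += ["[PAD]"] * (max_len - len(tokens))       # fill with [PAD] when too short
        return tokens

    print(fit_to_length(list("今天天气很好"), max_len=8))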
In one embodiment, the word segmenter converts a Chinese character or word into a character vector by looking it up in the dictionary vocab.
S20, respectively extracting the features of the character sequence through N feature extraction layers of a pre-trained BERT model to obtain N feature vectors; wherein N is an integer greater than 1.
In step S20, the plurality of feature extraction layers of the BERT model trained in advance perform feature extraction on the character sequence, respectively. The feature vectors are used for describing feature contents of the text file, and because the character sequences are obtained by segmenting characters of the text file, the character sequences are subjected to feature extraction by using a pre-trained BERT model to obtain a plurality of feature vectors which are used for embodying features of the text content.
In one embodiment, a character sequence is input into a pre-trained BERT model, and feature extraction is respectively carried out on the character sequence through N feature extraction layers of the pre-trained BERT model to obtain N feature vectors; wherein N is an integer greater than 1.
In one embodiment, the BERT model is a bidirectional encoding representation based on the Transformer. The Transformer is a model in the NLP field that uses an attention (Attention) mechanism to improve training speed, and the BERT model builds a multi-layer bidirectional encoder network from Transformer structures. The BERT model is composed of the encoder parts of multiple Transformer structures: one Transformer encoder unit is formed by stacking multi-head attention (multi-head-Attention) and layer normalization (Layer Normalization); multi-head attention is composed of multiple self-attention (Self-Attention) heads; and layer normalization normalizes the nodes of a given neural network layer to zero mean and unit variance. The Transformer structure predicts masked characters (tokens) from their context, thereby capturing the bidirectional relationships of the character vectors. A sketch of one such encoder unit follows.
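The sketch below shows one encoder unit (multi-head self-attention plus layer normalization) in PyTorch; the feed-forward sublayer, dimensions, and class name are common-practice assumptions rather than details specified by this application.

    import torch
    import torch.nn as nn

    class EncoderUnit(nn.Module):
        def __init__(self, hidden=768, heads=12, ff=3072):
            super().__init__()
            self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)  # multi-head self-attention
            self.norm1 = nn.LayerNorm(hidden)                                   # layer normalization
            self.ffn = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
            self.norm2 = nn.LayerNorm(hidden)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)     # query, key and value all come from x (self-attention)
            x = self.norm1(x + attn_out)         # residual connection + layer normalization
            x = self.norm2(x + self.ffn(x))
            return x

    x = torch.randn(1, 10, 768)                  # (batch, sequence length, hidden size)
    print(EncoderUnit()(x).shape)                # torch.Size([1, 10, 768])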
It should be understood that the BERT model is composed of multiple such feature extraction layers, each containing one encoder unit. In the larger BERT model there are 24 feature extraction layers, each with 16 attention heads, and the feature vector dimension is 1024; in the smaller BERT model there are 12 feature extraction layers, each with 12 attention heads, and the feature vector dimension is 768.
And each feature extraction layer respectively extracts different features of the character sequence to obtain a plurality of feature vectors. The feature vectors of different features include feature vectors of lexical features, feature vectors of syntactic features, and feature vectors of semantic features. Since the features extracted by different feature extraction layers for the character sequence are different, the character sequence is subjected to feature extraction by using the pre-trained BERT model, and a plurality of feature vectors with different features can be obtained. By way of example with a BERT model of 12 feature extraction layers: layer _1 to Layer _4 are low layers, and the lexical characteristics are learned, such as: whether a word is a verb or an adjective, which characters the word consists of, etc.; layer _5 to Layer _8 are middle layers, and the syntax characteristics are learned, such as: the number of words in the sentence, the dependency between the words and the words in the sentence, etc.; layer _9 to Layer _12 are high-level, and what is learned is semantic features, such as: what the semantics of the sentence expression are, what in the sentence are keywords, etc.
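As an illustration of obtaining the N per-layer feature vectors, the sketch below uses the HuggingFace transformers BertModel with output_hidden_states enabled; this is one possible implementation assumed for demonstration, not the only realization of the method.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

    inputs = tokenizer("今天天气很好", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states = (embedding output, layer 1, ..., layer 12); dropping the
    # embedding output leaves the N = 12 feature extraction layer outputs.
    layer_vectors = outputs.hidden_states[1:]
    print(len(layer_vectors), layer_vectors[0].shape)   # 12 torch.Size([1, 8, 768])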
As an example, the feature vectors extracted in the feature extraction layer are calculated according to requirements; the calculation methods include linear transformation, activation functions, multi-head self-attention, skip connections, layer normalization, and dropout.
S30, fusing the N feature vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
In step S30, each of the N feature vectors can be used to characterize a feature of the text file in a certain dimension, the characterized features including lexical, syntactic and semantic features. Here, the fusion vector obtained by fusing the N feature vectors can be directly used to describe the overall features of the text file.
It can be understood that, in the BERT model, each feature extraction layer not only extracts different features but also assigns different utilization weights to the extracted feature vectors. Although the feature extraction layers are connected in series and all of them participate when extracting feature vectors, the feature vectors of the layers differ and do not share parameters, so no single feature extraction layer on its own can fully capture the features of the character sequence. Therefore the feature vectors of each layer need to be reduced and fused, and the BERT model then learns the weight of each feature extraction layer to obtain the fusion vector. The fusion vector is used to describe the text features of the text file for use by downstream tasks.
Fig. 4 shows a flowchart of fusing the N feature vectors based on the attention mechanism to obtain a fusion vector, where the fusion vector is used to describe the text features of the text file, according to an embodiment of the present application. As shown in fig. 4, as one embodiment, step S30 includes:
s301, respectively carrying out vector reduction operation on each feature vector in the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one by one;
s302, adopting the attention mechanism to measure and calculate the weight of each simplified vector;
and S303, based on the weight of each simplified vector, carrying out fusion processing on the N simplified vectors to obtain a fusion vector.
In step S30, the N feature vectors obtained from the N feature extraction layers of the pre-trained BERT model (N being an integer greater than 1) are each reduced, the weight of each reduced vector is calculated using the attention mechanism, and the N reduced vectors are fused based on those weights to obtain the fusion vector. Because the N feature extraction layers extract different specific feature content from the character sequence, the N feature vectors describe the text file from different angles. Calculating the weight of each reduced vector with the attention mechanism increases the utilization of the reduced vectors with higher relevance weights, so the resulting fusion vector represents the text features of the text file more comprehensively, which improves the accuracy of the BERT model on classification tasks.
S301, respectively carrying out vector reduction operation on each feature vector in the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one by one;
in step S301, the character sequence of the BERT model is input as having a plurality of separators and corresponding character vectors, and a plurality of N feature vectors extracted by the N feature extraction layers also include a plurality of separators and corresponding character vectors, and are respectively subjected to vector reduction operation on each feature vector in the N feature vectors, so as to obtain N reduced vectors corresponding to the N feature vectors one to one. Before the feature vectors are fused, the feature vectors are simplified to obtain the simplified vectors, so that the interference of character vectors of stop words such as ' the ' word ' and the corresponding feature vectors can be reduced as much as possible, the semantic features of the high layer in the feature extraction layer are ensured,
in one embodiment, the BERT model reduces the feature vector in the process, and reserves the feature vector part of the separator [ CLS ] as a reduced vector. By utilizing the characteristic vector part of the first separator [ CLS ] in each characteristic extraction layer, the accuracy can be ensured and a certain amount of calculation can be reduced to a certain extent.
S302, the attention mechanism is adopted to measure and calculate the weight of each simplified vector.
In step S302, since the reduced vector is obtained by reducing the feature vector, the reduced vector still retains the features of the corresponding feature vector, and the utilization of the reduced vector with a higher relevance weight is facilitated by using the attention mechanism, so that the obtained fusion vector can more comprehensively represent the text features of the text file.
Fig. 5 shows a flowchart for calculating a weight of each of the reduced vectors by using the attention mechanism according to an embodiment of the present application. As shown in fig. 5, as one embodiment, step S302 includes: using the attention mechanism to measure and calculate the weight of each reduced vector, comprising:
s3021, combining the multiple simplified vectors, and processing the combined simplified vectors by using linear transformation to obtain corresponding query vectors, key vectors and value vectors.
And S3022, similarity calculation is carried out on the query vector and the key vector, and a softmax function is adopted to carry out normalization processing on the similarity calculation result to obtain the weight of the reduced vector.
In step S3021, the reduced vectors from the multiple layers are merged; the merged reduced vectors then need to be reduced in dimension through a linear transformation (also called a fully connected layer), and the query vector, key vector, and value vector corresponding to the merged reduced vectors are obtained before they are used to calculate the weights.
In step S3022, the similarity between the query vector and the key vector is calculated, and a softmax function is used to normalize the similarity calculation result to obtain the weight of the reduced vector. The similarity function may be a dot product, concatenation, or a perceptron. The similarity values of the query vector and the key vectors are normalized so that their range is mapped to [0, 1], giving the weight of each reduced vector. The normalization turns the weights of the reduced vectors into a probability distribution, prevents vanishing or exploding gradients in the attention (Attention) mechanism, and accelerates convergence.
In an embodiment, suppose each reduced vector is a, the dimension of the original feature vector is 768, and there are N feature extraction layers. Merging (Merge) the a × 768 reduced vectors of the N layers gives N × a × 768; dimensionality reduction through a linear transformation (FC, also called full connection) gives N × a × 1, together with the query vector, key vector, and value vector corresponding to the a vectors; the similarity between the query vector and the key vector is calculated, and softmax (normalization) maps the range of the a-vector weights to [0, 1], giving the weight of each corresponding a vector.
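The following is one possible sketch of steps S3021, S3022, and S303: the N reduced vectors are merged, projected by linear (fully connected) layers into query, key, and value vectors, the query-key similarity is normalized with softmax to give per-layer weights, and the value vectors are combined by a weighted average. The projection shapes, the single summary query, and the dot-product similarity are assumptions made here; the application itself only specifies merging, linear transformation, softmax normalization, and weighted averaging.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerAttentionFusion(nn.Module):
        def __init__(self, hidden=768):
            super().__init__()
            self.query = nn.Linear(hidden, hidden)   # query vector projection
            self.key = nn.Linear(hidden, hidden)     # key vector projection
            self.value = nn.Linear(hidden, hidden)   # value vector projection

        def forward(self, reduced):                  # reduced: (batch, N, hidden) merged reduced vectors
            q = self.query(reduced.mean(dim=1, keepdim=True))   # one query summarising the merge
            k = self.key(reduced)
            v = self.value(reduced)
            # Dot-product similarity between query and keys, softmax-normalized so the
            # per-layer weights fall in [0, 1] and sum to 1.
            scores = torch.matmul(q, k.transpose(-1, -2)) / (reduced.size(-1) ** 0.5)
            weights = F.softmax(scores, dim=-1)                  # (batch, 1, N) weight per reduced vector
            fused = torch.matmul(weights, v).squeeze(1)          # weighted average -> (batch, hidden)
            return weights, fused

    weights, fused = LayerAttentionFusion()(torch.randn(2, 12, 768))
    print(weights.shape, fused.shape)                # torch.Size([2, 1, 12]) torch.Size([2, 768])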
And S303, based on the weight of each simplified vector, carrying out fusion processing on the N simplified vectors to obtain a fusion vector.
In step S303, since the reduced vectors are obtained by reducing the feature vectors, they still retain the features of the corresponding feature vectors. The weights of the reduced vectors and the value vectors are then combined by weighted averaging, and the weighted average is taken as the fusion vector. With the participation of the attention (Attention) mechanism, finer-grained feature associations among the feature extraction layers are learned than in the original BERT model. The weighted average can be computed using element-wise multiplication (Multiply).
In an embodiment, the weights of the reduced vectors and the value vectors are weighted-averaged; the weighted vectors may be combined by element-wise multiplication (Multiply) to obtain a fused vector of size a × 768.
It can be understood that the extracted fusion vector is used to describe the text features of the text file; for the downstream task, the fusion vector is passed through a linear transformation (FC, also called full connection) to map the text features back to the character sequence for output.
In an embodiment, for tasks downstream of the BERT model, the text features output by the BERT model are fine-tuned (Fine-tuning) to different degrees depending on how the output is used by the NLP task model. Whether to fine-tune is optional: without fine-tuning, the BERT model is used purely as a text feature extractor and only the weights of the text features it outputs are needed; with fine-tuning, the weights assigned to different character vectors in the BERT model are adjusted during training to adapt to the current task.
And S40, pre-training the BERT model before the step of respectively performing feature extraction on the character sequence through the N feature extraction layers of the pre-trained BERT model to obtain N feature vectors.
It can be understood that the BERT model belongs to a model of natural language processing, so that the BERT model also needs to be trained in advance through a large number of sample files to ensure the accuracy of text file processing.
S401, data enhancement is carried out on the preset sample file by adopting a data enhancement algorithm to obtain an addendum file.
It can be understood that the BERT model needs to be trained in advance before processing text files: unlabeled text files are used as sample files and fed into the BERT model for training. Pre-training the BERT model, however, requires a large number of sample files to ensure its robustness and learning performance. The sample files therefore comprise a number of preset sample files together with addendum files obtained by applying a data enhancement algorithm to the preset sample files.
The data enhancement methods include, but are not limited to, vocabulary replacement, text back-translation, and the Mixup data enhancement algorithm. Vocabulary replacement replaces original words in the text with similar words, obtaining a new expression while keeping the text semantics as unchanged as possible. Text back-translation translates the original document into text in another language and then translates it back, yielding an addendum file in the original language. The Mixup data enhancement algorithm originates from image enhancement algorithms in the field of computer vision, where the pixels of two images of different classes are combined to generate a synthetic training example; the Mixup algorithm has rarely been used for natural language processing.
In an embodiment, text mixing for natural language processing based on the Mixup data enhancement algorithm is provided to produce addendum files. The methods include wordMixup, senMixup, and tokenMixup. wordMixup works at the word level: two random sentences are zero-padded to the same length, and the input word vectors are interpolated and mixed in a certain proportion. senMixup takes two sentences, zero-pads them to the same length, performs feature encoding through the BERT encoder, and interpolates the resulting sentence vectors in a certain proportion. Unlike wordMixup, which replaces words, tokenMixup fuses embedding vectors at the character level; for example, the English word "playing" is split into "play ##ing", which better handles out-of-vocabulary (OOV) words.
The specific process of the Mixup data enhancement algorithm is, for example, as follows. Two text files are randomly selected from a number of text files, with contents such as:
1. [CLS] They went to the restaurant and found their own seats [SEP] the long table next to them, seating twenty people, was quite large and had considerable capacity. [SEP]
2. [CLS] He has not come back yet, so what should we do [SEP] Beijing is under a heavy rainstorm. [SEP]
(Separators are shown in each sample file for ease of understanding; the two text sentences in each sample file are semantically related.)
The text parts of sample file 1 and sample file 2 are then converted into character sequences as text A and text B respectively and input into the BERT model for processing; the BERT model extracts features to obtain two feature vectors A and B. The two feature vectors are then given weights and mixed according to those weights to obtain a new feature vector C, and finally the obtained feature vector C, together with the separator [SEP], is used as a new sample file for training. Because the fusion operates on the feature vectors extracted by the BERT model, the interference of the character vectors of stop words (such as "the") and their corresponding feature vectors is reduced as much as possible.
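A minimal sketch of the interpolation used to mix feature vectors A and B into the new feature vector C is given below; the mixing ratio, the label encoding, and the function name are illustrative assumptions rather than values specified by this application.

    import torch

    def mixup(vec_a, vec_b, label_a, label_b, lam=0.7):
        """Interpolate two feature vectors and their labels in a given proportion."""
        vec_c = lam * vec_a + (1.0 - lam) * vec_b
        label_c = lam * label_a + (1.0 - lam) * label_b
        return vec_c, label_c

    vec_a, vec_b = torch.randn(768), torch.randn(768)          # feature vectors A and B from the BERT model
    label_a, label_b = torch.tensor([1., 0.]), torch.tensor([0., 1.])
    vec_c, label_c = mixup(vec_a, vec_b, label_a, label_b)     # new synthetic sample C
    print(vec_c.shape, label_c)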
S402, training an initial BERT model by using a sample file formed by the preset sample file and the addendum file to obtain the pre-trained BERT model.
In step S402, an initial BERT model is trained using the sample files composed of the preset sample files and the addendum files, so as to obtain the pre-trained BERT model. The pre-training tasks performed on the BERT model using the sample files include NSP (Next Sentence Prediction, a text file classification task) and MLM (Masked Language Model, which uses random masking to train bidirectional features).
In one embodiment, a text file classification task is performed on the text features output for a sample file. For single-text classification, a symbol without obvious semantic information of its own (such as [CLS]) can fuse the semantic information of each character in the text more fairly than the other characters already present in the text. The dual-text (related) classification task has practical application scenarios including question answering (judging whether a question matches an answer) and sentence matching (whether two sentences express the same meaning); two different text vectors are added to the two sentences respectively to distinguish them.
In one embodiment, to train the bidirectional features of a character sequence, some characters (tokens) in the character sequence are randomly masked by the MLM pre-training method, and only the masked character vectors are predicted. 15% of the character vectors in the corpus are randomly masked, and the feature vectors output at the positions of the masked character vectors are then passed through normalization to predict the masked character vectors. To prevent the model from being affected by always replacing the selected character vectors with the [MASK] token, the following strategy is adopted for random masking: for 80% of the selected words, the character vector is replaced with [MASK], so "my dog is hairy" becomes "my dog is [MASK]"; for 10% of the selected words, the character vector is replaced with an arbitrary word, so "my dog is hairy" becomes, for example, "my dog is apple"; and for the remaining 10%, the word is left unchanged, so "my dog is hairy" stays "my dog is hairy".
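The random-masking strategy can be sketched as follows; the 15% selection rate and the 80/10/10 split follow the description above, while the function name, vocabulary, and example tokens are illustrative assumptions.

    import random

    def mask_tokens(tokens, vocab, mask_prob=0.15):
        """Randomly mask tokens for MLM pre-training and record the targets to predict."""
        labels = [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if tok in ("[CLS]", "[SEP]", "[PAD]") or random.random() > mask_prob:
                continue
            labels[i] = tok                         # only the masked positions are predicted
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"                # 80%: replace with the [MASK] character vector
            elif r < 0.9:
                tokens[i] = random.choice(vocab)    # 10%: replace with an arbitrary word
            # else 10%: keep the original word unchanged
        return tokens, labels

    print(mask_tokens(["[CLS]", "my", "dog", "is", "hairy", "[SEP]"], vocab=["apple", "today"]))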
In an embodiment, an extraction device of text features is provided, and the extraction device of text features corresponds to the extraction method of text features in the above embodiments one to one. As shown in fig. 6, the text feature extraction device includes a segmentation module 11, an extraction module 12, and a fusion module 13, and each functional module is described in detail as follows:
the segmentation module 11 is used for acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module 12 is used for respectively extracting the characteristics of the character sequences through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
a fusion module 13 for fusing the N feature vectors based on an attention mechanism to obtain a fusion vector; the fusion vector is used for describing text features of the text file;
for the specific definition of the text feature extraction device, reference may be made to the above definition of the text feature extraction method, which is not described herein again. The modules in the text feature extraction device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a client or a server, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the readable storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of extracting text features.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the extraction method of text features in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method for extracting text features in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for extracting text features is characterized by comprising the following steps:
acquiring a text file, and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
fusing the N characteristic vectors based on an attention mechanism to obtain fused vectors; the fusion vector is used for describing text features of the text file.
2. The method for extracting text features of claim 1, wherein the obtaining a text file, segmenting text contents of the text file through a word segmenter to obtain a character sequence, comprises:
acquiring a text file, segmenting the text content of the text file through a word segmentation device, and inserting separators;
and identifying each text divided by the separator by text coding in a query dictionary, and taking the obtained character vector, segment vector and position vector as a character sequence.
3. The method for extracting text features according to claim 1, wherein before the step of extracting features of the character sequence through N feature extraction layers of a pre-trained BERT model to obtain N feature vectors, the method further comprises:
performing data enhancement on a preset sample file by adopting a data enhancement algorithm to obtain an addendum file;
and training an initial BERT model by using a sample file consisting of the preset sample file and the addendum file to obtain the pre-trained BERT model.
4. The method for extracting text features according to claim 1, wherein the fusing the N feature vectors based on the attention mechanism to obtain a fused vector comprises:
respectively carrying out vector simplification operation on each feature vector in the N feature vectors to obtain N simplified vectors corresponding to the N feature vectors one by one;
measuring and calculating the weight of each simplified vector by adopting the attention mechanism;
and based on the weight of each simplified vector, carrying out fusion processing on the N simplified vectors to obtain a fusion vector.
5. The method for extracting text features according to claim 4, wherein the performing a vector reduction operation on each feature vector of the N feature vectors to obtain N reduced vectors corresponding to the N feature vectors one to one includes:
and screening the character vectors in each feature vector, and reserving the simplification operation of the key character vectors to obtain N simplified vectors corresponding to the N feature vectors one by one.
6. The method for extracting text features according to claim 4, wherein the calculating the weight of each reduced vector by using the attention mechanism comprises:
merging a plurality of simplified vectors, and processing the merged simplified vectors by utilizing linear transformation to obtain corresponding query vectors, key vectors and value vectors;
and performing similarity calculation on the query vector and the key vector, and performing normalization processing on a similarity calculation result by adopting a softmax function to obtain the weight of the simplified vector.
7. The method for extracting text features as claimed in claim 4 or claim 6, wherein the fusing the N reduced vectors based on the weight of each reduced vector to obtain a fused vector comprises:
and carrying out weighted average on the weight and the value vector of the simplified vector to obtain a weighted average value as a fusion vector.
8. An apparatus for extracting text features, comprising:
the segmentation module is used for acquiring a text file and segmenting the text content of the text file through a word segmentation device to obtain a character sequence;
the extraction module is used for respectively extracting the characteristics of the character sequence through N characteristic extraction layers of a pre-trained BERT model to obtain N characteristic vectors; wherein N is an integer greater than 1;
the fusion module is used for fusing the N characteristic vectors based on an attention mechanism to obtain fusion vectors; the fusion vector is used for describing text features of the text file.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of extracting text features according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements a method of extracting text features according to any one of claims 1 to 7.
CN202110775005.1A 2021-07-08 2021-07-08 Text feature extraction method and device, computer equipment and storage medium Pending CN113449081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110775005.1A CN113449081A (en) 2021-07-08 2021-07-08 Text feature extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110775005.1A CN113449081A (en) 2021-07-08 2021-07-08 Text feature extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113449081A (en) 2021-09-28

Family

ID=77815554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110775005.1A Pending CN113449081A (en) 2021-07-08 2021-07-08 Text feature extraction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449081A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180163A1 (en) * 2014-12-19 2016-06-23 Konica Minolta Laboratory U.S.A., Inc. Method for segmenting text words in document images using vertical projections of center zones of characters
CN110489424A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of tabular information extraction
CN110781312A (en) * 2019-09-19 2020-02-11 平安科技(深圳)有限公司 Text classification method and device based on semantic representation model and computer equipment
CN110674892A (en) * 2019-10-24 2020-01-10 北京航空航天大学 Fault feature screening method based on weighted multi-feature fusion and SVM classification
CN112733866A (en) * 2021-01-27 2021-04-30 西安理工大学 Network construction method for improving text description correctness of controllable image
CN113051371A (en) * 2021-04-12 2021-06-29 平安国际智慧城市科技股份有限公司 Chinese machine reading understanding method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187996A (en) * 2022-09-09 2022-10-14 中电科新型智慧城市研究院有限公司 Semantic recognition method and device, terminal equipment and storage medium
CN116842932A (en) * 2023-08-30 2023-10-03 腾讯科技(深圳)有限公司 Text feature decoding method and device, storage medium and electronic equipment
CN116842932B (en) * 2023-08-30 2023-11-14 腾讯科技(深圳)有限公司 Text feature decoding method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210928)