CN111553147A - BERT model based on N-gram and semantic segmentation method

BERT model based on N-gram and semantic segmentation method

Info

Publication number
CN111553147A
Authority
CN
China
Prior art keywords
word, gram, semantic segmentation, BERT model, training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010230482.5A
Other languages
Chinese (zh)
Inventor
徐思昊 (Xu Sihao)
张帆 (Zhang Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Moshen Information Technology Co., Ltd.
Nanjing Tech University
Original Assignee
Nanjing Moshen Information Technology Co., Ltd.
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Moshen Information Technology Co., Ltd. and Nanjing Tech University
Priority to CN202010230482.5A
Publication of CN111553147A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides an N-gram-based BERT model and a semantic segmentation method. The model comprises a word splitting unit, a word encoding unit, a word concatenation input unit and a discarding unit, which together produce the semantic segmentation result corresponding to each training data sample. By constructing this new model, the invention enhances the semantic representation of words and effectively improves the interpretive capability of the BERT model. In tests, the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.

Description

BERT model based on N-gram and semantic segmentation method
Technical Field
The invention relates to the field of natural language processing, and in particular to an N-gram-based BERT model and a semantic segmentation method.
Background
At present, natural language processing techniques based on deep learning are widely applied to scenarios that assist humans in understanding and processing natural language text. Typical application scenarios include text classification, dialogue systems, question answering systems, text translation and natural language inference.
The introduction of BERT undoubtedly produced one of the strongest models in the field of natural language processing: it integrates the advantages of earlier models while discarding their weaknesses, and achieves good results on downstream natural language processing tasks. BERT concatenates two sentence sequences as the model input, with a marker symbol placed at the beginning and end of each sentence. For each word, BERT performs three embedding operations: (1) each word is converted into a fixed-dimensional vector via Word2Vec-style encoding; (2) when a sentence pair is input, the two sentences are concatenated and each word in the first sentence is assigned segment value 0 while each word in the second sentence is assigned 1; (3) the position of each word within the sequence is encoded. The three embedding result vectors are then combined to obtain the BERT word vector.
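As a concrete illustration, the following minimal sketch builds the combined input vector from the three embedding tables. The table sizes mirror BERT-base-Chinese but are randomly initialized here, and the token ids are hypothetical; note that the published BERT implementation sums the three vectors element-wise.

```python
import numpy as np

VOCAB, HIDDEN, MAX_POS = 21128, 768, 512   # illustrative sizes (BERT-base-Chinese)
rng = np.random.default_rng(0)
token_table = rng.normal(0, 0.02, (VOCAB, HIDDEN))      # per-word (Word2Vec-style) vectors
segment_table = rng.normal(0, 0.02, (2, HIDDEN))        # first sentence -> 0, second -> 1
position_table = rng.normal(0, 0.02, (MAX_POS, HIDDEN))

def bert_input_embedding(token_ids, segment_ids):
    """Combine the three per-word embeddings into one input vector per word.
    The published BERT implementation sums them element-wise."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

# A sentence pair joined into one sequence: hypothetical token ids, with
# segment value 0 for the first sentence and 1 for the second.
token_ids = np.array([101, 2769, 102, 872, 1962, 102])
segment_ids = np.array([0, 0, 0, 1, 1, 1])
print(bert_input_embedding(token_ids, segment_ids).shape)   # (6, 768)
```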
Compared with traditional language models, BERT is a deep neural network of roughly one hundred layers that learns its parameters from large-scale corpora, so more syntactic, lexical and semantic information is merged into the BERT word vector. Because BERT is trained on word units, it can to some extent alleviate the out-of-vocabulary problem faced by Word2Vec. However, BERT learns contextual representations through word-level representations. To date, little work has examined linguistic units in BERT larger than a single word, and interpreting elements such as phrases through word-level attention matrices or word-level contextual embeddings is not intuitive. Word-level encoding and the attention mechanism therefore limit the interpretability of BERT.
Disclosure of Invention
The invention aims to solve the problems arising from word-level processing in the BERT model, and provides an N-gram-based BERT model and a semantic segmentation method. The invention modifies the BERT model with the idea of N-grams, so that two or even three words are taken together as input, thereby enhancing the semantic information shared between words.
The technical scheme of the invention is as follows:
the invention provides an N-Gram-based BERT model, which comprises the following components:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each Word into a vector with fixed dimensionality, and carrying out Word2Vec coding to obtain a vector corresponding to each Word; splicing every two training data samples which are adjacent in sequence to form a group, assigning 0 to each word in a first sample, and assigning 1 to each word in a second sample; sequentially coding the positions of all words in the group, wherein i represents the word position;
word concatenation input unit: taking n words as input, taking the word i and i + (n-1) words behind the word i as input for training;
a discarding unit: and setting a discarding coefficient to be 0.5, screening and discarding the training result by adopting a discarding model, namely a dropout function, and obtaining a semantic segmentation result corresponding to the training data sample.
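The following is a minimal sketch of the word concatenation input unit and the discarding unit. The patent does not fix how the n word vectors are combined, so concatenation of each window of n consecutive vectors is assumed here, together with inverted dropout at the stated coefficient of 0.5; all sizes are toy values.

```python
import numpy as np

def ngram_inputs(word_vectors, n=2):
    """For each position i, concatenate word i with the (n-1) words that
    follow it into a single input vector (the word concatenation input unit)."""
    T, d = word_vectors.shape
    return np.stack([word_vectors[i:i + n].reshape(n * d) for i in range(T - n + 1)])

def dropout(x, p=0.5, rng=None):
    """Inverted dropout with discard coefficient p: zero a fraction p of the
    activations and rescale the survivors (the discarding unit)."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))        # 6 words with toy 8-dimensional vectors
bigrams = ngram_inputs(words, n=2)     # shape (5, 16): word i joined with word i+1
print(dropout(bigrams, p=0.5, rng=rng).shape)
```

Setting n = 3 concatenates each word with the two words after it, matching the "two or even three words" variant described above.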
A semantic segmentation method based on the N-gram BERT model comprises the following steps:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained;
S3, performing semantic segmentation using the N-gram BERT model:
for a sentence to be recognized, preprocess it according to step S1 to obtain a plurality of data samples, input these into the N-gram BERT model, and obtain the semantic segmentation result; comparing BERT with N-gram BERT on the LCQMC data set, the N-gram BERT model performs better and its semantic segmentation is more accurate.
Further, in step S1, the corpus is chinese wikipedia.
Further, in step S1, the extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, the original data set is processed with the Wikipedia extractor (see the extraction sketch after these steps), removing titles, adjusting the text format and removing blank lines to obtain the original article paragraph data set;
secondly, the original article paragraph data set is converted into simplified-Chinese and traditional-Chinese forms;
and finally, paragraph splitting is performed on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
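A minimal sketch of the first extraction step, assuming the wikiextractor package's command-line interface (the patent names only a "Wikipedia extractor", and the dump filename here is illustrative):

```python
# pip install wikiextractor
import subprocess

# Extract plain-text article paragraphs from a Chinese Wikipedia dump into
# the 'extracted' directory; titles, markup and blank lines are then cleaned
# in the subsequent steps described above.
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor",
     "zhwiki-latest-pages-articles.xml.bz2", "-o", "extracted"],
    check=True,
)
```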
Further, adjusting the text format includes unifying the font size and removing markup symbols such as <doc> and </doc>, while preserving sentence punctuation such as commas, periods and exclamation marks.
Further, the paragraph splitting specifically comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, the sentence is truncated at the threshold and the truncated part is placed on the next line as the next sentence.
Further, the character threshold is 128 characters.
Further, the OpenCC tool is used for the conversion between simplified and traditional Chinese.
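A minimal sketch of the simplified/traditional conversion, assuming the opencc-python-reimplemented binding (the patent names only the OpenCC tool, not a specific interface):

```python
# pip install opencc-python-reimplemented
from opencc import OpenCC

to_simplified = OpenCC('t2s')    # traditional -> simplified
to_traditional = OpenCC('s2t')   # simplified -> traditional

def both_forms(paragraphs):
    """Produce simplified and traditional copies of each extracted paragraph,
    mirroring the second preprocessing step."""
    return ([to_simplified.convert(p) for p in paragraphs],
            [to_traditional.convert(p) for p in paragraphs])

simp, trad = both_forms(["自然語言處理"])
print(simp[0], trad[0])   # 自然语言处理 自然語言處理
```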
The invention has the beneficial effects that:
the invention enhances the semantic representation of the words by constructing a new model, and effectively improves the interpretation capability of the BERT model. The accuracy of the novel N-gram BERT model is higher than that of the original BERT model on the test result, and the matching precision of Chinese semantic similarity is improved. Meanwhile, the method has good generalization capability on other languages such as English, French and the like.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a schematic diagram of the model structure of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
As shown in Fig. 1, an N-gram-based BERT model comprises:
a word splitting unit, which splits a training data sample into a plurality of words;
a word encoding unit, which converts each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenates every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; and encodes the positions of all words in the group in order, where i denotes the word position;
a word concatenation input unit, which takes n words as input, i.e., word i together with the (n-1) words that follow it, for training;
a discarding unit, which sets the discard coefficient to 0.5 and applies a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample.
A semantic segmentation method based on the N-gram BERT model comprises the following steps:
S1, acquiring a training data set:
selecting a corpus as the original data set, taking Chinese Wikipedia as an example; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
The extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, the original data set is processed with the Wikipedia extractor, removing titles, adjusting the text format (unifying the font size, removing markup symbols such as <doc> and </doc>, and preserving sentence punctuation such as commas, periods and exclamation marks) and removing blank lines to obtain the original article paragraph data set; secondly, the original article paragraph data set is converted into simplified-Chinese and traditional-Chinese forms using the OpenCC tool; finally, paragraph splitting is performed on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
The paragraph splitting is as follows: a threshold of 128 characters per training sample, i.e., per line, is set; if any sentence exceeds this threshold, it is truncated at the threshold and the truncated portion is placed on the next line as the next sentence.
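A minimal sketch of this splitting rule; the hard cut at exactly 128 characters follows the description above, and any finer sentence-boundary handling is not specified by the patent:

```python
MAX_CHARS = 128   # character threshold per training sample (one line)

def split_line(sentence, limit=MAX_CHARS):
    """Truncate an over-long sentence at the threshold and carry the
    remainder onto the next line(s) as new sentences."""
    lines = []
    while len(sentence) > limit:
        lines.append(sentence[:limit])
        sentence = sentence[limit:]
    if sentence:
        lines.append(sentence)
    return lines

# A 300-character sentence becomes lines of 128, 128 and 44 characters.
print([len(s) for s in split_line("句" * 300)])   # [128, 128, 44]
```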
S2, establishing an N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained; the stopping criterion is sketched after this step;
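The stopping criterion can be sketched as below. `model.train_step` and `model.segment` are hypothetical interfaces standing in for the units described above, and the 0.9 target is a placeholder, since the patent leaves the preset matching-rate requirement unspecified:

```python
def train_until_match(model, samples, references, target=0.9):
    """Traverse the training samples repeatedly until the fraction of model
    outputs matching the reference segmentations reaches the preset target."""
    while True:
        for sample in samples:
            model.train_step(sample)              # hypothetical parameter update
        matches = sum(model.segment(s) == ref     # hypothetical inference call
                      for s, ref in zip(samples, references))
        if matches / len(samples) >= target:
            return model
```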
S3, performing semantic segmentation using the N-gram BERT model:
for a sentence to be recognized, preprocess it according to step S1 to obtain a plurality of data samples, input these into the N-gram BERT model, and obtain the semantic segmentation result. Comparing BERT with N-gram BERT on the LCQMC data set, the test results show that the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.
Having described embodiments of the present invention, it should be noted that the foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Claims (8)

1. An N-gram-based BERT model, the model comprising:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; and encoding the positions of all words in the group in order, where i denotes the word position;
a word concatenation input unit: taking n words as input, i.e., word i together with the (n-1) words that follow it, for training;
a discarding unit: setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample.
2. A semantic segmentation method based on an N-gram BERT model, applying the N-gram BERT model of claim 1 and comprising the steps of:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained;
S3, performing semantic segmentation using the N-gram BERT model:
and for a sentence to be recognized, preprocessing it according to step S1 to obtain a plurality of data samples, inputting these into the N-gram BERT model, and obtaining the semantic segmentation result.
3. The semantic segmentation method based on the N-gram BERT model according to claim 2, wherein in step S1, the corpus is Chinese Wikipedia.
4. The semantic segmentation method based on the N-gram BERT model according to claim 2, wherein in step S1, the extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, processing the original data set with the Wikipedia extractor, removing titles, adjusting the text format and removing blank lines to obtain the original article paragraph data set;
secondly, converting the original article paragraph data set into simplified-Chinese and traditional-Chinese forms;
and finally, performing paragraph splitting on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
5. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein adjusting the text format comprises: unifying the font size and removing markup symbols.
6. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein the paragraph splitting comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, truncating the sentence at the threshold and placing the truncated part on the next line as the next sentence.
7. The semantic segmentation method based on the N-gram BERT model according to claim 6, wherein the character threshold is 128 characters.
8. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein the OpenCC tool is used for the conversion between simplified and traditional Chinese.
CN202010230482.5A 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method Pending CN111553147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010230482.5A CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010230482.5A CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Publications (1)

Publication Number Publication Date
CN111553147A true CN111553147A (en) 2020-08-18

Family

ID=71998027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010230482.5A Pending CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Country Status (1)

Country Link
CN (1) CN111553147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257459A (en) * 2020-10-16 2021-01-22 北京有竹居网络技术有限公司 Language translation model training method, translation method, device and electronic equipment
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104793724A (en) * 2014-01-16 2015-07-22 北京三星通信技术研究有限公司 Sky-writing processing method and device
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN108984724A (en) * 2018-07-10 2018-12-11 凯尔博特信息科技(昆山)有限公司 It indicates to improve particular community emotional semantic classification accuracy rate method using higher-dimension
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
CN110175246A (en) * 2019-04-09 2019-08-27 山东科技大学 A method of extracting notional word from video caption
CN110457701A (en) * 2019-08-08 2019-11-15 南京邮电大学 Dual training method based on interpretation confrontation text

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104793724A (en) * 2014-01-16 2015-07-22 北京三星通信技术研究有限公司 Sky-writing processing method and device
CN108984724A (en) * 2018-07-10 2018-12-11 凯尔博特信息科技(昆山)有限公司 It indicates to improve particular community emotional semantic classification accuracy rate method using higher-dimension
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
CN110175246A (en) * 2019-04-09 2019-08-27 山东科技大学 A method of extracting notional word from video caption
CN110457701A (en) * 2019-08-08 2019-11-15 南京邮电大学 Dual training method based on interpretation confrontation text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李莉贞 (Li Lizhen): "Research on spatio-temporal modeling and prediction of wireless network data based on deep learning" (in Chinese) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257459A (en) * 2020-10-16 2021-01-22 北京有竹居网络技术有限公司 Language translation model training method, translation method, device and electronic equipment
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN112749544B (en) * 2020-12-28 2024-04-30 思必驰科技股份有限公司 Training method and system of paragraph segmentation model

Similar Documents

Publication Publication Date Title
CN110096698B (en) Topic-considered machine reading understanding model generation method and system
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN109918681B (en) Chinese character-pinyin-based fusion problem semantic matching method
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
US20240005093A1 (en) Device, method and program for natural language processing
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN115292461B (en) Man-machine interaction learning method and system based on voice recognition
CN111858888A (en) Multi-round dialogue system of check-in scene
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN112417823B (en) Chinese text word order adjustment and word completion method and system
CN112052319B (en) Intelligent customer service method and system based on multi-feature fusion
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
CN112528649A (en) English pinyin identification method and system for multi-language mixed text
CN111553147A (en) BERT model based on N-gram and semantic segmentation method
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN111444720A (en) Named entity recognition method for English text
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN114927177A (en) Medical entity identification method and system fusing Chinese medical field characteristics
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN112183060B (en) Reference resolution method of multi-round dialogue system
Ajees et al. A named entity recognition system for Malayalam using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200818