CN111553147A - BERT model based on N-gram and semantic segmentation method - Google Patents
- Publication number: CN111553147A
- Application number: CN202010230482.5A
- Authority
- CN
- China
- Prior art keywords
- word
- gram
- semantic segmentation
- bert model
- training data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention provides an N-gram-based BERT model and a semantic segmentation method. The model comprises a word splitting unit, a word encoding unit, a word concatenation input unit, and a discarding unit, which together yield the semantic segmentation result corresponding to each training data sample. By constructing this new model, the invention enhances the semantic representation of words and effectively improves the interpretability of the BERT model. In tests, the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to an N-gram-based BERT model and a semantic segmentation method.
Background
At present, natural language processing techniques based on deep learning are widely applied to scenarios that help humans understand and process natural language text; typical applications include text classification, dialogue systems, question answering, text translation, and natural language inference.
The proposal of BERT undoubtedly produced one of the superior models in the field of natural language processing: it integrates the advantages of earlier models, discards their disadvantages, and achieves good results on downstream natural language processing tasks. BERT concatenates two sentence sequences as the model input, with a symbol mark placed at the beginning and end of each sentence. For each word, BERT performs three different embedding operations: (1) converting each word into a fixed-dimension vector by Word2Vec encoding; (2) when a sentence pair is input, splicing the two sentences and assigning 0 to each word of the first sentence and 1 to each word of the second; (3) encoding the position of the word within its sentence. The three embedding result vectors are spliced to obtain the BERT word vector.
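The three embedding operations above can be sketched as follows. The vocabulary, the 8-dimensional lookup tables, and the random initialisation are illustrative placeholders; the sketch splices the three vectors as the text describes, whereas the reference BERT implementation sums them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions -- illustrative placeholders only.
VOCAB = {"[CLS]": 0, "my": 1, "dog": 2, "[SEP]": 3, "he": 4, "likes": 5, "play": 6}
DIM = 8  # real BERT uses 768

# One lookup table per embedding type (randomly initialised here).
token_table = rng.normal(size=(len(VOCAB), DIM))
segment_table = rng.normal(size=(2, DIM))    # segment id 0 or 1
position_table = rng.normal(size=(32, DIM))  # max sequence length 32

def embed(tokens, segment_ids):
    """Splice token, segment and position embeddings for a packed sentence pair."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        v = np.concatenate([token_table[VOCAB[tok]],  # word (Word2Vec-style) embedding
                            segment_table[seg],       # 0 = first sentence, 1 = second
                            position_table[pos]])     # position of the word
        vectors.append(v)
    return np.stack(vectors)

# Two sentences packed into one input, marked at start and end:
tokens = ["[CLS]", "my", "dog", "[SEP]", "he", "likes", "play", "[SEP]"]
segments = [0, 0, 0, 0, 1, 1, 1, 1]
X = embed(tokens, segments)
print(X.shape)  # (8, 24): one spliced vector per token
```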
Compared with traditional language models, BERT is a deep neural network of roughly one hundred layers whose parameters are learned from large-scale corpora, so richer syntactic, lexical, and semantic information is merged into the BERT word vectors; moreover, because BERT is trained on word units, it can to some extent alleviate the out-of-vocabulary problem faced by Word2Vec. However, BERT learns contextual representations from word-level representations. To date, little work has examined how BERT handles language units larger than a single word, and interpreting elements such as phrases through word-level attention matrices or word-level contextual embeddings is not intuitive. Word-level encoding and the word-level attention mechanism thus limit the interpretability of BERT.
Disclosure of Invention
The invention aims to solve the problems of word-level processing in the BERT model, and provides an N-gram-based BERT model and a semantic segmentation method. The invention applies the idea of the N-gram to modify the BERT model so that two or even three words are taken as one input unit, thereby enhancing the semantic information between words.
The technical scheme of the invention is as follows:
the invention provides an N-Gram-based BERT model, which comprises the following components:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; and encoding the positions of all words in the group in order, wherein i denotes the word position;
a word concatenation input unit: taking n words as input, i.e., word i together with the (n-1) words following it, for training;
a discarding unit: setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample.
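The word concatenation input unit and the discarding unit can be sketched together; the vector dimension and the word vectors themselves are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ngram_inputs(vectors, n=2):
    """Word concatenation input unit: for each position i, concatenate the
    vector of word i with the vectors of the (n-1) words following it."""
    grams = []
    for i in range(len(vectors) - n + 1):
        grams.append(np.concatenate(vectors[i:i + n]))
    return np.stack(grams)

def dropout(x, p=0.5, training=True):
    """Discarding unit: randomly zero activations with probability p
    (inverted dropout, rescaled at training time)."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

words = [rng.normal(size=4) for _ in range(5)]  # 5 word vectors of dim 4
bigrams = ngram_inputs(words, n=2)
print(bigrams.shape)          # (4, 8): four bigrams, each a concatenated pair
out = dropout(bigrams, p=0.5)  # drop coefficient 0.5, as in the disclosure
```

With `n=3` the same function yields trigram inputs, matching the "two or even three words" described above.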
A semantic segmentation method based on the N-Gram BERT model comprises the following steps:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; encoding the positions of all words in the group in order, wherein i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words following it; setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample;
traversing the training data samples with the above steps until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement; training is then complete and the N-gram BERT model is obtained;
S3, performing semantic segmentation with the N-gram BERT model:
for a sentence to be recognized, preprocessing it according to step S1 to obtain a plurality of data samples, then inputting them into the N-gram BERT model to obtain the semantic segmentation result. Comparing BERT with N-gram BERT on the LCQMC data set shows that the N-gram BERT model performs better and segments semantics more accurately.
Further, in step S1, the corpus is Chinese Wikipedia.
Further, in step S1, the extractor is a Wikipedia extractor, and the preprocessing comprises the following steps: first, processing the original data set with the Wikipedia extractor, removing titles, adjusting the text format, and removing blank lines to obtain an original article paragraph data set;
second, converting the original article paragraph data set into simplified Chinese and traditional Chinese forms;
finally, performing paragraph splitting on the simplified and traditional Chinese data sets respectively to form the pre-training data set.
Further, adjusting the text format comprises: unifying the font size and removing code annotation symbols such as <doc> and </doc>, while retaining sentence punctuation such as commas, periods, and exclamation marks.
Further, the paragraph splitting specifically comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, the sentence is truncated at the threshold and the truncated part is placed on the next line as the next sentence.
Further, the character threshold is 128 characters.
Further, an OpenCC tool is used for conversion between simplified and traditional Chinese.
The invention has the beneficial effects that:
the invention enhances the semantic representation of the words by constructing a new model, and effectively improves the interpretation capability of the BERT model. The accuracy of the novel N-gram BERT model is higher than that of the original BERT model on the test result, and the matching precision of Chinese semantic similarity is improved. Meanwhile, the method has good generalization capability on other languages such as English, French and the like.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a schematic diagram of the model structure of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
As shown in Fig. 1, the N-Gram-based BERT model comprises:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; and encoding the positions of all words in the group in order, wherein i denotes the word position;
a word concatenation input unit: taking n words as input, i.e., word i together with the (n-1) words following it, for training;
a discarding unit: setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample.
A semantic segmentation method based on an N-Gram BERT model comprises the following steps: s1, acquiring a training data set:
selecting a corpus as the original data set, taking Chinese Wikipedia as an example; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which serve as the training data set for semantic segmentation;
the extractor adopts a Wikipedia extractor; the pretreatment step comprises the following steps: firstly, processing an original data set by using a Wikipedia extractor, removing a title, adjusting a character format (unifying the font size, removing code annotation symbols such as < doc >, <' > and the like, reserving period symbols such as comma, period symbol, exclamation mark and the like), and removing blank lines to obtain an original article paragraph data set; secondly, converting the original article paragraph data set into original article paragraph data sets in two forms of simplified Chinese and traditional Chinese by utilizing an OpenCC tool; and finally, respectively carrying out paragraph splitting on original article paragraph data sets of simplified Chinese and traditional Chinese to form a pre-training data set.
The paragraph splitting is as follows: a threshold of 128 characters per training sample, i.e., per line, is set; if any sentence exceeds this threshold, it is truncated at the threshold and the truncated part is placed on the next line as the next sentence.
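The 128-character truncation can be sketched as below; applying the truncation repeatedly to very long sentences is an assumption, since the text only specifies a single truncation step:

```python
THRESHOLD = 128  # character threshold per line, as in the disclosure

def split_paragraph(sentence, threshold=THRESHOLD):
    """Truncate a sentence at the character threshold; the truncated part
    becomes the next line (applied repeatedly for very long sentences)."""
    lines = []
    while len(sentence) > threshold:
        lines.append(sentence[:threshold])
        sentence = sentence[threshold:]
    lines.append(sentence)
    return lines

long_sentence = "语" * 300
lines = split_paragraph(long_sentence)
print([len(line) for line in lines])  # [128, 128, 44]
```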
S2, establishing an N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; encoding the positions of all words in the group in order, wherein i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words following it; setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample;
traversing the training data samples with the above steps until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement; training is then complete and the N-gram BERT model is obtained;
S3, performing semantic segmentation with the N-gram BERT model:
for a sentence to be recognized, preprocessing it according to step S1 to obtain a plurality of data samples, then inputting them into the N-gram BERT model to obtain the semantic segmentation result. Comparing BERT with N-gram BERT on the LCQMC data set, the test results show that the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.
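The traversal-until-match criterion of step S2 can be sketched as a simple driver loop; `target_rate`, `max_epochs`, and the toy stand-in model below are assumptions for illustration, not part of the patent:

```python
def train_until_match(model, samples, references, target_rate=0.9, max_epochs=50):
    """Traverse the samples until the model's segmentation matches the
    reference segmentation at the preset rate, then stop training."""
    rate = 0.0
    for epoch in range(max_epochs):
        hits = 0
        for sample, reference in zip(samples, references):
            prediction = model(sample)  # model returns a segmentation
            if prediction == reference:
                hits += 1
        rate = hits / len(samples)
        if rate >= target_rate:
            return epoch, rate          # preset requirement reached
    return max_epochs, rate

# Toy stand-in "model" that already segments perfectly (whitespace split):
identity_model = lambda s: s.split()
samples = ["我 爱 自然 语言", "语义 分割 方法"]
refs = [s.split() for s in samples]
print(train_until_match(identity_model, samples, refs))  # (0, 1.0)
```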
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.
Claims (8)
1. An N-Gram based BERT model, the model comprising:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; and encoding the positions of all words in the group in order, wherein i denotes the word position;
a word concatenation input unit: taking n words as input, i.e., word i together with the (n-1) words following it, for training;
a discarding unit: setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample.
2. A semantic segmentation method based on an N-Gram BERT model, applying the N-Gram BERT model of claim 1, comprising the steps of:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimension vector by Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second; encoding the positions of all words in the group in order, wherein i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words following it; setting the drop coefficient to 0.5 and screening and discarding the training result with a dropout function to obtain the semantic segmentation result corresponding to the training data sample;
traversing the training data samples with the above steps until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement; training is then complete and the N-gram BERT model is obtained;
S3, performing semantic segmentation with the N-gram BERT model:
and for a sentence to be recognized, preprocessing it according to step S1 to obtain a plurality of data samples, and inputting them into the N-gram BERT model to obtain the semantic segmentation result.
3. The N-Gram BERT model-based semantic segmentation method according to claim 2, wherein in step S1, the corpus is Chinese Wikipedia.
4. The N-Gram BERT model-based semantic segmentation method according to claim 2, wherein in step S1, the extractor is a Wikipedia extractor, and the preprocessing comprises the following steps: firstly, processing the original data set with the Wikipedia extractor, removing titles, adjusting the text format, and removing blank lines to obtain an original article paragraph data set;
secondly, converting the original article paragraph data set into original article paragraph data sets in simplified Chinese and traditional Chinese forms;
and finally, respectively carrying out paragraph splitting on original article paragraph data sets of simplified Chinese and traditional Chinese to form a pre-training data set.
5. The N-Gram BERT model-based semantic segmentation method according to claim 4, wherein adjusting the text format comprises: unifying font size and removing code annotation symbols.
6. The N-Gram BERT model-based semantic segmentation method according to claim 4, wherein the paragraph splitting specifically comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, the sentence is truncated at the threshold and the truncated part is placed on the next line as the next sentence.
7. The N-Gram BERT model-based semantic segmentation method according to claim 6, wherein the character threshold is 128 characters.
8. The N-Gram BERT model-based semantic segmentation method of claim 4, wherein OpenCC tools are used for conversion between simplified Chinese and traditional Chinese.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010230482.5A CN111553147A (en) | 2020-03-27 | 2020-03-27 | BERT model based on N-gram and semantic segmentation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111553147A true CN111553147A (en) | 2020-08-18 |
Family
ID=71998027
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010230482.5A Pending CN111553147A (en) | 2020-03-27 | 2020-03-27 | BERT model based on N-gram and semantic segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111553147A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104793724A (en) * | 2014-01-16 | 2015-07-22 | 北京三星通信技术研究有限公司 | Sky-writing processing method and device |
CN108960207A (en) * | 2018-08-08 | 2018-12-07 | 广东工业大学 | A kind of method of image recognition, system and associated component |
CN108984724A (en) * | 2018-07-10 | 2018-12-11 | 凯尔博特信息科技(昆山)有限公司 | It indicates to improve particular community emotional semantic classification accuracy rate method using higher-dimension |
CN109740406A (en) * | 2018-08-16 | 2019-05-10 | 大连民族大学 | Non-division block letter language of the Manchus word recognition methods and identification network |
CN110175246A (en) * | 2019-04-09 | 2019-08-27 | 山东科技大学 | A method of extracting notional word from video caption |
CN110457701A (en) * | 2019-08-08 | 2019-11-15 | 南京邮电大学 | Dual training method based on interpretation confrontation text |
Non-Patent Citations (1)
Title |
---|
李莉贞: "Research on spatio-temporal modeling and prediction of wireless network data based on deep learning" *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257459A (en) * | 2020-10-16 | 2021-01-22 | 北京有竹居网络技术有限公司 | Language translation model training method, translation method, device and electronic equipment |
CN112749544A (en) * | 2020-12-28 | 2021-05-04 | 苏州思必驰信息科技有限公司 | Training method and system for paragraph segmentation model |
CN112749544B (en) * | 2020-12-28 | 2024-04-30 | 思必驰科技股份有限公司 | Training method and system of paragraph segmentation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20200818 |