CN111553147A - BERT model based on N-gram and semantic segmentation method

BERT model based on N-gram and semantic segmentation method

Info

Publication number
CN111553147A
Authority
CN
China
Prior art keywords
word, gram, semantic segmentation, BERT model, training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010230482.5A
Other languages
Chinese (zh)
Inventor
徐思昊 (Xu Sihao)
张帆 (Zhang Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Moshen Information Technology Co., Ltd.
Nanjing Tech University
Original Assignee
Nanjing Moshen Information Technology Co., Ltd.
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Moshen Information Technology Co., Ltd. and Nanjing Tech University
Priority to CN202010230482.5A
Publication of CN111553147A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides an N-gram-based BERT model and a semantic segmentation method. The model comprises a word splitting unit, a word encoding unit, a word concatenation input unit and a discarding unit, which together produce the semantic segmentation result corresponding to each training data sample. By constructing this new model, the invention enhances the semantic representation of words and effectively improves the interpretive capability of the BERT model. In tests, the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.

Description

BERT model based on N-gram and semantic segmentation method
Technical Field
The invention relates to the field of natural language processing, and in particular to an N-gram-based BERT model and a semantic segmentation method.
Background
At present, natural language processing techniques based on deep learning are widely applied to scenarios that assist humans in understanding and processing natural language text. Typical application scenarios include text classification, dialogue systems, question answering systems, text translation and natural language inference.
The introduction of BERT undoubtedly produced one of the strongest models in the field of natural language processing: it integrates the advantages of earlier models while discarding their weaknesses, and achieves good results on downstream natural language processing tasks. BERT concatenates two sentence sequences as the model input, with a marker symbol placed at the beginning and end of each sentence. For each word, BERT performs three embedding operations: (1) each word is converted into a fixed-dimensional vector via Word2Vec-style encoding; (2) when a sentence pair is input, the two sentences are concatenated and each word in the first sentence is assigned segment value 0 while each word in the second sentence is assigned 1; (3) the position of each word within the sequence is encoded. The three embedding result vectors are then combined to obtain the BERT word vector.
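As a concrete illustration, the following minimal sketch builds the combined input vector from the three embedding tables. The table sizes mirror BERT-base-Chinese but are randomly initialized here, and the token ids are hypothetical; note that the published BERT implementation sums the three vectors element-wise.

```python
import numpy as np

VOCAB, HIDDEN, MAX_POS = 21128, 768, 512   # illustrative sizes (BERT-base-Chinese)
rng = np.random.default_rng(0)
token_table = rng.normal(0, 0.02, (VOCAB, HIDDEN))      # per-word (Word2Vec-style) vectors
segment_table = rng.normal(0, 0.02, (2, HIDDEN))        # first sentence -> 0, second -> 1
position_table = rng.normal(0, 0.02, (MAX_POS, HIDDEN))

def bert_input_embedding(token_ids, segment_ids):
    """Combine the three per-word embeddings into one input vector per word.
    The published BERT implementation sums them element-wise."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + segment_table[segment_ids] + position_table[positions]

# A sentence pair joined into one sequence: hypothetical token ids, with
# segment value 0 for the first sentence and 1 for the second.
token_ids = np.array([101, 2769, 102, 872, 1962, 102])
segment_ids = np.array([0, 0, 0, 1, 1, 1])
print(bert_input_embedding(token_ids, segment_ids).shape)   # (6, 768)
```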
Compared with traditional language models, BERT is a deep neural network of roughly one hundred layers that learns its parameters from large-scale corpora, so more syntactic, lexical and semantic information is merged into the BERT word vector. Because BERT is trained on word units, it can to some extent alleviate the out-of-vocabulary problem faced by Word2Vec. However, BERT learns contextual representations through word-level representations. To date, little work has examined linguistic units in BERT larger than a single word, and interpreting elements such as phrases through word-level attention matrices or word-level contextual embeddings is not intuitive. Word-level encoding and the attention mechanism therefore limit the interpretability of BERT.
Disclosure of Invention
The invention aims to solve the problems arising from word-level processing in the BERT model, and provides an N-gram-based BERT model and a semantic segmentation method. The invention modifies the BERT model with the idea of N-grams, so that two or even three words are taken together as input, thereby enhancing the semantic information shared between words.
The technical scheme of the invention is as follows:
the invention provides an N-Gram-based BERT model, which comprises the following components:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each Word into a vector with fixed dimensionality, and carrying out Word2Vec coding to obtain a vector corresponding to each Word; splicing every two training data samples which are adjacent in sequence to form a group, assigning 0 to each word in a first sample, and assigning 1 to each word in a second sample; sequentially coding the positions of all words in the group, wherein i represents the word position;
word concatenation input unit: taking n words as input, taking the word i and i + (n-1) words behind the word i as input for training;
a discarding unit: and setting a discarding coefficient to be 0.5, screening and discarding the training result by adopting a discarding model, namely a dropout function, and obtaining a semantic segmentation result corresponding to the training data sample.
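The following is a minimal sketch of the word concatenation input unit and the discarding unit. The patent does not fix how the n word vectors are combined, so concatenation of each window of n consecutive vectors is assumed here, together with inverted dropout at the stated coefficient of 0.5; all sizes are toy values.

```python
import numpy as np

def ngram_inputs(word_vectors, n=2):
    """For each position i, concatenate word i with the (n-1) words that
    follow it into a single input vector (the word concatenation input unit)."""
    T, d = word_vectors.shape
    return np.stack([word_vectors[i:i + n].reshape(n * d) for i in range(T - n + 1)])

def dropout(x, p=0.5, rng=None):
    """Inverted dropout with discard coefficient p: zero a fraction p of the
    activations and rescale the survivors (the discarding unit)."""
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
words = rng.normal(size=(6, 8))        # 6 words with toy 8-dimensional vectors
bigrams = ngram_inputs(words, n=2)     # shape (5, 16): word i joined with word i+1
print(dropout(bigrams, p=0.5, rng=rng).shape)
```

Setting n = 3 concatenates each word with the two words after it, matching the "two or even three words" variant described above.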
A semantic segmentation method based on the N-gram BERT model comprises the following steps:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained;
S3, performing semantic segmentation using the N-gram BERT model:
for a sentence to be recognized, preprocess it according to step S1 to obtain a plurality of data samples, input these into the N-gram BERT model, and obtain the semantic segmentation result; comparing BERT with N-gram BERT on the LCQMC data set, the N-gram BERT model performs better and its semantic segmentation is more accurate.
Further, in step S1, the corpus is chinese wikipedia.
Further, in step S1, the extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, the original data set is processed with the Wikipedia extractor (see the extraction sketch after these steps), removing titles, adjusting the text format and removing blank lines to obtain the original article paragraph data set;
secondly, the original article paragraph data set is converted into simplified-Chinese and traditional-Chinese forms;
and finally, paragraph splitting is performed on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
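A minimal sketch of the first extraction step, assuming the wikiextractor package's command-line interface (the patent names only a "Wikipedia extractor", and the dump filename here is illustrative):

```python
# pip install wikiextractor
import subprocess

# Extract plain-text article paragraphs from a Chinese Wikipedia dump into
# the 'extracted' directory; titles, markup and blank lines are then cleaned
# in the subsequent steps described above.
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor",
     "zhwiki-latest-pages-articles.xml.bz2", "-o", "extracted"],
    check=True,
)
```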
Further, adjusting the text format includes unifying the font size and removing markup symbols such as <doc> and </doc>, while preserving sentence punctuation such as commas, periods and exclamation marks.
Further, the paragraph splitting specifically comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, the sentence is truncated at the threshold and the truncated part is placed on the next line as the next sentence.
Further, the character threshold is 128 characters.
Further, the OpenCC tool is used for the conversion between simplified and traditional Chinese.
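A minimal sketch of the simplified/traditional conversion, assuming the opencc-python-reimplemented binding (the patent names only the OpenCC tool, not a specific interface):

```python
# pip install opencc-python-reimplemented
from opencc import OpenCC

to_simplified = OpenCC('t2s')    # traditional -> simplified
to_traditional = OpenCC('s2t')   # simplified -> traditional

def both_forms(paragraphs):
    """Produce simplified and traditional copies of each extracted paragraph,
    mirroring the second preprocessing step."""
    return ([to_simplified.convert(p) for p in paragraphs],
            [to_traditional.convert(p) for p in paragraphs])

simp, trad = both_forms(["自然語言處理"])
print(simp[0], trad[0])   # 自然语言处理 自然語言處理
```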
The invention has the beneficial effects that:
the invention enhances the semantic representation of the words by constructing a new model, and effectively improves the interpretation capability of the BERT model. The accuracy of the novel N-gram BERT model is higher than that of the original BERT model on the test result, and the matching precision of Chinese semantic similarity is improved. Meanwhile, the method has good generalization capability on other languages such as English, French and the like.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a schematic diagram of the model structure of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
As shown in Fig. 1, an N-gram-based BERT model comprises:
a word splitting unit, which splits a training data sample into a plurality of words;
a word encoding unit, which converts each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenates every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; and encodes the positions of all words in the group in order, where i denotes the word position;
a word concatenation input unit, which takes n words as input, i.e., word i together with the (n-1) words that follow it, for training;
a discarding unit, which sets the discard coefficient to 0.5 and applies a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample.
A semantic segmentation method based on the N-gram BERT model comprises the following steps:
S1, acquiring a training data set:
selecting a corpus as the original data set, taking Chinese Wikipedia as an example; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
The extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, the original data set is processed with the Wikipedia extractor, removing titles, adjusting the text format (unifying the font size, removing markup symbols such as <doc> and </doc>, and preserving sentence punctuation such as commas, periods and exclamation marks) and removing blank lines to obtain the original article paragraph data set; secondly, the original article paragraph data set is converted into simplified-Chinese and traditional-Chinese forms using the OpenCC tool; finally, paragraph splitting is performed on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
The paragraph splitting is as follows: a threshold of 128 characters per training sample, i.e., per line, is set; if any sentence exceeds this threshold, it is truncated at the threshold and the truncated portion is placed on the next line as the next sentence.
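A minimal sketch of this splitting rule; the hard cut at exactly 128 characters follows the description above, and any finer sentence-boundary handling is not specified by the patent:

```python
MAX_CHARS = 128   # character threshold per training sample (one line)

def split_line(sentence, limit=MAX_CHARS):
    """Truncate an over-long sentence at the threshold and carry the
    remainder onto the next line(s) as new sentences."""
    lines = []
    while len(sentence) > limit:
        lines.append(sentence[:limit])
        sentence = sentence[limit:]
    if sentence:
        lines.append(sentence)
    return lines

# A 300-character sentence becomes lines of 128, 128 and 44 characters.
print([len(s) for s in split_line("句" * 300)])   # [128, 128, 44]
```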
S2, establishing an N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained; the stopping criterion is sketched after this step;
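The stopping criterion can be sketched as below. `model.train_step` and `model.segment` are hypothetical interfaces standing in for the units described above, and the 0.9 target is a placeholder, since the patent leaves the preset matching-rate requirement unspecified:

```python
def train_until_match(model, samples, references, target=0.9):
    """Traverse the training samples repeatedly until the fraction of model
    outputs matching the reference segmentations reaches the preset target."""
    while True:
        for sample in samples:
            model.train_step(sample)              # hypothetical parameter update
        matches = sum(model.segment(s) == ref     # hypothetical inference call
                      for s, ref in zip(samples, references))
        if matches / len(samples) >= target:
            return model
```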
S3, performing semantic segmentation using the N-gram BERT model:
for a sentence to be recognized, preprocess it according to step S1 to obtain a plurality of data samples, input these into the N-gram BERT model, and obtain the semantic segmentation result. Comparing BERT with N-gram BERT on the LCQMC data set, the test results show that the new N-gram BERT model achieves higher accuracy than the original BERT model and improves the matching precision of Chinese semantic similarity. The method also generalizes well to other languages such as English and French.
Having described embodiments of the present invention, it should be noted that the foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Claims (8)

1. An N-gram-based BERT model, the model comprising:
a word splitting unit: splitting a training data sample into a plurality of words;
a word encoding unit: converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; and encoding the positions of all words in the group in order, where i denotes the word position;
a word concatenation input unit: taking n words as input, i.e., word i together with the (n-1) words that follow it, for training;
a discarding unit: setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample.
2. A semantic segmentation method based on an N-gram BERT model, applying the N-gram BERT model of claim 1 and comprising the steps of:
S1, acquiring a training data set:
selecting a corpus as the original data set; preprocessing the original data set with an extractor to obtain a plurality of training data samples, which together serve as the training data set for semantic segmentation;
S2, establishing the N-gram BERT model:
splitting a training data sample into a plurality of words; converting each word into a fixed-dimensional vector via Word2Vec encoding to obtain the vector corresponding to each word; concatenating every two sequentially adjacent training data samples into a group, assigning 0 to each word of the first sample and 1 to each word of the second sample; encoding the positions of all words in the group in order, where i denotes the word position;
training with n words as the model input, i.e., word i together with the (n-1) words that follow it; setting the discard coefficient to 0.5 and applying a dropout function to screen and discard part of the training result, obtaining the semantic segmentation result corresponding to the training data sample;
traversing the training data samples in this way until the matching rate between the semantic segmentation results output by the model and the reference segmentations of the corresponding training data samples reaches the preset requirement, at which point training is finished and the N-gram BERT model is obtained;
S3, performing semantic segmentation using the N-gram BERT model:
and for a sentence to be recognized, preprocessing it according to step S1 to obtain a plurality of data samples, inputting these into the N-gram BERT model, and obtaining the semantic segmentation result.
3. The semantic segmentation method based on the N-gram BERT model according to claim 2, wherein in step S1, the corpus is Chinese Wikipedia.
4. The semantic segmentation method based on the N-gram BERT model according to claim 2, wherein in step S1, the extractor is the Wikipedia extractor, and the preprocessing steps are as follows: firstly, processing the original data set with the Wikipedia extractor, removing titles, adjusting the text format and removing blank lines to obtain the original article paragraph data set;
secondly, converting the original article paragraph data set into simplified-Chinese and traditional-Chinese forms;
and finally, performing paragraph splitting on the simplified and traditional article paragraph data sets respectively to form the pre-training data set.
5. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein adjusting the text format comprises: unifying the font size and removing markup symbols.
6. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein the paragraph splitting comprises: setting a character threshold for each training sample, i.e., each line; if any sentence exceeds the character threshold, truncating the sentence at the threshold and placing the truncated part on the next line as the next sentence.
7. The semantic segmentation method based on the N-gram BERT model according to claim 6, wherein the character threshold is 128 characters.
8. The semantic segmentation method based on the N-gram BERT model according to claim 4, wherein the OpenCC tool is used for the conversion between simplified and traditional Chinese.
CN202010230482.5A 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method Pending CN111553147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010230482.5A CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010230482.5A CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Publications (1)

Publication Number Publication Date
CN111553147A true CN111553147A (en) 2020-08-18

Family

ID=71998027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010230482.5A Pending CN111553147A (en) 2020-03-27 2020-03-27 BERT model based on N-gram and semantic segmentation method

Country Status (1)

Country Link
CN (1) CN111553147A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257459A (en) * 2020-10-16 2021-01-22 北京有竹居网络技术有限公司 Language translation model training method, translation method, device and electronic equipment
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104793724A (en) * 2014-01-16 2015-07-22 北京三星通信技术研究有限公司 Sky-writing processing method and device
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN108984724A (en) * 2018-07-10 2018-12-11 凯尔博特信息科技(昆山)有限公司 It indicates to improve particular community emotional semantic classification accuracy rate method using higher-dimension
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
CN110175246A (en) * 2019-04-09 2019-08-27 山东科技大学 A method of extracting notional word from video caption
CN110457701A (en) * 2019-08-08 2019-11-15 南京邮电大学 Dual training method based on interpretation confrontation text

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104793724A (en) * 2014-01-16 2015-07-22 北京三星通信技术研究有限公司 Sky-writing processing method and device
CN108984724A (en) * 2018-07-10 2018-12-11 凯尔博特信息科技(昆山)有限公司 It indicates to improve particular community emotional semantic classification accuracy rate method using higher-dimension
CN108960207A (en) * 2018-08-08 2018-12-07 广东工业大学 A kind of method of image recognition, system and associated component
CN109740406A (en) * 2018-08-16 2019-05-10 大连民族大学 Non-division block letter language of the Manchus word recognition methods and identification network
CN110175246A (en) * 2019-04-09 2019-08-27 山东科技大学 A method of extracting notional word from video caption
CN110457701A (en) * 2019-08-08 2019-11-15 南京邮电大学 Dual training method based on interpretation confrontation text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李莉贞 (Li Lizhen): "Research on spatio-temporal modeling and prediction of wireless network data based on deep learning" (in Chinese) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257459A (en) * 2020-10-16 2021-01-22 北京有竹居网络技术有限公司 Language translation model training method, translation method, device and electronic equipment
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN112749544B (en) * 2020-12-28 2024-04-30 思必驰科技股份有限公司 Training method and system of paragraph segmentation model

Similar Documents

Publication Publication Date Title
CN110096698B (en) Topic-considered machine reading understanding model generation method and system
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN109918681B (en) Chinese character-pinyin-based fusion problem semantic matching method
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
US20240005093A1 (en) Device, method and program for natural language processing
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN115292461B (en) Man-machine interaction learning method and system based on voice recognition
CN111858888A (en) Multi-round dialogue system of check-in scene
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN112417823B (en) Chinese text word order adjustment and word completion method and system
CN112052319B (en) Intelligent customer service method and system based on multi-feature fusion
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
CN112528649A (en) English pinyin identification method and system for multi-language mixed text
CN111553147A (en) BERT model based on N-gram and semantic segmentation method
CN113761883A (en) Text information identification method and device, electronic equipment and storage medium
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN111444720A (en) Named entity recognition method for English text
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN114927177A (en) Medical entity identification method and system fusing Chinese medical field characteristics
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN112183060B (en) Reference resolution method of multi-round dialogue system
Ajees et al. A named entity recognition system for Malayalam using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200818