CN110543639B - English sentence simplification algorithm based on a pre-trained Transformer language model - Google Patents

English sentence simplification algorithm based on a pre-trained Transformer language model Download PDF

Info

Publication number
CN110543639B
CN110543639B (application CN201910863529.9A)
Authority
CN
China
Prior art keywords
word
words
sentence
content
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910863529.9A
Other languages
Chinese (zh)
Other versions
CN110543639A (en)
Inventor
Jipeng Qiang (强继朋)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910863529.9A priority Critical patent/CN110543639B/en
Publication of CN110543639A publication Critical patent/CN110543639A/en
Application granted granted Critical
Publication of CN110543639B publication Critical patent/CN110543639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an English sentence simplification algorithm based on a pre-trained Transformer language model, comprising the following steps: step 1, counting word frequencies over the public Wikipedia corpus; step 2, obtaining vector representations of words from a publicly available pre-trained word embedding model; step 3, preprocessing the sentence to be simplified to obtain its content words; step 4, obtaining a candidate substitution word set for each content word in the sentence using the public pre-trained Transformer language model BERT; step 5, ranking the candidate substitution words of each content word using several features; step 6, comparing the word frequency of the top-ranked candidate with that of the original content word to determine the final substitution; and step 7, processing the remaining content words in the sentence by repeating steps 4 to 6, thereby obtaining the final simplified sentence.

Description

English sentence simplification algorithm based on a pre-trained Transformer language model
Technical Field
The invention relates to the field of English text simplification, in particular to an English sentence simplification algorithm based on a pre-trained Transformer language model.
Background
In recent years, more and more material on the Internet is written in English; for example, many professional papers are published in English-language journals, and many readers now prefer to read English material directly rather than first translating it into Chinese. Many studies have shown that if 90% of the words in a text can be understood, its meaning can be grasped relatively easily, even for long and complex texts.
The lexical simplification algorithm within sentence simplification aims to replace complex words in a sentence with simpler synonyms, which greatly lowers the vocabulary demands placed on the reader. The steps of existing lexical simplification algorithms can roughly be divided into: complex word identification, candidate substitution generation for the complex words, candidate substitution ranking, and candidate substitution selection. According to how the candidate substitutions are generated, lexical simplification algorithms fall broadly into three classes. The first class is dictionary-based and mainly uses a dictionary (such as WordNet) to produce synonyms of the complex word as candidate substitutions. The second class is based on parallel corpora: the most common parallel corpus is built from normal English Wikipedia and its simplified version for children (Simple English Wikipedia); sentences are selected from the two Wikipedias and paired by a matching algorithm, and rules extracted from these parallel sentence pairs are then used to generate candidate substitutions for complex words. The third class is based on word embedding models: vector representations of words are obtained from the embedding model, and a word similarity measure is used to find the words most similar to the complex word as candidate substitutions. The first two classes suffer from serious limitations: building a dictionary is costly, extracting high-quality parallel corpora is very difficult, and both classes cover only a limited range of complex words. The main problem shared by all three classes is that, when generating candidates, only the complex word itself is considered and its context is ignored, so many unsuitable candidates are inevitably produced, which greatly interferes with the later steps of the system.
Disclosure of Invention
The invention aims to overcome the shortcoming of existing lexical simplification algorithms, which generate candidate substitutions using only the complex word itself and ignore its context, and provides an English sentence simplification algorithm based on a pre-trained Transformer language model.
The purpose of the invention is achieved in the following way: an English sentence simplification algorithm based on a pre-trained Transformer language model is carried out according to the following steps:
step 1, counting the frequency f(w) of each word w over the public English Wikipedia corpus D, where f(w) denotes the number of times word w occurs in D;
step 2, acquiring a publicly available word embedding model pre-trained with the fastText word vector model; using this model, the vector representation v_w of a word w can be obtained;
step 3, letting s be the sentence to be simplified; first removing the stop words in s, then segmenting s with a tokenizer and tagging parts of speech, obtaining the set of content words (nouns, verbs, adjectives and adverbs) {w_1, …, w_i, …, w_n}; the initial value of i is 1;
step 4, using the public pre-trained Transformer language model BERT, obtaining the candidate substitution word set CS_i of the content word w_i (1 ≤ i ≤ n) in sentence s;
step 5, ranking the candidate words in CS_i using several features; selecting the top-ranked candidate word c_i by averaging the individual rankings;
step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, selecting c_i as the substitute; otherwise keeping the original content word w_i;
step 7, letting i = i + 1 and repeating steps 4 to 6; once all content words in sentence s have been processed and the chosen substitutions applied, the simplified version of sentence s is obtained.
As a further definition of the present invention, step 4 specifically includes:
step 4.1, obtaining the publicly released pre-trained Transformer language model BERT;
step 4.2, replacing the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denoting the combined sequence S;
step 4.4, tokenizing S with the BertTokenizer of BERT, the resulting token set being denoted T;
step 4.5, then converting T into the corresponding ID features using the BertTokenizer;
step 4.6, obtaining the length len of the set T and defining an array of length len with all values set to 1, called the Mask feature;
step 4.7, defining another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature;
step 4.8, feeding the three features (ID feature, Mask feature and Type feature) to BERT's masked language model (Masked Language Model) and obtaining the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
step 4.9, excluding the original content word w_i and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
As a further definition of the present invention, step 5 specifically includes:
step 5.1, ranking each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; defining a variable all_ranks with the empty set as its initial value; assuming CS_i = {c_1, c_2, …, c_j, …, c_10};
step 5.2, the BERT output feature being the score SC of all words, ordering the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; adding rank_1 to the set all_ranks;
step 5.3, using BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; selecting the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; the initial value of j is 1;
step 5.4, ranking all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; the initial value of j is 1;
step 5.5, ranking all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtaining the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; ordering them by frequency to obtain rank_4, where the largest value ranks first, and so on; adding rank_4 to the set all_ranks;
step 5.6, using the four feature rankings in all_ranks, computing the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and selecting the top-ranked word as the candidate word.
As a further definition of the invention, step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
step 5.3.2, masking each word of W' in turn from front to back and computing the cross-entropy loss of the masked sequence with BERT's masked language model; finally computing loss_j, the average of the cross-entropy loss values over all words of W';
step 5.3.3, letting j = j + 1 and repeating steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
step 5.3.4, ordering all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on.
As a further definition of the invention, step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
step 5.4.2, computing the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
step 5.4.3, letting j = j + 1 and repeating steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
step 5.4.4, ordering all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
Compared with the prior art, the invention has the beneficial effects that:
1. The method streamlines the pipeline: it does not perform complex word identification and only applies simple stop-word removal and tokenization to the sentence to be simplified; every content word (noun, verb, adjective or adverb) in the sentence is treated as a complex word, and candidate substitutions are then generated and selected for each of them. The original word is replaced only if the frequency of the finally selected substitution is higher than that of the original word. This simplifies the lexical simplification steps and improves the efficiency of the model.
2. The invention generates candidate substitution words with the pre-trained Transformer language model BERT. BERT is trained on a massive text corpus with a masked language model (Masked Language Model, MLM). The MLM is optimized by randomly masking a small portion of the words in a sentence and predicting the masked words. For lexical simplification, the MLM masks the complex word, predicts the probability of every vocabulary word at the masked position, and the highest-probability words are selected as candidate substitutions. Unlike existing algorithms, the method does not rely on the complex word alone but generates its candidate substitutions on the basis of the original sentence, so better candidates can be obtained, overcoming the drawback of traditional methods that generate candidates from the complex word only.
3. Because the candidate substitutions generated by BERT already take the context of the complex word into account, the generation step also respects the surrounding language environment; the step of adapting the morphology of the substitution word can be omitted and candidate ranking is made easier, which greatly simplifies the sentence simplification algorithm.
4. The invention selects candidate words using four features: the BERT output, the BERT masked language model, word frequency and semantic similarity. This considers not only the relatedness of the candidate to the complex word and the consistency of the candidate with the original context, but also the simplicity of the candidate, so the most suitable substitution can be found more accurately.
Detailed Description
The invention will be further illustrated with reference to specific examples.
An English sentence simplification algorithm based on a pre-trained Transformer language model is carried out according to the following steps:
Step 1, using the public English Wikipedia corpus D, downloaded from "https://dumps.wikimedia.org/enwiki/", count the frequency f(w) of each word w, where f(w) denotes the number of times word w occurs in D. In the field of text simplification, word frequency is a common measure of word complexity; in general, the higher the frequency of a word, the easier it is to understand. Word frequency can therefore be used to pick the easiest-to-understand word from a set of words highly similar to a word t.
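The frequency table of step 1 can be built with a few lines of Python; the sketch below assumes the Wikipedia dump has already been extracted to a plain-text file (the file name enwiki.txt and the purely alphabetic tokenization are illustrative assumptions, not specified by the embodiment):

import re
from collections import Counter

def build_frequency_table(corpus_path):
    """Count f(w) for every word w in a plain-text corpus D."""
    freq = Counter()
    with open(corpus_path, encoding="utf-8") as fh:
        for line in fh:
            # lower-case the line and keep alphabetic tokens only
            freq.update(re.findall(r"[a-z]+", line.lower()))
    return freq

# usage: f = build_frequency_table("enwiki.txt"); f["simple"] gives f(simple)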
Step 2, obtain a publicly available word embedding model pre-trained with the fastText word vector model, which can be downloaded from "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip"; fastText is an open-source algorithm for training word embedding models, described in the paper "Enriching Word Vectors with Subword Information" by Bojanowski et al., published in 2017. Using this word embedding model, the vector representation v_w of a word w can be obtained, where each vector has 300 dimensions.
Step 3, let s be the sentence to be simplified; first remove the stop words in s, then segment s with a tokenizer and tag parts of speech, obtaining the set of content words (nouns, verbs, adjectives and adverbs) {w_1, …, w_i, …, w_n}; here both the stop-word list and the English tokenization use the nltk package of the Python language; the initial value of i is 1.
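Step 3 can be realized with the nltk package mentioned above; in the sketch below, treating the Penn Treebank tag prefixes NN, VB, JJ and RB as the content-word filter is an assumption about the implementation:

import nltk
from nltk.corpus import stopwords

# one-time resources: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("averaged_perceptron_tagger")

def extract_content_words(sentence):
    """Return the content words (nouns, verbs, adjectives, adverbs) of a sentence."""
    stop = set(stopwords.words("english"))
    tokens = [t for t in nltk.word_tokenize(sentence) if t.lower() not in stop]
    tagged = nltk.pos_tag(tokens)
    return [w for w, tag in tagged if tag.startswith(("NN", "VB", "JJ", "RB"))]

print(extract_content_words("The committee will scrutinize the proposal carefully."))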
Step 4, using the public pre-trained Transformer language model BERT, obtain the candidate substitution word set CS_i of the content word w_i (1 ≤ i ≤ n) in sentence s. BERT is a pre-trained Transformer language model described in the paper "BERT: Pre-training of deep bidirectional transformers for language understanding" by Devlin et al., published in 2018. BERT is trained on a public massive text corpus with a masked language model (Masked Language Model, MLM): the MLM randomly masks a small portion of the words in the text and is optimized by predicting the masked words. For the lexical simplification algorithm in text simplification, the MLM masks the complex word, predicts the probability that each vocabulary word is the masked word, and the highest-probability words are selected as candidate substitutions.
Step 4.1, obtain the publicly released pre-trained Transformer language model BERT; here a PyTorch implementation of the BERT algorithm is chosen, and the pre-trained model BERT-Large Uncased (Whole Word Masking) can be downloaded from https://github.com/google-research/bert;
Step 4.2, replace the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s'; here "[MASK]" is the masking symbol, and the MLM optimizes the BERT model by predicting this symbol and comparing the prediction with the original word;
Step 4.3, concatenate the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denote the combined sequence S; "[CLS]" and "[SEP]" are two special symbols in BERT: "[CLS]" is normally placed at the very front and serves as the classification token, while "[SEP]" serves as a sentence separator. Using both s and s' rather than s' alone has two benefits: first, the influence of the complex word itself on the prediction at the "[MASK]" position is taken into account; second, BERT is good at handling sentence-pair problems, since it is also optimized with next sentence prediction;
Step 4.4, tokenize S with the BertTokenizer provided with BERT, the resulting token set being denoted T;
Step 4.5, convert T into the corresponding ID features using the BertTokenizer, where the ID features are the numerical indices that BERT assigns to each token;
Step 4.6, obtain the length len of the set T and define an array of length len with all values set to 1, called the Mask feature; the Mask feature marks the positions that carry useful information;
Step 4.7, define another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature; the Type feature distinguishes the two sentences;
Step 4.8, feed the three features (ID feature, Mask feature and Type feature) to BERT's masked language model (Masked Language Model) and obtain the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
Step 4.9, excluding the original content word w_i and its morphological derivatives, select the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
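For illustration, steps 4.2 to 4.9 can be reproduced with the Hugging Face transformers port of BERT rather than the code used by the embodiment; in the sketch below, the model identifier, the use of the tokenizer to build the ID, Mask and Type features in a single call, and the simple prefix test used to exclude morphological derivatives of the complex word are all assumptions:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertForMaskedLM.from_pretrained("bert-large-uncased-whole-word-masking")
model.eval()

def candidate_substitutions(sentence, complex_word, k=10):
    """Mask the complex word, feed "[CLS] s [SEP] s' [SEP]" to the masked
    language model and return the k highest-scoring replacement words."""
    masked = sentence.replace(complex_word, tokenizer.mask_token, 1)      # s'
    # the tokenizer builds the ID (input_ids), Mask (attention_mask) and
    # Type (token_type_ids) features for the sentence pair (s, s')
    enc = tokenizer(sentence, masked, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        scores = model(**enc).logits[0, mask_pos]                         # SC
    candidates = []
    for idx in torch.argsort(scores, descending=True):
        word = tokenizer.convert_ids_to_tokens(idx.item())
        # crude filter: drop word pieces and apparent morphological variants
        if word.startswith("##") or word.startswith(complex_word[:4].lower()):
            continue
        candidates.append(word)
        if len(candidates) == k:
            break
    return candidates

print(candidate_substitutions("John composed these verses.", "composed"))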
Step 5, rank the candidate words in CS_i using several features, namely the BERT output, BERT's masked language model, word frequency and semantic similarity; select the top-ranked candidate word c_i by averaging the individual rankings.
Step 5.1, rank each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; define a variable all_ranks with the empty set as its initial value; assume CS_i = {c_1, c_2, …, c_j, …, c_10};
Step 5.2, the BERT output feature is the score SC of all words; order the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; the BERT output feature itself already reflects the relationship between the candidate word, the original complex word and the context; add rank_1 to the set all_ranks;
Step 5.3, use BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; this feature mainly measures how well a candidate fits the context of the complex word; select the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m, where m is set to 5, i.e. at most 5 words before and after the complex word are taken; the initial value of j is 1;
Step 5.3.1, replace the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
Step 5.3.2, mask each word of W' in turn from front to back and compute the cross-entropy loss of the masked sequence with BERT's masked language model; finally compute loss_j, the average of the cross-entropy loss values over all words of W';
Step 5.3.3, let j = j + 1 and repeat steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
Step 5.3.4, order all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on;
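A sketch of steps 5.3.1 to 5.3.4 is given below, reusing the model and tokenizer from the step-4 sketch; masking one whole word at a time (ignoring sub-word pieces) is a simplification of the embodiment:

import torch
import torch.nn.functional as F

def context_loss(window_tokens, model, tokenizer):
    """Mask each word of W' in turn and average the cross-entropy loss that
    the masked language model assigns to the hidden word."""
    losses = []
    for pos, target in enumerate(window_tokens):
        tokens = list(window_tokens)
        tokens[pos] = tokenizer.mask_token
        enc = tokenizer(" ".join(tokens), return_tensors="pt")
        mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
        target_id = tokenizer.convert_tokens_to_ids(target)
        with torch.no_grad():
            logits = model(**enc).logits[0, mask_pos]
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.tensor([target_id])).item())
    return sum(losses) / len(losses)

# rank_2: compute loss_j for each candidate window W' and sort ascending,
# e.g. loss_j = context_loss(left_context + [c_j] + right_context, model, tokenizer)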
Step 5.4, rank all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; a higher similarity means the candidate word is semantically closer to the original complex word; the initial value of j is 1;
Step 5.4.1, obtain the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
Step 5.4.2, compute the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
Step 5.4.3, let j = j + 1 and repeat steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
Step 5.4.4, order all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
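Formula (1) and the ranking of step 5.4.4 amount to the following NumPy sketch, where vectors stands for the fastText model loaded in step 2 (indexable by word, an assumption about the interface):

import numpy as np

def cosine_similarity(v_c, v_w):
    """Formula (1): cosine similarity of a candidate vector and the original word vector."""
    return float(np.dot(v_c, v_w) / (np.linalg.norm(v_c) * np.linalg.norm(v_w)))

def rank_by_similarity(candidates, original, vectors):
    """rank_3: the candidate with the largest similarity value ranks first."""
    sims = {c: cosine_similarity(vectors[c], vectors[original]) for c in candidates}
    return sorted(candidates, key=lambda c: sims[c], reverse=True)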
Step 5.5, rank all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtain the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; order them by frequency to obtain rank_4, where the largest value ranks first, and so on; add rank_4 to the set all_ranks; word frequency is used here because the higher a word's frequency, the more often it is used and the easier it is to understand;
Step 5.6, using the four feature rankings in all_ranks, compute the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and select the top-ranked word as the candidate word c_i.
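The average-rank aggregation of step 5.6 can be sketched as follows; representing each of the four feature rankings as an ordered list of candidates (best first) is an assumption about the data structure:

def average_rank(all_ranks):
    """Average each candidate's position over the feature rankings in all_ranks
    and return the candidates ordered from best to worst average rank."""
    totals = {}
    for ranking in all_ranks:                       # each ranking: best word first
        for position, word in enumerate(ranking, start=1):
            totals[word] = totals.get(word, 0) + position
    return sorted(totals, key=lambda w: totals[w] / len(all_ranks))

# all_ranks = [rank_1, rank_2, rank_3, rank_4]; c_i = average_rank(all_ranks)[0]
# step 6 then keeps c_i only if f(c_i) > f(w_i), otherwise w_i is retained.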
Step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, select c_i as the substitute; otherwise keep the original content word w_i; word frequency is used here once more.
Step 7, let i = i + 1 and repeat steps 4 to 6; once all content words in sentence s have been processed and the chosen substitutions applied, the simplified version of sentence s is obtained. By combining the BERT model pre-trained on large-scale data with several effective features, this embodiment can effectively obtain simpler synonyms of complex words and thus achieve the purpose of sentence simplification.
The invention is not limited to the above embodiments; based on the technical solution disclosed by the invention, a person skilled in the art may, without creative effort, substitute or modify some of its technical features according to the disclosed technical content, and all such substitutions and modifications fall within the protection scope of the invention.

Claims (5)

1. A method for simplifying an English sentence based on a pre-trained Transformer language model, characterized by comprising the following steps:
step 1, counting the frequency f(w) of each word w over the public English Wikipedia corpus D, where f(w) denotes the number of times word w occurs in D;
step 2, acquiring a publicly available word embedding model pre-trained with the fastText word vector model; obtaining the vector representation v_w of a word w using the word embedding model;
step 3, letting s be the sentence to be simplified; first removing the stop words in s, then segmenting s with a tokenizer and tagging parts of speech to obtain the content words, which comprise nouns, verbs, adjectives and adverbs, forming the set {w_1, …, w_i, …, w_n}; the initial value of i is 1;
step 4, obtaining the candidate substitution word set CS_i of the content word w_i in sentence s, where 1 ≤ i ≤ n, using the public pre-trained Transformer language model BERT;
step 5, ranking the candidate words in CS_i using several features; selecting the top-ranked candidate word c_i by averaging the individual rankings;
step 6, if the frequency f(c_i) of candidate word c_i is greater than the frequency f(w_i) of the original content word w_i, selecting candidate word c_i as the substitute; otherwise keeping the original content word w_i;
step 7, letting i = i + 1 and repeating steps 4 to 6; when all content words in sentence s have been processed and the selected substitutions applied, a simplified version of sentence s is obtained.
2. The method of claim 1, wherein step 4 specifically comprises:
step 4.1, obtaining the publicly released pre-trained Transformer language model BERT;
step 4.2, replacing the content word w_i in sentence s with the "[MASK]" symbol, the resulting sentence being denoted s';
step 4.3, concatenating the symbol "[CLS]", the sentence s, the symbol "[SEP]", the sentence s' and the symbol "[SEP]" in order, and denoting the combined sequence S;
step 4.4, tokenizing S with the BertTokenizer of BERT, the resulting token set being denoted T;
step 4.5, then converting T into the corresponding ID features using the BertTokenizer;
step 4.6, obtaining the length len of the set T and defining an array of length len with all values set to 1, called the Mask feature;
step 4.7, defining another array of length len in which the positions before the first "[SEP]" symbol are assigned 0 and the remaining positions are assigned 1, called the Type feature;
step 4.8, feeding the ID feature, the Mask feature and the Type feature to BERT's masked language model and obtaining the scores SC of all vocabulary words at the position of the "[MASK]" symbol;
step 4.9, excluding the original content word w_i and its morphological derivatives, selecting the 10 highest-scoring words from SC as the candidate substitution word set CS_i.
3. The method of claim 1, wherein step 5 specifically comprises:
step 5.1, ranking each candidate substitution in CS_i using four features: the BERT output, a language-model feature, semantic similarity and word frequency; defining a variable all_ranks with the empty set as its initial value; assuming CS_i = {c_1, c_2, …, c_j, …, c_10};
step 5.2, the BERT output feature being the score SC of all words, ordering the words of CS_i by this score, i.e. rank_1 = {1, 2, …, 10}; adding rank_1 to the set all_ranks;
step 5.3, using BERT's masked language model to compute the sequence probability after each word of CS_i replaces the original word w, obtaining rank_2, which is added to the set all_ranks; selecting the context of the content word w from sentence s to form a new sequence W = w_{-m}, …, w_{-1}, w, w_1, …, w_m; the initial value of j is 1;
step 5.4, ranking all words in CS_i by the semantic similarity feature, obtaining rank_3, which is added to the set all_ranks; the initial value of j is 1;
step 5.5, ranking all words in CS_i by the word frequency feature; using the word frequencies obtained in step 1, obtaining the frequencies {f(c_1), f(c_2), …, f(c_10)} of the words in CS_i; ordering them by frequency to obtain rank_4, where the largest value ranks first, and so on; adding rank_4 to the set all_ranks;
step 5.6, using the four feature rankings in all_ranks, computing the average ranking value of each word in CS_i = {c_1, c_2, …, c_j, …, c_10} and selecting the top-ranked word as the candidate word.
4. The method according to claim 3, wherein step 5.3 specifically comprises:
step 5.3.1, replacing the content word w in W with c_j to form a new sequence W' = w_{-m}, …, w_{-1}, c_j, w_1, …, w_m;
step 5.3.2, masking each word of W' in turn from front to back and computing the cross-entropy loss of the masked sequence with BERT's masked language model; finally computing loss_j, the average of the cross-entropy loss values over all words of W';
step 5.3.3, letting j = j + 1 and repeating steps 5.3.1 and 5.3.2 until every word in CS_i has been processed;
step 5.3.4, ordering all loss values {loss_1, loss_2, …, loss_10} to obtain rank_2, where the smallest value ranks first, and so on.
5. The method according to claim 3, wherein step 5.4 specifically comprises:
step 5.4.1, obtaining the vector representations v_{c_j} and v_w of c_j and w from the word vector model;
step 5.4.2, computing the similarity value cos(v_{c_j}, v_w) between v_{c_j} and v_w with the cosine similarity:

cos(v_{c_j}, v_w) = ( Σ_{k=1..g} v_{c_j,k} · v_{w,k} ) / ( sqrt(Σ_{k=1..g} (v_{c_j,k})^2) · sqrt(Σ_{k=1..g} (v_{w,k})^2) )    (1)

where g in formula (1) is the dimensionality of the vectors in the word vector model;
step 5.4.3, letting j = j + 1 and repeating steps 5.4.1 and 5.4.2 until every word in CS_i has been processed;
step 5.4.4, ordering all similarity values {cos_1, cos_2, …, cos_10} to obtain rank_3, where the largest value ranks first, and so on.
CN201910863529.9A 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model Active CN110543639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910863529.9A CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Publications (2)

Publication Number Publication Date
CN110543639A CN110543639A (en) 2019-12-06
CN110543639B true CN110543639B (en) 2023-06-02

Family

ID=68713486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910863529.9A Active CN110543639B (en) 2019-09-12 2019-09-12 English sentence simplification algorithm based on a pre-trained Transformer language model

Country Status (1)

Country Link
CN (1) CN110543639B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949284B (en) * 2019-12-11 2022-11-04 上海大学 Text semantic similarity prediction method based on Transformer model
CN111125350B (en) * 2019-12-17 2023-05-12 传神联合(北京)信息技术有限公司 Method and device for generating LDA topic model based on bilingual parallel corpus
CN111144131B (en) * 2019-12-25 2021-04-30 北京中科研究院 Network rumor detection method based on pre-training language model
CN111611790B (en) * 2020-04-13 2022-09-16 华为技术有限公司 Data processing method and device
CN111651986B (en) * 2020-04-28 2024-04-02 银江技术股份有限公司 Event keyword extraction method, device, equipment and medium
CN111695338A (en) * 2020-04-29 2020-09-22 平安科技(深圳)有限公司 Interview content refining method, device, equipment and medium based on artificial intelligence
CN111444721B (en) * 2020-05-27 2022-09-23 南京大学 Chinese text key information extraction method based on pre-training language model
CN111563166B (en) * 2020-05-28 2024-02-13 浙江学海教育科技有限公司 Pre-training model method for classifying mathematical problems
CN111768001B (en) * 2020-06-30 2024-01-23 平安国际智慧城市科技股份有限公司 Language model training method and device and computer equipment
CN112016319B (en) * 2020-09-08 2023-12-15 平安科技(深圳)有限公司 Pre-training model acquisition and disease entity labeling method, device and storage medium
US11749264B2 (en) 2020-11-03 2023-09-05 Salesforce, Inc. System and methods for training task-oriented dialogue (TOD) language models
CN112528669B (en) * 2020-12-01 2023-08-11 北京百度网讯科技有限公司 Training method and device for multilingual model, electronic equipment and readable storage medium
CN112906372A (en) * 2021-02-20 2021-06-04 北京有竹居网络技术有限公司 Text simplification method, device, equipment and storage medium
CN113177402B (en) * 2021-04-26 2024-03-01 平安科技(深圳)有限公司 Word replacement method, device, electronic equipment and storage medium
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN114330276B (en) * 2022-01-04 2024-06-25 四川新网银行股份有限公司 Deep learning-based short message template generation method and system and electronic device
CN115329784B (en) * 2022-10-12 2023-04-07 之江实验室 Sentence repeat generating system based on pre-training model
CN116227484B (en) * 2023-05-09 2023-07-28 腾讯科技(深圳)有限公司 Model training method, apparatus, device, storage medium and computer program product
CN117556814A (en) * 2023-07-26 2024-02-13 西藏大学 Tibetan word segmentation and part-of-speech tagging integrated method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106257441B (en) * 2016-06-30 2019-03-15 电子科技大学 A kind of training method of the skip language model based on word frequency
CN108509474B (en) * 2017-09-15 2022-01-07 腾讯科技(深圳)有限公司 Synonym expansion method and device for search information

Also Published As

Publication number Publication date
CN110543639A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
CN110543639B (en) English sentence simplification algorithm based on a pre-trained Transformer language model
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN108287822B (en) Chinese similarity problem generation system and method
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
JP3768205B2 (en) Morphological analyzer, morphological analysis method, and morphological analysis program
CN109960804B (en) Method and device for generating topic text sentence vector
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN110413768B (en) Automatic generation method of article titles
CN110164447B (en) Spoken language scoring method and device
CN110347787B (en) Interview method and device based on AI auxiliary interview scene and terminal equipment
Ferreira et al. Zero-shot semantic parser for spoken language understanding.
CN107870901A (en) Similar literary method, program, device and system are generated from translation source original text
JP2006244262A (en) Retrieval system, method and program for answer to question
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN112214989A (en) Chinese sentence simplification method based on BERT
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN114781651A (en) Small sample learning robustness improving method based on contrast learning
CN114579729B (en) FAQ question-answer matching method and system fusing multi-algorithm models
CN117251524A (en) Short text classification method based on multi-strategy fusion
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
WO2022148467A1 (en) Cross-language data enhancement-based word segmentation method and apparatus
Guo et al. Selective text augmentation with word roles for low-resource text classification
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
Smadja et al. Translating collocations for use in bilingual lexicons

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant