CN110807312A - Redundancy expression removing method based on combination of neural network model and rule - Google Patents

Redundancy expression removing method based on combination of neural network model and rule

Info

Publication number
CN110807312A
CN110807312A (application number CN201910957412.7A)
Authority
CN
China
Prior art keywords
redundant
words
expression
sentence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910957412.7A
Other languages
Chinese (zh)
Inventor
杨理想
张侨
王银瑞
陈振平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyao Intelligent Technology Co ltd
Original Assignee
Nanjing Shixing Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shixing Intelligent Technology Co Ltd
Priority to CN201910957412.7A
Publication of CN110807312A
Legal status: Pending

Links

Images

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method for removing redundant expressions based on a combination of a neural network model and rules, comprising redundancy removal for a repeated-expression part, a modal-particle part, and a model-identification part. Compared with the traditional approach of removing redundant expressions with rules alone, the method has the following advantages: (1) it can remove long redundant sentences, which rules cannot match but the trained model can; (2) it supports removing redundant words that contain wrongly written characters, as well as words that rules cannot exhaustively enumerate, whereas traditional rules cannot list every redundant word; (3) the model removes redundancy more intelligently: when removing candidate redundant words, the trained model judges whether removal would make the sentence disfluent and, if so, keeps the word, so the method preserves complete semantics better than rule-based removal.

Description

Redundancy expression removing method based on combination of neural network model and rule
Technical Field
The invention belongs to the field of natural language processing, and in particular relates to a method for removing redundant expressions based on a combination of a neural network model and rules.
Background
At present, transcripts of speeches are obtained by recording the speaker's voice and converting it to text. Such transcripts contain many words and sentences that are irrelevant to later reading and display and that degrade the reading experience, so the redundant spoken expressions in Chinese transcripts need to be removed. The current approach matches redundant vocabulary with rules: the redundant words are first enumerated, and rules are then written to search for and replace them, thereby removing redundant spoken vocabulary.
However, this approach still has obvious defects: it does not support removing long redundant sentences, which rules cannot match; it does not support removing wrongly written characters, since rules cannot intelligently match redundant spoken words containing typos; and forced rule matching causes disfluent sentences, grammatical errors, and incomplete sentence structures, because rules can neither match nor ignore words intelligently; for spoken words that belong to a normal sentence, rule matching cannot filter intelligently, and removing such words directly causes grammatical errors and incomplete sentences.
Therefore, what is needed is a solution that efficiently and intelligently removes redundant spoken expressions from Chinese text while preserving grammatical structure and sentence completeness and without affecting semantics.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a solution based on natural language processing that combines a neural network model with rules to intelligently remove redundant Chinese spoken expressions. The method addresses spoken redundancy in three parts: repeated expressions, modal particles, and model-identified redundancy. Specifically: a method for removing redundant expressions based on a combination of a neural network model and rules comprises redundancy removal for a repeated-expression part, a modal-particle part, and a model-identification part; the repeated-expression part removes repeated expressions by regular-expression replacement; the modal-particle part removes modal particles identified through part-of-speech tagging; and the model-identification part, which targets redundancy outside the repeated-expression and modal-particle parts, removes redundancy by identifying redundant word candidates and then discriminating them, computing the sentence perplexity PPL with a language model before removal.
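As a hypothetical end-to-end sketch of the three parts just described (all function names and word lists here are illustrative, not from the patent; a hand-written particle set stands in for POS tagging, and the language-model check is passed in as a callback):

```python
MODAL = {"啊", "呢", "吧", "嘛"}           # illustrative modal particles
REDUNDANT_LIST = {"那个", "就是", "然后"}   # illustrative redundant-word list

def remove_redundancy(tokens, removal_lowers_ppl):
    # Stage 1: collapse immediate token repeats ("就是 就是" -> "就是").
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    # Stage 2: drop modal particles (stand-in for POS tagging).
    out = [t for t in out if t not in MODAL]
    # Stage 3: keep a listed candidate unless the language-model check
    # (a callback over the token list and position) says removal
    # lowers sentence perplexity.
    return [t for i, t in enumerate(out)
            if not (t in REDUNDANT_LIST and removal_lowers_ppl(out, i))]
```

With a callback that always approves removal, a sentence with a repeat, a particle, and a filler is reduced to its content words only.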
As an improvement, redundant word candidate identification comprises redundant word-list identification and sequence-labelling model identification based on Bi-LSTM+CRF, and the two identify different sets of redundant words.
As an improvement, redundant word discrimination removes redundant words by computing the sentence perplexity PPL with a language model, specifically as follows:
Assume a sentence s = <w_1, w_2, …, w_i, …, w_n>, where w_i is a redundant word candidate, and let the sentence after removing w_i be s' = <w_1, w_2, …, w_{i-1}, w_{i+1}, …, w_n>. Compute the perplexity PPL(s') with the language model; when PPL(s') < PPL(s), w_i is a redundant word to be removed. The perplexity PPL of a sentence is computed as

PPL(s) = P(w_1 w_2 … w_N)^(-1/N)

where P(w_1, w_2, …, w_N) is the probability of the word sequence w_1, w_2, …, w_N, and N is the number of words.
As an improvement, the language model is an n-gram language model trained on the People's Daily news corpus.
Beneficial effects: compared with the traditional approach of removing redundant expressions with rules alone, the method for removing redundant expressions based on a combination of a neural network model and rules has the following advantages: (1) it can remove long redundant sentences, which rules cannot match but the trained model can; (2) it supports removing redundant words that contain wrongly written characters, as well as words that rules cannot exhaustively enumerate, whereas traditional rules cannot list every redundant word; (3) the model removes redundancy more intelligently: when removing candidate redundant words, the trained model judges whether removal would make the sentence disfluent and, if so, keeps the word, so the method preserves complete semantics better than rule-based removal.
Drawings
FIG. 1 is a flow chart of a method for removing redundancy according to the present invention.
Detailed Description
The invention is described further below with reference to the figure and the embodiments.
The invention adopts a solution based on natural language processing that combines a neural network model with rules to intelligently remove redundant Chinese spoken expressions. The method for removing spoken redundant expressions comprises three main parts: repeated expressions, modal particles, and other redundancy.
(1) Repeated-expression part
Spoken repetition refers to cases where a speaker repeats a word or phrase, whether accidentally in the course of thinking or deliberately to emphasize its meaning. Such repetitions do not conform to conventional grammar; normal Chinese reduplicated words, such as "看看" ("have a look") and "红彤彤" ("bright red"), are not removal targets. For this part, the invention removes repeated expressions directly by regular-expression replacement.
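As an illustrative sketch of regular-expression replacement for this part (the pattern and the four-character limit are assumptions, not from the patent), immediately repeated spans can be collapsed with a backreference. Note that this naive version would also collapse legitimate reduplications such as "看看", which the patent excludes, so a real rule set needs an exception list:

```python
import re

def collapse_repeats(text, max_len=4):
    # Collapse immediate repetitions of a span of 1..max_len characters,
    # e.g. "就是就是我们" -> "就是我们". Longer spans are tried first so that
    # a repeated two-character word collapses as a unit, not per character.
    # NOTE: a real rule set must except legitimate reduplications ("看看").
    for n in range(max_len, 0, -1):
        text = re.sub(r"(.{%d})\1+" % n, r"\1", text)
    return text
```

For example, a doubled two-character word in the middle of a sentence is collapsed to a single occurrence while the rest of the sentence is untouched.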
(2) Modal-particle part
Spoken language often uses many modal particles. Removing them does not affect sentence semantics and makes the expression more concise; examples include particles such as "哦", "呢", and "吧". In this patent, modal particles are identified through part-of-speech tagging and then removed directly.
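A minimal sketch of this step, assuming tokenized input: in practice the patent identifies particles by part-of-speech tagging (modal particles carry the tag `y` in common Chinese tagsets, e.g. in jieba's `posseg`); here a hand-written particle set stands in for the tagger, and the set itself is illustrative:

```python
# Illustrative stand-in for POS-based identification of modal particles.
MODAL_PARTICLES = {"啊", "哦", "呢", "吧", "嘛", "呀"}

def strip_modal_particles(tokens):
    # Drop tokens identified as modal particles; per the method's
    # assumption, this leaves sentence semantics unchanged.
    return [t for t in tokens if t not in MODAL_PARTICLES]
```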
(3) Model-identification part
The model-identification part addresses redundancy phenomena other than repeated expressions and modal particles: when a speaker pauses or the train of thought breaks, filler words are inserted unconsciously in mid-sentence to keep the tone and the sentence continuous, and this produces redundancy. Removing these redundant words likewise has no effect on semantics and makes the expression more compact. For example, in the sentence "everyone is now 那个 discussing artificial intelligence", "那个" is a redundant word. Removal of this part of the redundant words has two main steps: 1) redundant word candidate identification; 2) redundant word discrimination.
1) Redundant word candidate identification
For a sentence, the invention first identifies which of its words may be redundant. This step has two parts. First, a list of common redundant words is used: through statistics over texts, common redundant words are collected, such as "那个" ("that"), "这个" ("this"), "就是" ("that is"), "然后" ("then"), "说" ("say"), and "这样" ("such"); whenever such a word appears in a sentence, it is taken as a redundant word candidate. Second, redundant word candidates are identified by a sequence-labelling model: because redundant words are hard to enumerate exhaustively, the invention trains a sequence-labelling model (based on Bi-LSTM+CRF) on a manually annotated corpus to identify redundant word candidates that are not in the list.
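Step 1) can be sketched as a simple list lookup over a tokenized sentence (the word list here is illustrative); step 2), the Bi-LSTM+CRF labeller, is represented only by its interface, since training it requires the annotated corpus:

```python
# Illustrative common redundant-word list (step 1).
COMMON_REDUNDANT = {"那个", "这个", "就是", "然后", "这样"}

def candidates_from_list(tokens):
    # Every occurrence of a listed word becomes a candidate position.
    return [i for i, t in enumerate(tokens) if t in COMMON_REDUNDANT]

def candidates(tokens, sequence_labeller=None):
    # Merge list hits with positions proposed by a trained Bi-LSTM+CRF
    # sequence labeller (hypothetical callable returning positions).
    positions = set(candidates_from_list(tokens))
    if sequence_labeller is not None:
        positions |= set(sequence_labeller(tokens))
    return sorted(positions)
```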
2) Redundant word discrimination
A redundant word candidate is not a redundant word in every case. For example, in "就是说，我们必须先确定目标" ("that is to say, we must first determine the target"), "就是" is redundant; but in "我们的目标就是没有蛀牙" ("our goal is no cavities"), "就是" is not redundant, and removing it makes the sentence ungrammatical. Therefore, after a redundant word candidate is identified, a further judgment is needed to confirm whether the candidate is truly redundant.
The invention implements this step with a language model, which can be used to compute the probability of a sentence.
Assume a sentence:
s = <w_1, w_2, …, w_i, …, w_n>, where w_i is a redundant word candidate.
The sentence after removing this candidate is:
s' = <w_1, w_2, …, w_{i-1}, w_{i+1}, …, w_n>.
Compute the perplexity PPL(s') of s' and the perplexity PPL(s) of s with the language model; when PPL(s') < PPL(s), w_i is a redundant word to be removed. The perplexity PPL of a sentence is computed as

PPL(s) = P(w_1 w_2 … w_N)^(-1/N)

where P(w_1, w_2, …, w_N) is the probability of the word sequence w_1, w_2, …, w_N, and N is the number of words. PPL, the perplexity of the sentence, is obtained from a language model; a lower PPL indicates a more fluent sentence. The language model adopted is an n-gram language model trained on the People's Daily news corpus.
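Under the assumption of a generic language model exposed as a conditional-probability callback (the n-gram model itself is not reproduced here), the discrimination rule PPL(s') < PPL(s) can be sketched as:

```python
import math

def perplexity(tokens, prob):
    # PPL(s) = P(w1..wN)^(-1/N), computed in log space for stability.
    # `prob(w, history)` is an assumed language-model callback returning
    # P(w | history); an n-gram model would truncate the history inside it.
    log_p = sum(math.log(prob(w, tokens[:i])) for i, w in enumerate(tokens))
    return math.exp(-log_p / len(tokens))

def is_redundant(tokens, i, prob):
    # Candidate w_i is judged redundant iff removing it lowers perplexity.
    reduced = tokens[:i] + tokens[i + 1:]
    return perplexity(reduced, prob) < perplexity(tokens, prob)
```

With a toy unigram callback, removing a low-probability filler lowers the sentence perplexity, so the filler is judged redundant, while removing a content word raises it and the word is kept.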
The above embodiments express only several implementations of the invention, and their description, while specific and detailed, should not be construed as limiting the scope of the invention. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (4)

1. A method for removing redundant expressions based on a combination of a neural network model and rules, characterized in that: the method comprises redundancy removal for a repeated-expression part, a modal-particle part, and a model-identification part; the repeated-expression part removes repeated expressions by regular-expression replacement; the modal-particle part removes modal particles identified through part-of-speech tagging; and the model-identification part, which targets redundancy outside the repeated-expression and modal-particle parts, removes redundancy by redundant word candidate identification followed by redundant word discrimination, in which the sentence perplexity PPL is computed with a language model.
2. The method for removing redundant expressions based on a combination of a neural network model and rules according to claim 1, characterized in that: the redundant word candidate identification comprises redundant word-list identification and sequence-labelling model identification based on Bi-LSTM+CRF, and the two identify different sets of redundant words.
3. The method for removing redundant expressions based on a combination of a neural network model and rules according to claim 1, characterized in that: the redundant word discrimination removes redundant words by computing the sentence perplexity PPL with a language model, specifically as follows:
assume a sentence s = <w_1, w_2, …, w_i, …, w_n>, where w_i is a redundant word candidate, and let the sentence after removing w_i be s' = <w_1, w_2, …, w_{i-1}, w_{i+1}, …, w_n>; compute the perplexity PPL(s') with the language model; when PPL(s') < PPL(s), w_i is a redundant word to be removed; the perplexity PPL of a sentence is computed as

PPL(s) = P(w_1 w_2 … w_N)^(-1/N)

where P(w_1, w_2, …, w_N) is the probability of the word sequence w_1, w_2, …, w_N, and N is the number of words.
4. The method for removing redundant expressions based on a combination of a neural network model and rules according to claim 3, characterized in that: the language model is an n-gram language model trained on the People's Daily news corpus.
CN201910957412.7A 2019-10-10 2019-10-10 Redundancy expression removing method based on combination of neural network model and rule Pending CN110807312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957412.7A CN110807312A (en) 2019-10-10 2019-10-10 Redundancy expression removing method based on combination of neural network model and rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910957412.7A CN110807312A (en) 2019-10-10 2019-10-10 Redundancy expression removing method based on combination of neural network model and rule

Publications (1)

Publication Number Publication Date
CN110807312A true CN110807312A (en) 2020-02-18

Family

ID=69488160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957412.7A Pending CN110807312A (en) 2019-10-10 2019-10-10 Redundancy expression removing method based on combination of neural network model and rule

Country Status (1)

Country Link
CN (1) CN110807312A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767697A (en) * 2020-07-24 2020-10-13 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN113468305A (en) * 2021-06-29 2021-10-01 竹间智能科技(上海)有限公司 Method and device for identifying redundant components of spoken language
US12008332B1 (en) 2023-08-18 2024-06-11 Anzer, Inc. Systems for controllable summarization of content

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101517531A (en) * 2006-09-15 2009-08-26 微软公司 Transformation of modular finite state transducers
CN109461503A (en) * 2018-11-14 2019-03-12 科大讯飞股份有限公司 A kind of cognition appraisal procedure, device, equipment and the readable storage medium storing program for executing of object
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM
CN110147445A (en) * 2019-04-09 2019-08-20 平安科技(深圳)有限公司 Intension recognizing method, device, equipment and storage medium based on text classification
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210312

Address after: 210000 rooms 1201 and 1209, building C, Xingzhi Science Park, Qixia Economic and Technological Development Zone, Nanjing, Jiangsu Province

Applicant after: Nanjing Xingyao Intelligent Technology Co.,Ltd.

Address before: Room 1211, building C, Xingzhi Science Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000

Applicant before: Nanjing Shixing Intelligent Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200218