CN110807312A - Redundancy expression removing method based on combination of neural network model and rule

Info

Publication number: CN110807312A
Application number: CN201910957412.7A
Authority: CN
Inventors: 杨理想; 张侨; 王银瑞; 陈振平
Original assignee: Nanjing Shixing Intelligent Technology Co Ltd
Current assignee: Nanjing Xingyao Intelligent Technology Co ltd
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2020-02-18

Abstract

The invention provides a method for removing redundant expressions based on a combination of a neural network model and rules. The method removes redundancy in three parts: a repeated expression part, a modal particle part and a model identification part. Compared with the traditional approach of removing redundant expressions with rules alone, the method has the following advantages: (1) it can remove long redundant sentences, which rules cannot match but the trained model can; (2) it can remove redundant words that contain wrongly written characters, as well as redundant words that rules cannot enumerate exhaustively; (3) it removes redundancy more intelligently: before removing a candidate word, the model judges whether the removal would make the sentence disfluent and, if so, keeps the word, so the complete semantics are preserved.

Description

Redundancy expression removing method based on combination of neural network model and rule

Technical Field

The invention belongs to the field of natural language processing, and particularly relates to a redundant expression removing method based on combination of a neural network model and rules.

Background

At present, the text of a speech is obtained by recording the speaker and converting the recording to text. The resulting text contains many words and sentences that are irrelevant to later reading and display and that degrade the reading experience, so the redundant spoken expressions in Chinese speech text need to be removed. The current approach matches redundant words with rules: the redundant vocabulary is first collected by statistics, and rules are then written to search for and replace those words, thereby removing the redundant spoken vocabulary.

However, this approach has the following obvious defects. It does not support removing long redundant sentences, because rules cannot match them. It does not support removing wrongly written characters, because rules cannot intelligently match redundant spoken words that contain them. Forced rule matching also causes disfluent sentences, grammatical errors and incomplete sentence structures: rules can neither match intelligently nor skip intelligently, so a spoken word that is a legitimate part of a normal sentence is still matched and removed, which damages the grammar and completeness of the sentence.

Therefore, a solution is needed that removes redundant spoken expressions in Chinese efficiently and intelligently while preserving the grammatical structure and completeness of sentences and without affecting their semantics.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a solution for intelligently removing redundant Chinese spoken expressions based on a combination of a neural network model and natural language processing rules. The solution removes spoken redundancy in three parts: repeated expressions, modal particles and model identification. Specifically, a redundant expression removing method based on the combination of a neural network model and rules comprises redundancy removal in a repeated expression part, a modal particle part and a model identification part. The repeated expression part removes repeated expressions by regular expression replacement. The modal particle part removes modal particles that are identified through part-of-speech tagging. The model identification part handles redundancy outside the repeated expression part and the modal particle part, and removes it through redundant word candidate identification, or through redundant word discrimination in which a language model calculates the perplexity (PPL) of the sentence.

As an improvement, the redundant word candidate identification comprises identification with a redundant word list and identification with a sequence labeling model based on Bi-LSTM + CRF; the two methods identify different redundant words.

As an improvement, the redundant word discrimination decides whether to remove a candidate by calculating the sentence perplexity (PPL) with a language model. The specific steps are as follows:

Assume a sentence $s = \langle w_1, w_2, \ldots, w_i, \ldots, w_n \rangle$, where $w_i$ is a redundant word candidate. The sentence after removing $w_i$ is $s' = \langle w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n \rangle$. The language model is used to calculate the perplexity $\mathrm{PPL}(s')$ of the reduced sentence and the perplexity $\mathrm{PPL}(s)$ of the original sentence. When $\mathrm{PPL}(s') < \mathrm{PPL}(s)$, $w_i$ is a redundant word and is removed. The perplexity PPL of a sentence is calculated as follows:

$$\mathrm{PPL}(s) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

where $P(w_1 w_2 \ldots w_N)$ is the probability of the word sequence $w_1, w_2, \ldots, w_N$, and $N$ is the number of words.

As an improvement, the language model is an n-gram language model trained on the People's Daily news corpus.
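As a worked illustration of how an n-gram model yields the sentence probability used in the perplexity, the standard n-gram factorization can be written out as below. This is a sketch under the usual n-gram independence assumption; the patent states only that an n-gram model is used, so the order $n$, the context notation $w_{k-n+1}^{k-1}$ and the log-space form are assumptions added here for clarity.

$$P(w_1 w_2 \ldots w_N) \approx \prod_{k=1}^{N} P\left(w_k \mid w_{k-n+1}^{k-1}\right)$$

$$\mathrm{PPL}(s) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \log P\left(w_k \mid w_{k-n+1}^{k-1}\right)\right)$$

The log-space form is the one normally evaluated in practice, because multiplying many small probabilities directly underflows for long sentences.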

Advantageous effects: compared with the traditional approach of removing redundant expressions with rules alone, the redundant expression removing method based on the combination of a neural network model and rules has the following advantages: (1) it can remove long redundant sentences, which rules cannot match but the trained model can; (2) it can remove redundant words that contain wrongly written characters, as well as redundant words that rules cannot enumerate exhaustively; (3) it removes redundancy more intelligently: before removing a candidate word, the model judges whether the removal would make the sentence disfluent and, if so, keeps the word, so the complete semantics are preserved.

Drawings

FIG. 1 is a flow chart of a method for removing redundancy according to the present invention.

Detailed Description

The invention is further described below with reference to the drawing and the embodiments.

The invention adopts a solution that intelligently removes redundant Chinese spoken expressions based on a combination of a neural network model and natural language processing rules. The method removes spoken redundancy in three parts: repeated expressions, modal particles and other redundancy.

(1) Repeated expression part

Repetition in spoken language refers to a word or several words being repeated, either unintentionally while the speaker is thinking or deliberately to emphasize a meaning. Such repetitions do not conform to conventional grammar. Normal Chinese reduplicated words, such as "看看" (have a look) and "红彤彤" (bright red), are not targets for removal. For this part, the invention removes the repeated expressions directly by regular expression replacement.
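The regular expression replacement described above can be sketched as follows. This is a minimal illustration, not the patent's actual rule set: the whitelist `KEEP`, the pattern `REPEAT`, the limit of one to four characters per repeated unit and the function name `remove_repeats` are all assumptions introduced for the example.

```python
import re

# Hypothetical whitelist of legitimate reduplicated words that must survive.
KEEP = ("红彤彤", "看看", "常常", "刚刚")

# A unit of 1-4 CJK characters that is immediately repeated at least once.
# The lazy quantifier prefers the shortest repeated unit.
REPEAT = re.compile(r"([\u4e00-\u9fff]{1,4}?)\1+")


def remove_repeats(text: str, keep=KEEP) -> str:
    """Collapse immediate repetitions while protecting whitelisted words."""
    # Shield whitelisted words behind private-use placeholder characters,
    # longest first, so the pattern cannot match inside them.
    shields = {}
    for i, word in enumerate(sorted(keep, key=len, reverse=True)):
        placeholder = chr(0xE000 + i)
        shields[placeholder] = word
        text = text.replace(word, placeholder)
    # Keep a single copy of every repeated unit.
    text = REPEAT.sub(r"\1", text)
    # Restore the shielded words.
    for placeholder, word in shields.items():
        text = text.replace(placeholder, word)
    return text


print(remove_repeats("我我我觉得这个这个方案可以"))
print(remove_repeats("我们看看红彤彤的太阳"))
```

A whitelist is only one way to protect normal reduplicated words. Any legitimate reduplication missing from the list (for example kinship terms) would be collapsed by this sketch, so a production rule set would need a much larger list or a dictionary lookup.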

(2) Modal particle part

Spoken language often contains a large number of modal particles. Removing them does not affect the semantics of a sentence and makes the expression more concise. Examples are 'o', 'hiccup', 'kayi', 'wool' and 'bar'. In this patent, modal particles are identified through part-of-speech tagging and then removed directly.
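The tag-then-filter step can be sketched as below. The example takes already tagged `(word, tag)` pairs as input so that it does not depend on any particular tagger. It assumes a PKU/ICTCLAS-style tag set in which `y` marks a modal particle; the patent does not name a tagger or tag set, and `MODAL_TAGS` and `remove_modal_particles` are hypothetical names.

```python
from typing import Iterable, Tuple

# Assumption: PKU/ICTCLAS-style tags, where "y" denotes a modal particle.
MODAL_TAGS = frozenset({"y"})


def remove_modal_particles(tagged: Iterable[Tuple[str, str]],
                           modal_tags=MODAL_TAGS) -> str:
    """Join the words whose part-of-speech tag is not a modal-particle tag."""
    return "".join(word for word, tag in tagged if tag not in modal_tags)


# Hand-tagged example standing in for the output of a real tagger.
tagged_sentence = [("我们", "r"), ("走", "v"), ("吧", "y")]
print(remove_modal_particles(tagged_sentence))
```

Filtering on the tag rather than on the surface form matters because many particles are homographs of content words, which a plain word list would delete as well.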

(3) Model identification part

The model identification part handles redundancy other than repeated expressions and modal particles. This redundancy arises when a speaker pauses or loses the train of thought and unconsciously fills the middle of a sentence with words to keep the tone and the sentence continuous. Likewise, removing these redundant words does not affect the semantics and makes the expression more concise. For example, in the sentence "the artificial intelligence is now being discussed in large numbers", the word "the" is a redundant word. Removing this kind of redundant word takes two steps: 1) redundant word candidate identification; 2) redundant word discrimination.

1) Redundant word candidate identification

For a sentence, the invention first identifies which words in it may be redundant. This step has two parts. First, a list of common redundant words is used: common redundant words such as 'that', 'this', 'just being', 'then', 'just being', 'say' and 'such' are collected through statistics on texts, and any such word appearing in a sentence is taken as a redundant word candidate. Second, redundant word candidates are identified by a sequence labeling model: because redundant words are difficult to enumerate exhaustively, the invention trains a sequence labeling model (based on Bi-LSTM + CRF) on a manually annotated corpus to identify redundant word candidates that are not in the list.
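How the two candidate sources combine can be sketched as follows. The sketch does not implement a Bi-LSTM + CRF model; it takes any callable that returns one label per token and treats every label other than `O` as a candidate. The word list contents, the `B-R`/`O` label scheme, `find_candidates` and `dummy_tagger` are all hypothetical and serve only to show the union of the two sources.

```python
from typing import Callable, List, Sequence

# Hypothetical list; a real one would be collected from corpus statistics.
REDUNDANT_WORDS = frozenset({"那个", "这个", "就是", "然后", "就是说"})


def find_candidates(tokens: Sequence[str],
                    tagger: Callable[[Sequence[str]], Sequence[str]],
                    word_list=REDUNDANT_WORDS) -> List[int]:
    """Return sorted indices of tokens flagged by the list or by the tagger."""
    tags = tagger(tokens)
    if len(tags) != len(tokens):
        raise ValueError("tagger must return exactly one tag per token")
    from_list = {i for i, tok in enumerate(tokens) if tok in word_list}
    from_model = {i for i, tag in enumerate(tags) if tag != "O"}
    return sorted(from_list | from_model)


def dummy_tagger(tokens: Sequence[str]) -> List[str]:
    """Stand-in for a trained sequence labeller; flags one unlisted filler."""
    return ["B-R" if tok == "的话" else "O" for tok in tokens]


tokens = ["现在", "大家", "的话", "都", "在", "讨论", "这个", "人工智能"]
print(find_candidates(tokens, dummy_tagger))
```

The returned indices are only candidates. Each one still has to pass the discrimination step before the word is actually removed.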

2) Redundant word discrimination

A redundant word candidate is not a redundant word in every case. For example, in "that is, we must determine the target first", "that is" is redundant; however, in "what we aim at is 'no cavities'", the candidate is not redundant, and removing it would make the sentence ungrammatical. Therefore, after a redundant word candidate is identified, a further decision is needed to confirm whether the candidate is truly redundant.

The invention implements this step with a language model, which can be used to calculate the probability of a sentence.

Assume a sentence:

$s = \langle w_1, w_2, \ldots, w_i, \ldots, w_n \rangle$, where $w_i$ is a redundant word candidate.

Then the sentence after removing this candidate is:

$s' = \langle w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n \rangle$.

The language model is used to calculate the perplexity $\mathrm{PPL}(s')$ of the reduced sentence and the perplexity $\mathrm{PPL}(s)$ of the original sentence. When $\mathrm{PPL}(s') < \mathrm{PPL}(s)$, $w_i$ is a redundant word and is removed. The perplexity PPL of a sentence is calculated as follows:

$$\mathrm{PPL}(s) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

where $P(w_1 w_2 \ldots w_N)$ is the probability of the word sequence $w_1, w_2, \ldots, w_N$, and $N$ is the number of words. PPL is the perplexity of a sentence and is obtained from the language model; a lower PPL indicates a more fluent sentence. The language model used is an n-gram language model trained on the People's Daily news corpus.
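The discrimination rule can be sketched with a toy language model. The example below uses a bigram model with add-one smoothing trained on two hand-written sentences; the patent's model is an n-gram model trained on a news corpus, so the model order, the smoothing, the toy corpus and the names `BigramLM` and `is_redundant` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = {"<s>"}
        for sent in corpus:
            padded = ["<s>"] + list(sent)
            vocab.update(padded)
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, prev: str, word: str) -> float:
        num = self.bigram_counts[(prev, word)] + 1
        den = self.context_counts[prev] + self.vocab_size
        return math.log(num / den)

    def perplexity(self, sent: Sequence[str]) -> float:
        """PPL(s) = P(w_1 ... w_N) ** (-1 / N), evaluated in log space."""
        if not sent:
            raise ValueError("perplexity is undefined for an empty sentence")
        padded = ["<s>"] + list(sent)
        total = sum(self.log_prob(p, w) for p, w in zip(padded, padded[1:]))
        return math.exp(-total / len(sent))


def is_redundant(lm: BigramLM, sent: List[str], i: int) -> bool:
    """Candidate w_i is redundant when removing it lowers the perplexity."""
    if len(sent) < 2:
        return False
    reduced = sent[:i] + sent[i + 1:]
    return lm.perplexity(reduced) < lm.perplexity(sent)


# Hypothetical two-sentence training corpus (already word-segmented).
corpus = [["我们", "必须", "先", "确定", "目标"],
          ["我们", "的", "目标", "就是", "成功"]]
lm = BigramLM(corpus)

print(is_redundant(lm, ["就是", "我们", "必须", "先", "确定", "目标"], 0))  # True
print(is_redundant(lm, ["我们", "的", "目标", "就是", "成功"], 3))  # False
```

The same candidate word is removed in the first sentence and kept in the second, which is the behavior that rule matching alone cannot provide. Because the exponent is normalized by the sentence length, sentences of different lengths remain comparable.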

The embodiments above describe only several implementations of the invention. Although the description is specific and detailed, it should not be construed as limiting the scope of the invention. A person skilled in the art can make variations and modifications without departing from the inventive concept, and all of these fall within the scope of protection of the invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A redundant expression removing method based on the combination of a neural network model and rules, characterized in that: the method comprises redundancy removal in a repeated expression part, a modal particle part and a model identification part; the repeated expression part removes repeated expressions by regular expression replacement; the modal particle part removes modal particles that are identified through part-of-speech tagging; the model identification part handles redundancy outside the repeated expression part and the modal particle part, and removes it through redundant word candidate identification, or through redundant word discrimination in which a language model calculates the perplexity (PPL) of the sentence.

2. The redundant expression removing method based on the combination of a neural network model and rules of claim 1, wherein: the redundant word candidate identification comprises identification with a redundant word list and identification with a sequence labeling model based on Bi-LSTM + CRF, and the two methods identify different redundant words.

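3. The redundant expression removing method based on the combination of a neural network model and rules of claim 1, wherein: the redundant word discrimination decides whether to remove a candidate by calculating the sentence perplexity (PPL) with a language model, and comprises the following specific steps:

assume a sentence $s = \langle w_1, w_2, \ldots, w_i, \ldots, w_n \rangle$, where $w_i$ is a redundant word candidate; the sentence after removing $w_i$ is $s' = \langle w_1, w_2, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n \rangle$; the language model is used to calculate the perplexity $\mathrm{PPL}(s')$ and the perplexity $\mathrm{PPL}(s)$; when $\mathrm{PPL}(s') < \mathrm{PPL}(s)$, $w_i$ is a redundant word and is removed; the perplexity PPL of a sentence is calculated as follows:

$$\mathrm{PPL}(s) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

wherein $P(w_1 w_2 \ldots w_N)$ is the probability of the word sequence $w_1, w_2, \ldots, w_N$, and $N$ is the number of words.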

4. The redundant expression removing method based on the combination of a neural network model and rules of claim 3, wherein: the language model is an n-gram language model trained on the People's Daily news corpus.