Background
At present, the text of a speech is obtained by recording the speaker and converting the recorded speech into text. The resulting text contains many words and sentences that are irrelevant to later reading and display and that degrade the reading experience, so redundant spoken expressions in Chinese speech content need to be removed. The method currently adopted matches redundant vocabulary with rules: redundant words are first collected by statistics, and rules are then written to search for and replace those words, thereby removing the redundant spoken vocabulary.
However, this method still has the following obvious defects. First, it does not support the removal of long redundant sentences, because rules cannot match long redundant sentences. Second, it does not handle wrongly written characters: rules cannot intelligently match redundant spoken words that contain wrongly written characters. Third, forced rule matching causes problems such as disfluent sentences, grammatical errors and incomplete sentence structures: rules can neither match intelligently nor ignore intelligently, so spoken words that belong to a normal sentence are matched and removed rather than filtered out, and removing them directly leads to grammatical errors and incomplete sentences.
Therefore, a solution is needed that removes redundant spoken expressions in Chinese efficiently and intelligently while preserving the grammatical structure and completeness of sentences and without affecting their semantics.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a solution for intelligently removing redundant spoken Chinese based on a combination of a neural network model and rules from natural language processing. The solution addresses spoken redundant expressions in three parts: repeated expressions, modal particles (tone words), and model identification. Specifically, a redundant expression removal method based on the combination of a neural network model and rules comprises redundancy removal by a repeated expression part, a modal particle part and a model identification part. The repeated expression part removes repeated expressions by regular expression replacement. The modal particle part removes modal particles that are identified through part-of-speech tagging. The model identification part handles redundancy other than that covered by the repeated expression part and the modal particle part: it performs redundant word candidate recognition and then redundant word discrimination, in which a language model is used to calculate the perplexity (PPL) of the sentence, to remove the redundancy.
As an improvement, the redundant word candidate recognition comprises recognition by a redundant word list and recognition by a sequence labeling model based on Bi-LSTM + CRF, and the two recognize different redundant words.
As an improvement, the redundant word discrimination decides whether to remove a redundant word candidate by calculating the sentence perplexity (PPL) with a language model. The specific steps are as follows:
Assume a sentence $s = \langle w_1, w_2, \dots, w_i, \dots, w_n \rangle$, wherein $w_i$ is a redundant word candidate. The sentence after removing $w_i$ is $s' = \langle w_1, w_2, \dots, w_{i-1}, w_{i+1}, \dots, w_n \rangle$. The perplexity $PPL(s')$ and the perplexity $PPL(s)$ are calculated using the language model; when $PPL(s') < PPL(s)$, $w_i$ is a redundant word to be removed. The perplexity PPL of a sentence is calculated as follows:
$$PPL = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$$
wherein $P(w_1 w_2 \dots w_N)$ is the probability of the word sequence $w_1, w_2, \dots, w_N$, and $N$ is the number of words.
As an improvement, the language model is an n-gram language model trained on a People's Daily news corpus.
Beneficial effects: compared with the traditional method of removing redundant expressions by rules alone, the redundant expression removal method based on the combination of a neural network model and rules has the following advantages: (1) redundant long sentences can be removed: rules cannot remove longer redundant sentences, whereas the model trained by the method can; (2) the method handles redundant words containing wrongly written characters and can remove words that rules cannot exhaustively list, since traditional rules cannot enumerate all redundant words; (3) the model removes redundancy more intelligently: when removing a redundant expression, it judges whether removing the word would make the sentence disfluent, and does not remove the word if so; compared with rule-based removal, the method is more intelligent and preserves complete semantics.
Detailed Description
The present invention is further described below in conjunction with the embodiments.
The invention adopts a solution for intelligently removing redundant spoken Chinese based on a combination of a neural network model and rules from natural language processing. The method removes spoken redundant expressions in three parts: repeated expressions, modal particles (tone words), and other redundancies.
(1) Repeated expression part
Repetition in spoken language refers to a situation in which one or more words are repeated, either unintentionally while the speaker is thinking or intentionally because the speaker wishes to emphasize the meaning of a word. Such repetitions do not conform to conventional grammar. Normal Chinese reduplicated words, such as "看看" ("take a look") and "红彤彤" ("bright red"), are not targets for removal. For this part, the invention removes repeated expressions directly by regular expression replacement.
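The regular expression replacement described above might be sketched as follows. This is only an illustration under stated assumptions: the pattern, the limit of four characters per repeated unit, and the whitelist `KEEP` of legitimate reduplicated words are all hypothetical choices, since the text does not give the actual expressions used.

```python
import re

# Hypothetical whitelist of legitimate reduplicated words that must be kept.
KEEP = {"看看", "红彤彤"}

# Assumed pattern: a run of 1-4 CJK characters immediately followed by
# one or more exact copies of itself (the shortest repeated unit wins).
REPEAT = re.compile(r"([\u4e00-\u9fff]{1,4}?)\1+")


def remove_repeats(text, keep=KEEP):
    """Collapse immediate repetitions, leaving whitelisted words untouched."""
    if not keep:
        return REPEAT.sub(r"\1", text)
    # Longest whitelisted words first so that they are matched in preference.
    words = sorted(keep, key=len, reverse=True)
    protect = re.compile("(" + "|".join(re.escape(w) for w in words) + ")")
    # re.split with one capturing group alternates between ordinary text
    # (even indices) and whitelisted matches (odd indices).
    parts = protect.split(text)
    return "".join(
        part if index % 2 else REPEAT.sub(r"\1", part)
        for index, part in enumerate(parts)
    )


print(remove_repeats("我们我们今天看看这个这个问题"))
```

A pattern this simple would also collapse ordinary reduplications that are missing from the whitelist, so in practice the whitelist (or a dictionary lookup) would carry much of the burden.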
(2) Modal particle part
A large number of modal particles (tone words) are used in spoken language. Removing them has no influence on the semantics of a sentence and makes the expression more concise. Examples include "o", "hiccup", "kayi", "wool" and "bar". In this patent, modal particles are identified through part-of-speech tagging and then directly removed.
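The part-of-speech-based removal can be sketched as below. The sketch assumes the sentence has already been segmented and tagged by some tagger, and that the tagger marks modal particles with the tag "y", as ICTCLAS-style tagsets do; the tagger itself, the tag name and the example tokens are assumptions, not details given in the text.

```python
from typing import Iterable, Set, Tuple

# Assumption: the part-of-speech tagger labels modal particles with "y".
MODAL_TAGS = {"y"}


def remove_modal_particles(
    tagged: Iterable[Tuple[str, str]],
    modal_tags: Set[str] = MODAL_TAGS,
) -> str:
    """Join the words of a tagged sentence, skipping modal particles."""
    return "".join(word for word, tag in tagged if tag not in modal_tags)


# Hypothetical tagger output as (word, tag) pairs.
tagged_sentence = [("我们", "r"), ("走", "v"), ("吧", "y")]
print(remove_modal_particles(tagged_sentence))
```

Taking pre-tagged pairs keeps the sketch independent of any particular tagger; a segmenter with part-of-speech output could supply the pairs.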
(3) Model identification section
The model identification part handles redundancy phenomena other than those covered by the repeated expression part and the modal particle part. When a speaker pauses or loses the train of thought, the speaker unconsciously inserts filler words in the middle of a sentence to keep the tone and the sentence continuous, and these fillers cause redundancy. Likewise, removing these redundant words has no effect on semantics and makes the expression more compact. For example, in the sentence "the artificial intelligence is now being discussed in large numbers", "the" is a redundant word. The removal of this kind of redundant word is divided into two main steps: 1) redundant word candidate recognition; 2) redundant word discrimination.
1) Redundant word candidate recognition
For a sentence, the invention first identifies which words in the sentence may be redundant. This step consists of two parts. First, a list of common redundant words is used: common redundant words, such as "that", "this", "just being", "then", "just being", "say" and "such", are collected through statistics on texts, and any such word appearing in a sentence is taken as a redundant word candidate. Second, redundant word candidates are identified by a sequence labeling model: because redundant words are difficult to enumerate exhaustively, the invention trains a sequence labeling model (based on Bi-LSTM + CRF) on a manually annotated corpus to identify redundant word candidates that are not in the list.
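The two-part candidate recognition can be sketched as the union of a word-list lookup and the output of a sequence labeler. In this sketch the word list contents, the label scheme (where "O" means "not redundant") and the `labeler` callable are hypothetical; the callable merely stands in for a trained Bi-LSTM + CRF model, which is not implemented here.

```python
from typing import Callable, List, Optional, Sequence, Set

# Hypothetical list of common redundant words collected from texts.
REDUNDANT_WORDS = {"那个", "这个", "就是", "然后"}

# A labeler maps a token sequence to one label per token.
Labeler = Callable[[Sequence[str]], List[str]]


def find_candidates(
    tokens: Sequence[str],
    word_list: Set[str] = REDUNDANT_WORDS,
    labeler: Optional[Labeler] = None,
) -> List[int]:
    """Return the indices of tokens that are redundant word candidates."""
    # Part 1: any token that appears in the word list is a candidate.
    from_list = {i for i, token in enumerate(tokens) if token in word_list}
    # Part 2: tokens the sequence labeling model marks as redundant.
    from_model = set()
    if labeler is not None:
        labels = labeler(tokens)
        from_model = {i for i, label in enumerate(labels) if label != "O"}
    return sorted(from_list | from_model)


def stub_labeler(tokens: Sequence[str]) -> List[str]:
    """Stand-in for a trained model: flags one filler not in the list."""
    return ["R" if token == "怎么说" else "O" for token in tokens]


tokens = ["现在", "这个", "人工智能", "怎么说", "被", "大量", "讨论"]
print(find_candidates(tokens, labeler=stub_labeler))
```

The function returns candidates only; whether each one is actually removed is left to the discrimination step.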
2) Redundant word discrimination
A redundant word candidate is not a redundant word in all cases. For example, in "that is, we must determine the target first", "that is" is redundant; however, in "what we aim at is 'without caries'", the candidate is not redundant, and removing it would make the sentence ungrammatical. Therefore, after redundant word candidates are identified, a further decision is needed to confirm whether each candidate is truly redundant.
The invention performs this step using a language model, which can be used to calculate the probability of a sentence.
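As background for the calculation that follows, an n-gram language model is commonly understood to approximate the sentence probability by conditioning each word on a fixed number of preceding words. The factorization below is the standard textbook form rather than a formula stated in this document, and the symbol $m$ for the n-gram order is introduced here only for illustration:

```latex
P(w_1 w_2 \dots w_N)
  = \prod_{i=1}^{N} P(w_i \mid w_1 \dots w_{i-1})
  \approx \prod_{i=1}^{N} P(w_i \mid w_{i-m+1} \dots w_{i-1})
```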
Assume a sentence:
$s = \langle w_1, w_2, \dots, w_i, \dots, w_n \rangle$, wherein $w_i$ is a redundant word candidate.
Then the sentence after removing this candidate is:
$s' = \langle w_1, w_2, \dots, w_{i-1}, w_{i+1}, \dots, w_n \rangle$.
The perplexity $PPL(s')$ and the perplexity $PPL(s)$ are calculated using the language model; when $PPL(s') < PPL(s)$, $w_i$ is a redundant word to be removed. The perplexity PPL of a sentence is calculated as follows:
$$PPL = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$$
wherein $P(w_1 w_2 \dots w_N)$ is the probability of the word sequence $w_1, w_2, \dots, w_N$, and $N$ is the number of words. The PPL is the perplexity of a sentence and can be obtained through a language model; a lower PPL indicates a more fluent sentence. The language model adopted is an n-gram language model trained on a People's Daily news corpus.
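The discrimination step can be illustrated with a toy example. The sketch below is not the model described in the text: it assumes a bigram model with add-one smoothing trained on a four-sentence hypothetical corpus, purely so that the comparison of PPL(s') with PPL(s) can be run end to end. The class `BigramLM`, the function `is_redundant` and the corpus are all illustrative names and data.

```python
import math
from collections import Counter
from typing import List, Sequence


class BigramLM:
    """Toy bigram language model with add-one smoothing (an assumption)."""

    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.bigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + list(sentence)
            vocab.update(sentence)
            for prev, cur in zip(tokens, tokens[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        # One extra slot stands for any unseen word.
        self.vocab_size = len(vocab) + 1

    def log_prob(self, tokens: Sequence[str]) -> float:
        """Natural-log probability of the token sequence."""
        padded = ["<s>"] + list(tokens)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

    def perplexity(self, tokens: Sequence[str]) -> float:
        """PPL = P(w_1 ... w_N) ** (-1 / N), computed in log space."""
        return math.exp(-self.log_prob(tokens) / len(tokens))


def is_redundant(tokens: List[str], i: int, lm: BigramLM) -> bool:
    """Candidate i is redundant if removing it lowers the perplexity."""
    reduced = tokens[:i] + tokens[i + 1:]
    if not reduced:
        return False
    return lm.perplexity(reduced) < lm.perplexity(tokens)


# Hypothetical training corpus, far smaller than any real news corpus.
corpus = [
    ["我们", "要", "确定", "目标"],
    ["我们", "要", "确定", "方向"],
    ["目标", "就是", "胜利"],
    ["目标", "就是", "成功"],
]
lm = BigramLM(corpus)
print(is_redundant(["就是", "我们", "要", "确定", "目标"], 0, lm))
print(is_redundant(["目标", "就是", "胜利"], 1, lm))
```

On this toy corpus the same candidate word is judged redundant at the start of the first sentence and kept in the second, which mirrors the kind of context-dependent decision the text describes. Working in log space avoids numerical underflow when sentence probabilities become very small.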
The above embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of the present patent shall be subject to the appended claims.