CN112257416A

CN112257416A - Inspection new word discovery method and system

Info

Publication number: CN112257416A
Application number: CN202011175920.9A
Authority: CN
Inventors: 赵郭燚; 王宗伟; 苏媛; 卜晓阳; 姜冬; 魏冰; 胡方坤; 任东英
Original assignee: Beijing Dataocean Smart Technology Co ltd; State Grid Co ltd Customer Service Center
Current assignee: Beijing Dataocean Smart Technology Co ltd; State Grid Co ltd Customer Service Center
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2021-01-22

Abstract

The invention relates to an inspection new word discovery method comprising the following steps: first, segmenting text with an n-gram model and filtering out candidate words whose word frequency falls below a threshold; then calculating the mutual information and the left and right adjacent entropies of the candidate words, extracting their part-of-speech combination features, and building a random forest model that is trained and tested on these feature indexes to ensure the accuracy of the new words; and finally, after part-of-speech filtering, introducing a Bloom filter algorithm to improve matching efficiency and outputting the new word discovery result. The scheme helps analysts quickly and accurately find new words appearing in inspection work orders, build a basic word bank covering all inspection specialties, support classification and recognition of work order texts, and improve the center's capability to analyze inspection work orders.

Description

Inspection new word discovery method and system

Technical Field

The invention relates to the technical field of business hall inspection, in particular to a method and a system for discovering new words in inspection.

Background

In the prior art, marketing inspection in the power industry handles more than 10 thousand work orders per month on average, and many new services exist. The prior art cannot identify new inspection words and does not automatically analyze and judge the work orders of the new services, which affects working efficiency and inspection quality.

Words in the power industry are highly specialized and domain-specific, and new words emerge continuously as the business develops. If new words are discovered through statistical features alone, their accuracy cannot be ensured; on the other hand, if a single threshold value is set manually, correlations are ignored and the correctness of a new word cannot be ensured either.

For new word discovery in the power industry, the prior art generally combines a statistics-based discovery method with manually set criteria for part-of-speech features. It cannot extract part-of-speech features from the actual word relationships found in power-industry text, and it provides no scientific basis for judging those features.

Most existing new word discovery methods rely on statistical features, but the power industry has a large professional vocabulary, and a single method gives low accuracy. A word is a language unit that can stand on its own, and its characters are correlated with one another; judging mutual information and adjacent entropy only against manually set values lacks scientific rigor and objectivity. Meanwhile, the vocabulary of the power industry is huge, and although many methods exist for removing duplicate words, their time and space costs are difficult to control.

Therefore, for new word discovery in the power industry or construction of a new word discovery method, no effective solution is available in the industry at present, and a new solution capable of solving the problem of new word discovery in the industry is urgently needed.

Disclosure of Invention

The invention provides an inspection new word discovery method and an inspection new word discovery system, which solve the prior-art problems that new words in the power industry are discovered late and incompletely and are misjudged or missed.

According to one aspect of the invention, an inspection new word discovery method is provided, which comprises the following steps:

segmenting words of the sentence by using an n-gram Chinese language model;

filtering the word segmentation result according to the word frequency to obtain candidate words;

performing feature extraction on the candidate words to obtain feature indexes of the candidate words;

constructing a word segmentation model by using a random forest algorithm according to the characteristic indexes of the candidate words and training;

performing part-of-speech filtering on the word segmentation result of the word segmentation model to obtain alternative words;

and comparing the alternative words with the dictionary according to the Bloom filter algorithm, and filtering out existing segmented words and stop words to obtain inspection new words.

The method further comprises the following steps:

inputting the filtered alternative words into a model dictionary;

and establishing a new word discovery model according to the model dictionary, and performing word segmentation processing on the sentence to be segmented by using the new word discovery model.

The filtering the word segmentation result according to the word frequency comprises the following steps:

acquiring the word frequency of each word in the word segmentation result;

and setting a word frequency threshold, and removing words whose word frequency is lower than the threshold.

The characteristic indexes of the candidate words comprise:

the mutual information, the left and right adjacent entropies, the word frequency, the part of speech, and the difference between the left and right adjacent entropies of the candidate words.

The part-of-speech filtering of the word segmentation result of the word segmentation model comprises the following steps:

obtaining a word segmentation result of the word segmentation model;

and eliminating word segmentation results that contain adverbs or prepositions.

Comparing the alternative words with a dictionary according to the Bloom filter algorithm comprises:

according to the Bloom filter algorithm, comparing the alternative words with a general dictionary, a basic dictionary and a stop-word dictionary respectively, and filtering out stop words;

and comparing the alternative words with a prior model dictionary, and filtering out the alternative words existing in the model dictionary.

According to another aspect of the present invention, there is provided an inspection new word discovery system, the system comprising:

the word segmentation unit is used for segmenting words of the sentences by using the n-gram Chinese language model;

the word frequency filtering unit is used for filtering the word segmentation result according to the word frequency to obtain candidate words;

the characteristic extraction unit is used for extracting the characteristics of the candidate words to obtain characteristic indexes of the candidate words;

the model training unit is used for constructing a word segmentation model by utilizing a random forest algorithm according to the feature indexes of the candidate words and training the word segmentation model;

the part-of-speech filtering unit is used for performing part-of-speech filtering on the word segmentation result of the word segmentation model to obtain alternative words;

and the detection output unit is used for comparing the alternative words with the dictionary according to the Bloom filter algorithm, filtering out existing segmented words and stop words and obtaining inspection new words.

The system further comprises:

and the model dictionary unit is used for acquiring the filtered alternative words, continuously updating a word bank and establishing a model dictionary with existing word segmentation results.

The system further comprises:

and the new word processing unit is used for establishing a new word discovery model according to the model dictionary and performing word segmentation processing on the sentence to be segmented by using the new word discovery model.

The system further comprises:

and the characteristic index unit is used for storing, as characteristic indexes, the calculated mutual information, left and right adjacent entropies, word frequency and part of speech of the candidate words, and the difference between the left and right adjacent entropies.

The beneficial effects of adopting the above scheme are as follows:

The method first performs word segmentation with an n-gram model and filters out candidate words whose word frequency falls below a threshold; it then calculates the mutual information and the left and right adjacent entropies of the candidate words, extracts their part-of-speech combination features, and builds a random forest model that is trained and tested on these feature indexes to ensure the accuracy of the new words; finally, after part-of-speech filtering, it introduces a Bloom filter algorithm to improve matching efficiency and outputs the new word discovery result. The scheme helps analysts quickly and accurately find new words appearing in inspection work orders, build a basic word bank covering all inspection specialties, support classification and recognition of work order texts, and improve the center's capability to analyze inspection work orders.

Based on the actual situation of the power industry, the part-of-speech combination features of candidate words are extracted with a method combining rules and statistics; the part-of-speech feature indexes are trained with a random forest algorithm to ensure the correctness of new word discovery; a Bloom filter algorithm is introduced to improve the execution efficiency of filtering; and the three algorithms, n-gram, random forest and Bloom filter, are combined to form the inspection new word discovery method.

On the business side, the multi-algorithm inspection new word discovery method lays a foundation for building a basic word bank covering all inspection specialties, supports the marketing inspection business in analyzing and judging work orders with related technologies such as natural language processing, and improves working efficiency. On the data processing side, combining multiple algorithms ensures that each data processing step is scientifically sound and effective, and ensures the accuracy of new word discovery.

Drawings

FIG. 1 is a schematic flow chart of the inspection new word discovery method according to the present invention.

FIG. 2 is a schematic structural diagram of the inspection new word discovery system according to the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

Rule-based new word discovery finds new words by using part-of-speech features, linguistic word formation rules and similar knowledge. Its accuracy is high, but its extensibility and flexibility are poor, and it consumes a large amount of manpower and material resources.

Statistics-based new word discovery identifies new words by calculating, over a large experimental corpus, statistical features such as word frequency, word formation probability, left and right adjacent entropy, and the number of adjacent variations. The statistics-based method is flexible, not limited to a particular field, easy to extend and portable, but it suffers from data sparsity and low accuracy.

In each embodiment of the invention, starting from actual requirements and the actual situation of marketing inspection work orders, features are extracted with a method combining rules and statistics after n-gram word segmentation, the feature indexes are trained with a random forest algorithm, and part-of-speech filtering and general dictionary filtering are completed last. The work comprises four parts. First, n-gram word segmentation and word frequency filtering are carried out, and candidate words with low word frequency are filtered out. Second, mutual information and left and right adjacent entropies are calculated for the candidate words, and their part-of-speech combination features are extracted, which avoids reliance on a single method and ensures the accuracy and efficiency of new word discovery. Third, a random forest algorithm is selected to build a model for training and testing, which ensures the correctness of the part-of-speech combination features of the candidate words and lays a good corpus foundation for classifying and recognizing work order text. Fourth, part-of-speech filtering is performed and a Bloom filter algorithm is introduced to filter against the general dictionary, which improves matching efficiency. Through the combined application of these algorithm models, new word discovery for inspection work orders can be completed accurately and efficiently.

The invention is further described below with reference to the accompanying drawings.

As shown in fig. 1, a schematic flow chart of an inspection new word discovery method provided in embodiment 1 of the present invention is specifically as follows:

Step 11, segmenting the sentence by using the n-gram Chinese language model.

An n-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, it is referred to as a Chinese Language Model (CLM). When continuous, space-free pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (that is, a sentence), the Chinese language model uses collocation information between adjacent words in the context to compute the most probable sentence. This achieves automatic conversion to Chinese characters without manual selection by the user and avoids the ambiguity that arises when many Chinese characters correspond to the same pinyin (or stroke string or digit string).

An n-gram is an algorithm based on a statistical language model. The basic idea is to slide a window of size n over the text byte by byte, forming a sequence of byte fragments of length n. Each fragment is called a gram. The occurrence frequency of every gram is counted, and the grams are filtered against a preset threshold to form a key gram list, which is the vector feature space of the text; each gram in the list is one dimension of the feature vector.
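The sliding-window idea above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `char_ngrams` and `key_gram_list`, the sample strings, and the choice to slide over characters rather than raw bytes (more natural for Chinese text in Python) are all assumptions made here.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> list[str]:
    """Slide a window of size n over the text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def key_gram_list(texts: list[str], n: int, min_count: int) -> list[str]:
    """Count every gram across all texts and keep those that meet the threshold."""
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    return sorted(g for g, c in counts.items() if c >= min_count)

# Hypothetical work-order fragments, used only for illustration.
texts = ["电价执行检查", "电价执行异常"]
grams = key_gram_list(texts, n=2, min_count=2)
print(grams)
```

Only the grams shared by both sample strings survive the threshold; each surviving gram would be one dimension of the feature space described above.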

The model is based on the assumption that the occurrence of the nth word depends only on the preceding n-1 words and not on any other words, so that the probability of a complete sentence is the product of the conditional probabilities of its words. These probabilities can be obtained by counting, directly from the corpus, how many times the n words occur together. The bigram (n = 2) and the trigram (n = 3) are the most commonly used.
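As a hedged illustration of this assumption, the sketch below estimates bigram probabilities by counting and multiplies them to score a sentence. The toy corpus, the `<s>` start marker, and the function names are assumptions introduced here; no smoothing is applied, so any unseen pair gives probability zero.

```python
from collections import Counter

def train_bigram(corpus: list[list[str]]):
    """Count how often each token appears as a context and each adjacent pair occurs."""
    contexts, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent
        contexts.update(tokens[:-1])             # tokens that are followed by another token
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent (previous, current) pairs
    return contexts, bigrams

def sentence_prob(sent: list[str], contexts: Counter, bigrams: Counter) -> float:
    """P(sentence) as the product of count(prev, cur) / count(prev) over adjacent pairs."""
    tokens = ["<s>"] + sent
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

# Toy corpus of pre-segmented sentences (illustrative only).
corpus = [["power", "price", "execution"],
          ["power", "price", "check"],
          ["power", "supply"]]
contexts, bigrams = train_bigram(corpus)
print(sentence_prob(["power", "price", "execution"], contexts, bigrams))
```

For the first toy sentence the product is 3/3 × 2/3 × 1/2. A segmenter built on this model would prefer the segmentation whose token sequence has the highest such probability.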

In this embodiment, the sentences to be segmented are segmented based on the n-gram model to obtain a preliminary word segmentation result, that is, the whole sentence is split into individual words. In subsequent processing, these words can be merged, removed and de-duplicated to obtain the final word segmentation result.

Step 12, filtering the word segmentation result according to the word frequency to obtain candidate words.

The word segmentation result of the previous step is filtered according to word frequency. The word frequency is the frequency with which a word occurs. If a word occurs frequently, it is in common use; otherwise, it is not a common word. Words with a low frequency of occurrence can be removed.

In general, a word frequency threshold is set based on experience. Words whose frequency exceeds the threshold are common and need to be retained; words whose frequency is below the threshold are uncommon and can be removed.

Typically, the word frequency of each word in the word segmentation result is obtained, a word frequency threshold is set, and words whose frequency is lower than the threshold are removed.
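A minimal sketch of this frequency filter is shown below. The function name `filter_by_frequency`, the sample segments, and the threshold value are illustrative assumptions; the source only says the threshold is set from experience.

```python
from collections import Counter

def filter_by_frequency(segments: list[str], min_freq: int) -> dict[str, int]:
    """Keep words whose frequency is not lower than min_freq; drop the rest."""
    counts = Counter(segments)
    return {word: freq for word, freq in counts.items() if freq >= min_freq}

# Hypothetical output of the segmentation step.
segments = ["电价", "电价", "执行", "电价", "执行", "异常"]
candidates = filter_by_frequency(segments, min_freq=2)
print(candidates)
```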

Step 13, performing feature extraction on the candidate words to obtain feature indexes of the candidate words.

Feature extraction summarizes and organizes the information of each dimension of a candidate word. The feature indexes cover all important dimensions of the candidate word. In this embodiment, indexes of several dimensions are selected as candidate word features for subsequent processing.

In this embodiment, mutual information and left and right adjacent entropies are calculated for the candidate words, the candidate words are segmented to extract their part-of-speech combination features, and a feature system is finally formed that includes word frequency, part of speech, mutual information, left adjacent entropy, right adjacent entropy, and the difference between left and right adjacent entropies.

The adjacency entropy (BE) is a commonly used measure for determining the left and right boundaries of a new word. It measures the uncertainty of the characters adjacent to a candidate new word on the left and on the right: the greater the uncertainty, the more information the adjacent characters carry, and the higher the probability that the candidate is a word.
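One common way to compute this measure is the Shannon entropy of the distribution of characters that appear immediately to the left (or right) of the candidate. The sketch below follows that definition; the helper names and the toy corpus are assumptions, and the source does not state the exact formula it uses.

```python
import math
from collections import Counter

def _entropy(counter: Counter) -> float:
    """Shannon entropy (natural log) of the distribution described by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())

def adjacent_entropies(corpus: str, word: str) -> tuple[float, float]:
    """Return (left entropy, right entropy) of the characters adjacent to `word`."""
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)
    return _entropy(left), _entropy(right)

# Toy corpus: "XY" has three different left neighbours but always the same right neighbour.
left_h, right_h = adjacent_entropies("aXYbcXYbdXYb", "XY")
entropy_gap = abs(left_h - right_h)  # the left-right adjacent entropy difference feature
print(left_h, right_h, entropy_gap)
```

In the toy corpus the left side is maximally varied while the right side is fixed, so the right boundary of "XY" is probably not a true word boundary.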

The characters of a word are correlated with one another, so the stronger the correlation between characters or between sub-words, the higher the probability that they form a word. Mutual information measures the degree of mutual dependence between two objects: the larger the mutual information value, the stronger the dependence between them. Mutual information can therefore be used to calculate the internal word formation probability of a new word.
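A hedged sketch of this feature using pointwise mutual information, log(P(xy) / (P(x) P(y))), is given below. Taking the minimum over all two-way splits of the candidate is one common variant and is an assumption here, as are the function names and the way probabilities are estimated from substring counts.

```python
import math

def count_occurrences(corpus: str, s: str) -> int:
    """Count occurrences of s in corpus, allowing overlaps."""
    count, start = 0, corpus.find(s)
    while start != -1:
        count += 1
        start = corpus.find(s, start + 1)
    return count

def prob(corpus: str, s: str) -> float:
    """Relative frequency of s among all windows of the same length."""
    windows = len(corpus) - len(s) + 1
    return count_occurrences(corpus, s) / windows if windows > 0 else 0.0

def min_pmi(corpus: str, word: str) -> float:
    """Lowest PMI over every split of `word` into a left part and a right part."""
    p_word = prob(corpus, word)
    if len(word) < 2 or p_word == 0.0:
        return float("-inf")
    scores = []
    for i in range(1, len(word)):
        p_left, p_right = prob(corpus, word[:i]), prob(corpus, word[i:])
        scores.append(math.log(p_word / (p_left * p_right)))
    return min(scores)

print(min_pmi("abab", "ab"))
```

A higher score means the parts occur together more often than chance would predict, which supports treating the candidate as a single word.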

Step 14, constructing a word segmentation model by using a random forest algorithm according to the feature indexes of the candidate words and training the model.

The random forest algorithm is a supervised learning algorithm. It builds a forest and introduces randomness into its construction. The forest is an ensemble of decision trees, usually trained with the bagging method. Bagging, that is, bootstrap aggregating, draws training data at random with replacement, builds a classifier on each sample, and finally combines the learned models to improve the overall result.

The random forest algorithm builds multiple decision trees and merges them to obtain a more accurate and stable prediction. One advantage of random forests is that they can be used for both classification and regression problems.

In this embodiment, a random forest algorithm is used for classification. Specifically, a random forest algorithm is utilized to classify the characteristic indexes of each candidate word, respective branches are established, and a word segmentation model is established.

Specific sentences are then input into the established word segmentation model, the resulting segmentation is fed back to train the model, and the model is refined step by step to obtain the trained word segmentation model.
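The bagging-and-voting idea behind this step can be sketched in plain Python as below. This is a deliberately tiny stand-in, not the model used in the patent: each "tree" is a depth-1 stump rather than a full decision tree, the feature values are made up, the part-of-speech feature is omitted, and all names and parameters are assumptions.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in thresholds:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random subset of the features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        feats = rng.sample(range(n_features), k)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Made-up features: [mutual information, left entropy, right entropy]; label 1 = real word.
X = [[5.1, 1.9, 2.0], [4.8, 2.2, 1.8], [5.5, 2.0, 2.1], [4.9, 1.8, 1.9],
     [1.0, 0.3, 0.2], [1.2, 0.1, 0.4], [0.8, 0.2, 0.3], [1.1, 0.4, 0.1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = fit_forest(X, y)
print(predict(forest, [5.0, 2.0, 2.0]), predict(forest, [1.0, 0.2, 0.2]))
```

In practice a library implementation with full decision trees would normally be used; the sketch only shows where the bootstrap sampling, the random feature choice and the vote fit together.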

For example, in this embodiment, statistical features such as word frequency, part-of-speech combination, pointwise mutual information, left and right adjacent entropy, and adjacent entropy dispersion are extracted from 1255 training work orders and 5083 sample words of the "checking and accepting-billing accounting-fictitious household" category, and multiple rounds of parameter tuning are performed with the random forest algorithm. When the new word discovery model was tested on 659 work orders of the "power price-power price execution" category, the precision was 87.71%, the recall was 74.43%, and the F1 value was 0.80.
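As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two percentages quoted above; the function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.8771, 0.7443)
print(round(f1, 4))
```

The result is approximately 0.805, which agrees with the reported F1 value of 0.80.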

Step 15, performing part-of-speech filtering on the word segmentation result of the word segmentation model to obtain alternative words.

The word segmentation result obtained from the word segmentation model is processed further: words that contain adverbs, prepositions and the like, and therefore cannot carry meaning on their own, are removed to obtain the alternative words.

The part-of-speech filtering rules can be designed manually so that candidate words containing adverbs, prepositions and the like are filtered out. That is, filtering adverbs and prepositions is not mandatory; the rules can be set by the user as required.
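A configurable rule of this kind might look like the sketch below. It assumes each candidate arrives with the part-of-speech tags of its component tokens already attached, and it assumes the tag letters "d" (adverb) and "p" (preposition) used by several Chinese tag sets; the function name and sample data are hypothetical.

```python
def pos_filter(tagged_candidates, blocked_tags=frozenset({"d", "p"})):
    """Keep candidates none of whose component tags is in blocked_tags."""
    kept = []
    for word, tags in tagged_candidates:
        if not blocked_tags.intersection(tags):
            kept.append(word)
    return kept

# Hypothetical candidates with the tags of their component tokens.
tagged = [("电价执行", ["n", "v"]),   # noun + verb: kept
          ("非常电价", ["d", "n"]),   # contains an adverb: removed
          ("在营业厅", ["p", "n"])]   # contains a preposition: removed
alternatives = pos_filter(tagged)
print(alternatives)
```

Because `blocked_tags` is a parameter, a user can block other parts of speech or none at all, which matches the statement that the rule is user-configurable.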

Step 16, comparing the alternative words with a dictionary according to a Bloom filter algorithm, and filtering out existing segmented words and stop words to obtain inspection new words.

A Bloom filter consists of a very long binary vector (bit vector) and a series of random mapping functions, and can be used to test whether an element is in a set. Its advantages are space efficiency and query time far better than those of ordinary algorithms; its drawback is a certain false-positive rate. Therefore, in applications that can tolerate a low error rate, the Bloom filter trades a very small number of errors for a great saving of storage space.

The principle of the Bloom filter is as follows: when an element is added to the set, K hash functions map it to K positions in a bit array, and those positions are set to 1. To query an element, it is enough to check whether its K positions are all 1: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.
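The principle above can be sketched in a few lines. The class below is an illustrative assumption, not the patent's implementation: the bit-array size, the number of hash functions, and the use of seeded SHA-256 digests as the K hash functions are choices made here for simplicity.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # K; must stay below 256 for the seed byte used below
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        """Map the item to K bit positions, one per seeded hash."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        """Set the K positions of the item to 1."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        """True if all K positions are 1 (probably present); False means definitely absent."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

stop_words = BloomFilter()
for w in ["的", "了", "在"]:
    stop_words.add(w)
print("的" in stop_words, "电价" in stop_words)
```

An added element is always reported as present. An element that was never added is usually reported as absent, but a false positive remains possible, which is the error the filter accepts in exchange for its small memory footprint.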

In this embodiment, the Bloom filter algorithm is used to check the alternative words against various dictionaries, that is, to detect whether an alternative word is in the corresponding dictionary. Two cases are detected: whether the alternative word is in the stop-word dictionary, and whether it is in the model dictionary.

In fact, in this embodiment, the filtered candidate words may be input into a model dictionary; and establishing a new word discovery model according to the model dictionary, and performing word segmentation processing on the sentence to be segmented by using the new word discovery model.
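The dictionary comparison and the model dictionary update can be sketched together as below. Plain Python sets stand in for the Bloom-filter-backed dictionaries; any object that supports `in` and `add` would work the same way. The function name, the dictionary contents and the order of the checks are assumptions made for illustration.

```python
def find_new_words(alternatives, general, basic, stop, model_dict):
    """Return alternative words found in none of the dictionaries, and record them."""
    new_words = []
    for word in alternatives:
        if word in stop or word in general or word in basic:
            continue  # stop word or already-known word
        if word in model_dict:
            continue  # already discovered in an earlier run
        new_words.append(word)
        model_dict.add(word)  # feed the new word back into the model dictionary
    return new_words

# Hypothetical dictionaries and alternative words.
general = {"检查", "客户"}
basic = {"电价"}
stop = {"的", "了"}
model_dict = {"虚拟户"}
found = find_new_words(["电价", "的", "虚拟户", "电价执行", "电价执行"],
                       general, basic, stop, model_dict)
print(found, sorted(model_dict))
```

Because each new word is added to the model dictionary as soon as it is found, a repeated alternative word is reported only once.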

In this embodiment, n-gram word segmentation is followed by a feature system of word frequency, part of speech, mutual information, left adjacent entropy, right adjacent entropy, and the difference between left and right adjacent entropies; the random forest algorithm and the Bloom filter algorithm are then combined to form a complete inspection new word discovery method. This improves the efficiency of new word filtering and is of great significance for solving the problems described above.

As shown in fig. 2, a schematic structural diagram of the inspection new word discovery system provided by the present invention includes:

a word segmentation unit 21 for segmenting words of a sentence using an n-gram Chinese language model;

the word frequency filtering unit 22 is configured to filter the word segmentation result according to the word frequency to obtain candidate words;

the feature extraction unit 23 is configured to perform feature extraction on the candidate words to obtain feature indexes of the candidate words;

the model training unit 24 is used for constructing a word segmentation model by using a random forest algorithm according to the feature indexes of the candidate words and training the word segmentation model;

a part-of-speech filtering unit 25, configured to perform part-of-speech filtering on the word segmentation result of the word segmentation model to obtain an alternative word;

and the detection output unit 26 is used for comparing the alternative words with the dictionary according to the Bloom filter algorithm, filtering out existing segmented words and stop words and obtaining inspection new words.

Further, the system further comprises:

and the model dictionary unit 27 is configured to obtain the filtered candidate words, continuously update a word bank, and establish a model dictionary with existing word segmentation results.

Further, the system further comprises:

and the new word processing unit 28 is configured to establish a new word discovery model according to the model dictionary, and perform word segmentation processing on the sentence to be segmented by using the new word discovery model.

Further, the system further comprises:

and the characteristic index unit 29 is configured to store the calculated candidate word mutual information, the left and right adjacent entropies, the word frequency, the part of speech, and the left and right adjacent entropy difference as characteristic indexes.

In summary, the invention first performs word segmentation with an n-gram model and filters out candidate words whose word frequency falls below a threshold; it then calculates the mutual information and the left and right adjacent entropies of the candidate words, extracts their part-of-speech combination features, and builds a random forest model that is trained and tested on these feature indexes to ensure the accuracy of the new words; finally, after part-of-speech filtering, it introduces a Bloom filter algorithm to improve matching efficiency and outputs the new word discovery result. The scheme helps analysts quickly and accurately find new words appearing in inspection work orders, build a basic word bank covering all inspection specialties, support classification and recognition of work order texts, and improve the center's capability to analyze inspection work orders.

The present invention has been described in detail with reference to specific embodiments, but the above embodiments are merely illustrative, and the present invention is not limited to the above embodiments.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. An audit neologism discovery method, comprising:

segmenting words of the sentence by using an n-gram Chinese language model;

2. The method of claim 1, wherein the method further comprises:

inputting the filtered alternative words into a model dictionary;

3. The method of claim 1, wherein filtering the segmentation results according to word frequency comprises:

acquiring the word frequency of each word in the word segmentation result;

4. The method of claim 1, wherein the feature indicators of the candidate words comprise:

5. The method of claim 1, wherein said performing part-of-speech filtering on said segmentation model segmentation results comprises:

obtaining a word segmentation result of the word segmentation model;

6. The method of claim 1, wherein said comparing the candidate word to a dictionary according to a Bloom filter algorithm comprises:

7. An audit neologism discovery system, the system comprising:

8. The system of claim 7, wherein the system further comprises:

9. The system of claim 8, wherein the system further comprises:

10. The system of claim 7, wherein the system further comprises: