CN109522559B

CN109522559B - Method and system for Chinese word segmentation in power grid operation and distribution system

Info

Publication number: CN109522559B
Application number: CN201811417689.2A
Authority: CN
Inventors: 李志�; 夏同飞; 章玉龙; 郭振; 王超; 张学敏; 岳想想; 费晓璐
Original assignee: State Grid Information and Telecommunication Co Ltd; Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd; Anhui Jiyuan Software Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd; Anhui Jiyuan Software Co Ltd
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2023-03-31
Anticipated expiration: 2038-11-26
Also published as: CN109522559A

Abstract

The invention provides a method for Chinese word segmentation in a power grid operation and distribution system, which comprises the following steps: establishing a power grid operation and distribution word segmentation word bank; selecting a word segmentation word bank corresponding to a preset scene; carrying out hash indexing on the first 2 characters of the data to be processed one by one according to the word segmentation word bank in the second step; arranging the residual word strings of the processed data according to a preset sequence, and performing word-by-word matching on the arranged data according to the word segmentation word bank in the second step; extracting sample data to form a big data training set and a verification set; and evaluating the word feature indexes. The invention provides a word segmentation method for improving a TRIE index tree on the basis of a classical dictionary word segmentation method, and further provides a double-array Trie word segmentation method, which is more suitable for a power service environment; a Chinese word segmentation method is provided by combining with the scene requirements of the power business, the feature information of the power business object is efficiently and accurately extracted, and the feature extraction meets certain synonymy recognition rate, ambiguity recognition rate and new word recognition rate indexes.

Description

Method and system for Chinese word segmentation in power grid operation and distribution system

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a system for Chinese word segmentation in a power grid operation and distribution system.

Background

Power distribution and power utilization are core businesses of power grid enterprises, and the operation and distribution ledger is an important basis for carrying out those businesses. Because the operation, distribution and dispatching businesses are closely related, and their basic ledger records (such as lines, distribution areas, transformers and users) are managed by different departments yet overlap, linking and reconciling these records across systems is one of the difficulties of the power business.

At present, domestic scholars have carried out a large amount of research on Chinese unstructured text matching and have achieved certain results. The word segmentation and matching processes are the focus of this research, and the matching process generally also includes feature extraction and weight calculation. Word segmentation belongs to natural language understanding technology; it is the first link of semantic understanding and is a technology for correctly separating the words in a sentence. Unlike English, where words are separated by spaces, Chinese has no fixed separator between words; together with the problems of ambiguity and new word recognition, this makes word segmentation relatively difficult.

Existing Chinese word segmentation methods can generally be divided into 3 types: dictionary-based methods, statistics-based methods, and understanding-based methods. The dictionary-based mechanical word segmentation method is the most mature; however, it is limited by the scale of the dictionary, has difficulty identifying unregistered new words, and is troubled by ambiguity problems. The ideal method is understanding-based word segmentation, in which a computer learns grammatical and semantic rules as a human does and makes correct segmentation choices according to those rules.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a method and a system for Chinese word segmentation in a power grid operation and distribution system, which can efficiently and accurately extract the characteristic information of a power business object, and the characteristic extraction meets certain synonymy recognition rate, ambiguity recognition rate and new word recognition rate indexes.

In order to achieve the purpose, the invention is realized by the following technical scheme:

a method for Chinese word segmentation in a power grid operation and distribution system comprises the following steps:

step one, establishing a power grid marketing and distribution word segmentation word bank;

step two, selecting a word segmentation word bank corresponding to a preset scene;

step three, carrying out hash indexing one by one on the first 2 characters of the data to be processed according to the word segmentation word bank in step two;

step four, arranging the residual word strings of the processed data according to a preset sequence, and performing word-by-word matching on the arranged data according to the word segmentation word bank in step two;

step five, extracting sample data to form a big data training set and a verification set;

and step six, evaluating the word segmentation feature indexes.

Further, the second step specifically includes: selecting the scene of matching distribution line names across the dispatching, operation and inspection, and marketing systems; selecting the scene of matching transformer substation names across the dispatching and marketing systems; and selecting the scene of matching distribution area names across the power operation and inspection and marketing systems.

Further, each node in the method is expressed by two array elements with the same index, one in an array for determining state transition and one in an array for checking the correctness of the transition.

Furthermore, the word segmentation feature indexes comprise accuracy and recall rate, and the accuracy is calculated by

$$\text{accuracy} = \frac{b}{a}$$

where b represents the number of correctly segmented words, and a represents the total number of segmented words;

the recall rate is calculated by

$$\text{recall} = \frac{b}{n}$$

where b denotes the number of correctly segmented words and n denotes the total number of words that should be segmented.
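The two indexes can be illustrated with a minimal sketch, assuming accuracy is the ratio b/a and recall is the ratio b/n as the definitions of a, b and n suggest. The function names and the sample name below are hypothetical, and counting a word as "correctly segmented" when its character span coincides with a span in the reference segmentation is one common convention, not necessarily the one used by the invention.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def accuracy_and_recall(predicted, reference):
    """Return (accuracy, recall) for one segmented string.

    b: number of correctly segmented words (spans found in both lists)
    a: total number of words produced by the segmenter
    n: total number of words that should have been produced
    """
    pred_spans, ref_spans = to_spans(predicted), to_spans(reference)
    b = len(pred_spans & ref_spans)
    a = len(pred_spans)
    n = len(ref_spans)
    accuracy = b / a if a else 0.0
    recall = b / n if n else 0.0
    return accuracy, recall


# Hypothetical equipment name; the segmenter wrongly splits the line name.
reference = ["10kV", "城南线", "配电室"]
predicted = ["10kV", "城南", "线", "配电室"]
accuracy, recall = accuracy_and_recall(predicted, reference)
print(accuracy, recall)
```

Here b = 2, a = 4 and n = 3, so the accuracy is 0.5 and the recall is 2/3.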

A system for Chinese word segmentation in a power grid operation and distribution system comprises:

the word bank establishing module is used for establishing a power grid marketing and distribution word division word bank;

the scene selection module is used for selecting a word segmentation word bank corresponding to a preset scene;

the Trie node index module is used for carrying out hash index one by one on the first 2 characters of the data to be processed according to the word segmentation word bank selected by the scene selection module;

the Trie mechanism index module is used for arranging the residual word strings of the processed data according to a preset sequence and performing word-by-word matching on the arranged data according to the word segmentation word bank selected by the scene selection module;

the set generation module is used for extracting sample data to form a big data training set and a verification set;

and the characteristic index evaluation module is used for evaluating the word characteristic indexes.

Further, the scene selection module comprises:

the distribution circuit selection submodule is used for selecting the naming matching of the distribution circuit naming in the dispatching, operation and inspection and marketing systems;

the transformer substation selection submodule is used for selecting naming matching of the transformer substation in the dispatching and marketing system;

and the power distribution area selection submodule selects the naming matching of the power distribution area in the electric power operation inspection and marketing system.

Further, the set generation module includes:

the training set generation submodule is used for extracting sample data to form a big data training set;

and the verification set generation submodule is used for extracting sample data to form a large data verification set.

Further, the characteristic index evaluation module includes:

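an accuracy calculation submodule, whose accuracy calculation method is

$$\text{accuracy} = \frac{b}{a}$$

where b represents the number of correctly segmented words and a represents the total number of segmented words;

a recall rate calculation submodule, whose recall rate calculation method is

$$\text{recall} = \frac{b}{n}$$

where b represents the number of correctly segmented words and n represents the total number of words that should be segmented.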

Compared with the prior art, the invention has the following beneficial effects:

the invention provides a word segmentation method for improving a TRIE index tree on the basis of a classical dictionary word segmentation method, and further provides a double-array Trie word segmentation method which is more suitable for a power service environment; the Chinese word segmentation method is provided by combining with the scene requirements of the power service, the feature information of the power service object is efficiently and accurately extracted, and the feature extraction meets certain synonymy recognition rate, ambiguity recognition rate and new word recognition rate indexes.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram of the system architecture of the present invention;

FIG. 3 is a block diagram of a scene selection module according to the present invention;

FIG. 4 is a block diagram of the structure of a set generation module according to the present invention;

FIG. 5 is a block diagram of a feature index evaluation module according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a method for Chinese word segmentation in a power grid marketing and distribution system, which comprises the following steps:

S1, establishing a power grid marketing and distribution word segmentation word bank;

S2, selecting a word segmentation word bank corresponding to a preset scene;

S3, performing hash indexing one by one on the first 2 characters of the data to be processed according to the word segmentation word bank in S2;

S4, arranging the remaining word strings of the processed data according to a preset sequence, and performing word-by-word matching on the arranged data according to the word segmentation word bank in S2;

S5, extracting sample data to form a big data training set and a verification set;

and S6, evaluating the word segmentation feature indexes.

Specifically, S2 includes: selecting the scene of matching distribution line names across the dispatching, operation and inspection, and marketing systems; selecting the scene of matching transformer substation names across the dispatching and marketing systems; and selecting the scene of matching distribution area names across the power operation and inspection and marketing systems.

Specifically, each node in the method is expressed by two array elements with the same subscript, one in an array for determining state transition and one in an array for checking the correctness of the transition.

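Specifically, the word segmentation feature indexes comprise accuracy and recall rate, and the accuracy is calculated by

$$\text{accuracy} = \frac{b}{a}$$

the recall rate is calculated by

$$\text{recall} = \frac{b}{n}$$

where b represents the number of correctly segmented words, a represents the total number of segmented words, and n represents the total number of words that should be segmented.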

The invention also provides a system for Chinese word segmentation in a power grid marketing and distribution system, which comprises:

the word bank establishing module 201 is used for establishing a power grid marketing and distribution word bank;

a scene selection module 202, configured to select a word segmentation lexicon corresponding to a preset scene;

the Trie node index module 203 is configured to perform hash indexing one by one on the first 2 characters of the data to be processed according to the word segmentation lexicon selected by the scene selection module 202;

the Trie mechanism index module 204 is configured to arrange the remaining word strings of the processed data according to a preset sequence, and perform word-by-word matching on the arranged data according to the word segmentation lexicon selected by the scene selection module 202;

the set generating module 205 is configured to extract sample data to form a big data training set and a verification set;

and the feature index evaluation module 206 is used for evaluating the word feature indexes.

Specifically, the scene selection module 202 includes:

the distribution circuit selection submodule 301 is used for selecting the naming matching of the distribution circuit naming in the dispatching, operation and inspection and marketing system;

the transformer substation selection submodule 302 is used for selecting naming matching of a transformer substation in a dispatching and marketing system;

and the power distribution area selection submodule 303 selects naming matching of the power distribution area in the electric power operation inspection and marketing system.

Specifically, the set generating module 205 includes:

a training set generation submodule 401, configured to extract sample data to form a big data training set;

the verification set generation submodule 402 is configured to extract sample data to form a large data verification set.

Specifically, the feature index evaluation module 206 includes:

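the accuracy calculation submodule 501, whose accuracy calculation method is

$$\text{accuracy} = \frac{b}{a}$$

the recall rate calculation submodule 502, whose recall rate calculation method is

$$\text{recall} = \frac{b}{n}$$

where b represents the number of correctly segmented words, a represents the total number of segmented words, and n represents the total number of words that should be segmented.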

In order to adapt to the naming habits of electric power objects in different regions, different systems and different time periods, key features of the power objects are extracted from their names. The recognition effect of classical dictionary-based and statistics-based Chinese word segmentation methods in power business scenes is studied and evaluated using key indexes such as synonymy recognition rate, ambiguity recognition rate and new word recognition rate. On the basis of classical Chinese word segmentation, and with reference to current mainstream research directions, the invention provides an improved Trie index tree oriented to the power business scene and a double-array Trie Chinese word segmentation method oriented to the power business scene:

The classical Chinese word segmentation method relies on a machine dictionary: every segmentation step consults a word list, namely the word segmentation dictionary, and involves little linguistic information such as lexical, semantic or syntactic knowledge. The dictionary lists its entries by category, and the number of entries, the selection of entries and the organization structure of the dictionary directly influence the final word segmentation effect.

The basic idea of classical word segmentation is to first build a lexicon, i.e. a word segmentation dictionary, containing as many words as possible. For a given Chinese character string s to be segmented, a substring of s is taken according to a fixed principle (forward or reverse). If the substring matches an entry in the dictionary, the substring is a word and is cut off, and the rest of the string continues to be segmented until it is empty; otherwise, the substring is not a word, and the next substring is taken for matching. According to the scanning direction, classical word segmentation can be divided into forward matching and reverse matching; according to which length is matched preferentially, it can be divided into maximum (longest) matching and minimum (shortest) matching; according to whether it is combined with part-of-speech tagging, it can be divided into simple word segmentation and integrated methods combining word segmentation and tagging.
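As an illustration of the classical scheme, the following is a minimal sketch of forward maximum (longest) matching. The function name, the tiny lexicon and the sample string are hypothetical, and falling back to a single character when no entry matches is an assumption of this sketch rather than something the description specifies.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment `text` by scanning left to right and always taking the
    longest substring that appears in `lexicon` (a set of words)."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fallback (assumption): an unmatched single character is emitted as is.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


lexicon = {"城南", "城南线", "配电", "配电室"}
print(forward_max_match("城南线配电室", lexicon))  # ['城南线', '配电室']
```

Reverse matching follows the same pattern but scans from the end of the string, and minimum matching tries the shortest candidate first.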

The invention selects different scenes, namely the matching of distribution line names across the dispatching, operation and inspection, and marketing systems, the matching of transformer substation names across the dispatching and marketing systems, and the matching of distribution area names across the power operation and inspection and marketing systems, applies different classical Chinese word segmentation methods to extract features, and verifies how well the segmented features express the names. The research work comprises the following steps: sample data extraction, training set and verification set setting, implementation of the word segmentation algorithm, Chinese word segmentation feature extraction, word segmentation feature index evaluation and the like.
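The setting of the training set and verification set can be sketched as a simple random split. The description does not state how the samples are divided, so the 80/20 ratio, the fixed seed, the function name and the sample names below are all assumptions made only for illustration.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples reproducibly and split them into a training set
    and a verification set. The ratio and the seed are illustrative."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical object names extracted from one of the source systems.
names = [f"10kV线路{i}" for i in range(10)]
training_set, verification_set = split_samples(names)
print(len(training_set), len(verification_set))  # 8 2
```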

The Trie index tree is a key tree expressed in the form of a multi-way linked list, and consists of 2 parts: the Trie index tree nodes and the Trie index mechanism. The tree structure expresses the covering and preferential matching relations between the Chinese dictionary and each word in the dictionary. In the word segmentation application, the sentence to be segmented only needs to be matched character by character along the tree chain, without knowing in advance the length of the word to be queried.
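A minimal sketch of a plain Trie illustrates why the word length need not be known in advance: the lookup simply follows the chain of child nodes until no further transition exists. The class and function names and the sample lexicon are hypothetical, and Python dictionaries stand in for the linked lists of child nodes.

```python
class TrieNode:
    """One node of the tree: child links keyed by character, plus a word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(lexicon):
    """Insert every word of the lexicon into a fresh Trie and return its root."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def longest_match(root, text, start):
    """Return the length of the longest lexicon word beginning at text[start],
    or 0 if none matches. Characters are consumed one by one along the chain."""
    node, best = root, 0
    for offset, ch in enumerate(text[start:], 1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.is_word:
            best = offset
    return best


root = build_trie(["变电", "变电站", "配电"])
print(longest_match(root, "变电站出线", 0))  # 3
```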

Because two-character words are common in Chinese, the dictionary indexing mechanism of the Trie index tree is improved: the first 2 characters are hash-indexed one by one and the remaining word strings are stored in order, and a character-by-character matching method is adopted in the query process. That is, words of up to 2 characters are handled by the Trie index tree mechanism, and the remaining parts of long words of 3 or more characters are organized in linear tables, so that deep search is avoided and the word segmentation speed is improved without increasing the maintenance complexity of a typical dictionary mechanism.
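The improved structure can be sketched as two levels of hash lookup followed by a sorted linear table of remainders. This is only an illustration under assumptions: the function names and the lexicon are hypothetical, nested Python dictionaries stand in for the hash index, and the remainders are compared with a simple linear scan rather than the matching procedure of the invention.

```python
from bisect import insort


def build_index(lexicon):
    """Index words of 2 or more characters:
    first character -> second character -> entry, where the entry records
    whether the 2-character prefix is itself a word and keeps the
    remainders of longer words in a sorted list (the linear table)."""
    index = {}
    for word in lexicon:
        if len(word) < 2:
            continue  # single characters are handled by the fallback below
        entry = index.setdefault(word[0], {}).setdefault(
            word[1], {"is_word": False, "tails": []})
        if len(word) == 2:
            entry["is_word"] = True
        elif word[2:] not in entry["tails"]:
            insort(entry["tails"], word[2:])
    return index


def longest_match(index, text, start):
    """Length of the longest indexed word starting at text[start], or 0."""
    if start + 2 > len(text):
        return 0
    entry = index.get(text[start], {}).get(text[start + 1])
    if entry is None:
        return 0
    best = 2 if entry["is_word"] else 0
    rest = text[start + 2:]
    for tail in entry["tails"]:  # linear table: no deeper tree levels to search
        if rest.startswith(tail) and 2 + len(tail) > best:
            best = 2 + len(tail)
    return best


def segment(text, index):
    """Forward longest-match segmentation; unmatched characters stand alone."""
    words, i = [], 0
    while i < len(text):
        size = longest_match(index, text, i) or 1
        words.append(text[i:i + size])
        i += size
    return words


index = build_index(["配电", "配电室", "配电变压器", "线路"])
print(segment("配电变压器线路", index))  # ['配电变压器', '线路']
```

The depth of the lookup is fixed at two hash steps regardless of word length, which is the property the description relies on to avoid deep searches.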

On the basis of a classical Chinese word segmentation method, the invention researches an establishment method of an improved Trie index tree, a maintenance method of the improved Trie index tree, an application method of the improved Trie index tree in a typical power service scene and a feature extraction effect.

The double-array Trie is a variant of the Trie tree, a data structure proposed to improve space utilization while preserving the retrieval speed of the Trie tree. It is essentially a deterministic finite automaton: each node represents one state of the automaton, state transitions are made according to the input variable, and the query is complete when an end state is reached or no transition is possible. Two linear arrays (base and check) are used to express the Trie tree; each node in the Trie tree is expressed by two array elements with the same subscript, the base array is used to determine state transitions, and the check array is used to check the correctness of a transition.
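The role of the two arrays can be shown with a minimal sketch. From state s and character code c, the candidate next state is t = base[s] + c, and the transition is accepted only if check[t] == s. The construction below is a simple first-fit placement chosen for brevity, and the character coding, the reserved end-of-word code and all names are assumptions of this sketch, not details taken from the invention.

```python
from collections import deque

END = 1  # code reserved for the end-of-word marker (assumption of this sketch)


def build_double_array(lexicon):
    """Build base/check arrays for the lexicon; state 0 is the root."""
    # Character codes start at 2 because code 1 is the end-of-word marker.
    chars = sorted({ch for word in lexicon for ch in word})
    codes = {ch: i + 2 for i, ch in enumerate(chars)}
    # Build an ordinary nested-dict trie first, then lay it out in the arrays.
    trie = {}
    for word in lexicon:
        node = trie
        for ch in word:
            node = node.setdefault(codes[ch], {})
        node[END] = {}
    base, check = [0], [0]  # check[0] = 0 marks the root slot as occupied

    def ensure(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)  # -1 marks a free slot

    queue = deque([(0, trie)])
    while queue:
        state, node = queue.popleft()
        if not node:
            continue  # leaf state: no outgoing transitions
        offset = 1
        while True:  # first fit: smallest offset whose target slots are all free
            ensure(offset + max(node) + 1)
            if all(check[offset + code] == -1 for code in node):
                break
            offset += 1
        base[state] = offset
        for code, child in node.items():
            check[offset + code] = state  # the slot now belongs to this parent
            queue.append((offset + code, child))
    return base, check, codes


def step(base, check, state, code):
    """Follow one transition; return the next state, or -1 if it is invalid."""
    target = base[state] + code
    if target < len(check) and check[target] == state:
        return target
    return -1


def contains(word, base, check, codes):
    """Return True if `word` is stored in the double-array Trie."""
    state = 0
    for ch in word:
        code = codes.get(ch)
        if code is None:
            return False
        state = step(base, check, state, code)
        if state < 0:
            return False
    return step(base, check, state, END) >= 0


base, check, codes = build_double_array(["变电", "变电站", "配电"])
print(contains("变电站", base, check, codes), contains("变", base, check, codes))
```

Each lookup costs one addition and one comparison per character, and the whole tree occupies two flat integer arrays instead of one linked node per character.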

On the basis of the improved Trie index tree Chinese word segmentation method, the invention researches an establishing method of a double-array Trie index tree, a maintenance method of the double-array Trie index tree, an application method of the double-array Trie index tree in a typical electric power service scene and a feature extraction effect.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for Chinese word segmentation in a power grid operation and distribution system is characterized by comprising the following steps:

selecting a word segmentation word bank corresponding to a preset scene;

step four, arranging the rest word strings of the processed data according to a preset sequence, and performing word-by-word matching on the arranged data according to the word segmentation word bank in the step two;

and sixthly, evaluating the word feature indexes.

2. The method for Chinese word segmentation in the power grid operation and distribution system according to claim 1, wherein the second step specifically comprises: selecting the naming matching of the distribution line naming in the dispatching, operation and inspection and marketing system; selecting naming matching of the transformer substation in a dispatching and marketing system; and selecting the naming matching of the distribution station in the electric power operation inspection and marketing system.

3. The method for Chinese word segmentation in the power grid operation and distribution system according to claim 1, wherein the method comprises the following steps: in the method, each node uses two arrays of the same subscript for element expression, including an array for determining state transition and an array for checking the correctness of the transition.

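4. The method for Chinese word segmentation in the power grid operation and distribution system according to claim 1, wherein the method comprises the following steps: the word segmentation characteristic indexes comprise accuracy and recall rate, and the accuracy is calculated by

$$\text{accuracy} = \frac{b}{a}$$

the recall rate is calculated by

$$\text{recall} = \frac{b}{n}$$

where b represents the number of correctly segmented words, a represents the total number of segmented words, and n represents the total number of words that should be segmented.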

5. A system for Chinese word segmentation in a power grid operation and distribution system, the system comprising:

the word bank establishing module is used for establishing a power grid operation and distribution word bank;

6. The system for Chinese word segmentation in the power grid operation and distribution system according to claim 5, wherein the scene selection module comprises:

the distribution circuit selecting submodule is used for selecting the naming matching of the distribution circuit naming in the dispatching, operation and inspection and marketing system;

7. The system for Chinese word segmentation in the power grid operation and distribution system according to claim 5, wherein the set generation module comprises:

8. The system for Chinese word segmentation in the power grid operation and distribution system according to claim 5, wherein the characteristic index evaluation module comprises:

an accuracy calculation submodule including an accuracy calculation method of