CN110413998B

CN110413998B - Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof

Info

Publication number: CN110413998B
Application number: CN201910638948.2A
Authority: CN
Inventors: 张云翔; 饶竹一
Original assignee: Shenzhen Power Supply Bureau Co Ltd
Current assignee: Shenzhen Power Supply Bureau Co Ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2023-04-21
Anticipated expiration: 2039-07-16
Also published as: CN110413998A

Abstract

The invention relates to a self-adaptive Chinese word segmentation method oriented to the power industry, and a system and a medium thereof, wherein the method comprises the following steps: S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented; S2, carrying out segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences; S3, segmenting each candidate text sentence to obtain one or more word segments; S4, replacing the word segments in the candidate text terms one by one with words having the same meaning and carrying out semantic discrimination, returning to S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous; S5, acquiring one or more power-field professional vocabulary items semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional vocabulary items, and determining a final word segment according to the similarity; S6, sorting the final word segments according to their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.

Description

Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof

Technical Field

The invention relates to the technical field of data processing of power equipment, in particular to a self-adaptive Chinese word segmentation method and system for the power industry and a computer readable storage medium.

Background

In recent years, as networks have become increasingly widespread, the volume of text on the internet has grown steadily and information resources have continued to increase. To retrieve and mine valuable information from this large body of resources, internet companies are vigorously developing natural language processing technology. Chinese word segmentation is the basis and premise of natural language processing; it plays an important role in information processing tasks such as information retrieval, machine translation and information filtering, and is both a key technology and a difficulty of information processing. So far, the national grid company has established a large number of data management systems, and the volume of business data is huge.

Therefore, the following technical problem exists: because each business department and each business system defines data information by different rules, data from the same source is in practice named inconsistently across business systems. One datum thus appears to come from multiple sources, which makes it difficult to keep data consistent among the business systems.

Disclosure of Invention

The invention aims to provide a self-adaptive Chinese word segmentation method and system for the power industry and a computer readable storage medium, so as to solve the technical problems.

In order to achieve the object of the present invention, according to a first aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation method for the power industry, including the steps of:

Step S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;

Step S2, carrying out segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;

Step S3, segmenting each candidate text sentence to obtain one or more word segments;

Step S4, replacing the word segments in the candidate text terms one by one with words having the same meaning and carrying out semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if the text terms before and after replacement are not ambiguous;

Step S5, acquiring one or more power-field professional vocabulary items semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more power-field professional vocabulary items, and determining a final word segment according to the similarity;

Step S6, sorting the final word segments according to their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.
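The synonym-replacement screening of step S4 can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this document: the synonym table `synonyms` and the predicate `is_ambiguous(before, after)` are hypothetical placeholders for the semantic discrimination, and the return to step S3 on ambiguity is simplified here to dropping the word segment.

```python
from typing import Callable, Dict, List


def screen_by_synonym_replacement(
    sentence: str,
    words: List[str],
    synonyms: Dict[str, List[str]],
    is_ambiguous: Callable[[str, str], bool],
) -> List[str]:
    """Keep a word segment only if no synonym swap makes the text ambiguous.

    `synonyms` and `is_ambiguous` are hypothetical inputs; the document does
    not define how synonyms are found or how ambiguity is detected.
    """
    candidates = []
    for word in words:
        # Build one variant of the sentence per synonym of this word segment.
        variants = [sentence.replace(word, syn) for syn in synonyms.get(word, [])]
        # Simplification: an ambiguous word is dropped instead of re-running S3.
        if not any(is_ambiguous(sentence, variant) for variant in variants):
            candidates.append(word)
    return candidates
```

In a fuller implementation, an ambiguous result would trigger a new segmentation of the sentence rather than a simple discard.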

Preferably, the step S2 includes:

separating punctuations and spaces in the candidate text terms to obtain a plurality of text parts, and removing the punctuations and the spaces in the text parts to obtain a plurality of text sentences to be filtered;

judging whether each character in each text sentence to be filtered belongs to a professional word segment of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words, and if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.
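The first part of step S2, separating the text at punctuation and spaces, can be sketched in Python as follows. The delimiter set in `_DELIMITERS` is an assumption chosen for illustration; the document does not list which punctuation marks are treated as separators.

```python
import re
from typing import List

# Assumed delimiter set: whitespace plus common ASCII and full-width punctuation.
_DELIMITERS = re.compile(r"[\s,.;:!?、，。；：！？（）()]+")


def split_into_text_sentences(candidate_text: str) -> List[str]:
    """Split the candidate text at punctuation and spaces, dropping the delimiters."""
    parts = _DELIMITERS.split(candidate_text)
    # Leading or trailing delimiters produce empty strings, which are removed.
    return [part for part in parts if part]
```

For example, `split_into_text_sentences("变压器 过载，断路器跳闸。")` yields three text sentences to be filtered.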

Preferably, the step S3 includes:

extracting, from the candidate text sentences, vocabulary that corresponds to vocabulary in a dictionary database to obtain word segments; wherein the vocabulary in the dictionary database is the vocabulary of a word segmentation dictionary specific to the power field.
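The document does not say how dictionary entries are located in a candidate text sentence. The sketch below uses forward maximum matching, a common dictionary-based technique, purely as an assumed stand-in; the function name and the toy dictionary are hypothetical.

```python
from typing import List, Set


def extract_dictionary_words(sentence: str, dictionary: Set[str]) -> List[str]:
    """Extract dictionary words from a sentence by forward maximum matching."""
    if not dictionary:
        return []
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(sentence):
        match = None
        # Try the longest possible piece first, then shrink it one character at a time.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                match = piece
                break
        if match:
            words.append(match)
            i += len(match)
        else:
            i += 1  # No dictionary word starts here; skip this character.
    return words
```

Preferring the longest match means that a full term such as 变压器 is chosen over a shorter entry such as 变压 when both are in the dictionary.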

Preferably, the step S4 includes:

when a candidate text sentence corresponds to a plurality of candidate word segments, calculating the similarity value between each candidate word segment in the candidate text sentence and the one or more power-field professional vocabulary items, and accumulating these values to obtain the similarity value corresponding to that candidate word segment;

and selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
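The accumulation and selection described above can be sketched in Python. The similarity measure is not specified in this document, so the character-level Jaccard similarity in `char_jaccard` is only an assumed placeholder, and both function names are hypothetical.

```python
from typing import List, Optional


def char_jaccard(a: str, b: str) -> float:
    """Assumed similarity measure: Jaccard overlap of the two character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_final_word(candidates: List[str], domain_terms: List[str]) -> Optional[str]:
    """Sum each candidate's similarity to all domain terms and return the best one."""
    if not candidates:
        return None
    scores = {
        cand: sum(char_jaccard(cand, term) for term in domain_terms)
        for cand in candidates
    }
    # On a tie, max() keeps the candidate that appears first in the list.
    return max(candidates, key=lambda cand: scores[cand])
```

Any other similarity function with the same signature, for example one based on word embeddings, could be substituted for `char_jaccard` without changing the selection logic.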

Preferably, the step S6 includes:

outputting the sorted final word segments separated by spaces, selecting the top ten ranked word segments for highlighted display, and hiding the other final word segmentation results.
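The frequency ranking and top-ten display can be sketched as follows. The function name `rank_and_split` and its return shape are hypothetical, and counting occurrences with `str.count` is an assumption about how frequency is measured.

```python
from typing import List, Tuple


def rank_and_split(
    final_words: List[str], candidate_text: str, top_n: int = 10
) -> Tuple[str, List[str], List[str]]:
    """Rank final word segments by frequency in the text.

    Returns the space-separated output line, the segments to highlight,
    and the segments to hide.
    """
    # Count non-overlapping occurrences of each distinct word segment.
    counts = {word: candidate_text.count(word) for word in set(final_words)}
    # Sort by descending frequency; ties are broken by the word itself.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return " ".join(ranked), ranked[:top_n], ranked[top_n:]
```

The hidden list holds the results that would only be shown when the user asks to see them.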

According to a second aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:

the text acquisition unit is used for acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;

the text segmentation unit is used for carrying out segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;

the word segmentation unit is used for segmenting each candidate text sentence to obtain one or more word segments;

the first word segmentation screening unit is used for replacing the word segments in the candidate text terms one by one with words having the same meaning and carrying out semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if the text terms before and after replacement are not ambiguous;

the second word segmentation screening unit is used for acquiring one or more power-field professional vocabulary items semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more power-field professional vocabulary items, and determining a final word segment according to the similarity;

and the output unit is used for sequencing and outputting the final word segmentation according to the occurrence frequency of the word segmentation in the candidate text terms.

Preferably, the text segmentation unit includes:

the first segmentation unit is used for separating punctuation and space in the candidate text terms to obtain a plurality of text parts, and removing the punctuation and space in the text parts to obtain a plurality of text sentences to be filtered;

the second segmentation unit is used for judging whether each character in each text sentence to be filtered belongs to a professional word segment of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words, and if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.

Preferably, the word segmentation unit is specifically configured to extract a vocabulary corresponding to a vocabulary in the dictionary database in the candidate text sentence to obtain a segmented word; wherein, the vocabulary in the dictionary database is the vocabulary in the special word segmentation dictionary in the electric power field;

the output unit includes:

the similarity calculation unit is used for calculating, when a candidate text sentence corresponds to a plurality of candidate word segments, the similarity value between each candidate word segment in the candidate text sentence and the one or more power-field professional vocabulary items, and accumulating these values to obtain the similarity value corresponding to that candidate word segment;

and the final word segmentation determining unit is used for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.

Preferably, the output unit includes:

the display unit is used for outputting the sorted final word segments separated by spaces, selecting the top ten ranked word segments for highlighted display, and hiding the other final word segmentation results.

According to a third aspect of the present invention, an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the power industry oriented adaptive Chinese word segmentation method.

In the embodiment of the invention, the characteristics of electric power data are taken into account and a word segmentation dictionary database specific to the power field is established. Candidate word segments are obtained by splitting the candidate text sentences according to the vocabulary in the dictionary database and discriminating ambiguity, and the final word segments are then determined according to the similarity between the candidate word segments and similar vocabulary in the dictionary database. This greatly improves word segmentation accuracy, and the resulting data matching analysis among the business systems can significantly improve working efficiency and the efficiency of data use.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings. Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a self-adaptive Chinese word segmentation method for the power industry according to a first embodiment of the present invention.

Fig. 2 is a schematic diagram of a self-adaptive Chinese word segmentation system for the power industry in a second embodiment of the present invention.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

In addition, numerous specific details are set forth in the following examples in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail in order to not obscure the present invention.

As shown in Fig. 1, the embodiment of the invention provides a self-adaptive Chinese word segmentation method for the power industry, which comprises the following steps:

The step S2 specifically includes:

Specifically, for a text sentence to be filtered, the first character is extracted and it is judged whether that character belongs to a professional word segment of the power industry; if so, all occurrences of the same character in the text sentence are extracted and segmented into words, and if not, all occurrences of the same character in the text sentence are extracted and discarded. The subsequent characters are then judged in the same way until the last character in the text sentence to be filtered has been taken out, which completes the filtering and yields the candidate text sentence. Whether a character belongs to a power-industry professional term is judged by comparing the character taken out of the text sentence against the professional vocabulary of the power industry, using the constructed word segmentation dictionary of power-industry professional vocabulary and everyday vocabulary.
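One possible reading of this character-by-character filter is sketched below in Python. It assumes that a character counts as professional when it occurs in at least one term of the power-industry dictionary; the function name and the decision cache `decided` are hypothetical, and the document may intend a different test.

```python
from typing import Dict, Set


def filter_text_sentence(sentence: str, domain_terms: Set[str]) -> str:
    """Keep characters that occur in some domain term and discard the rest."""
    # Assumption: a character is professional if any dictionary term contains it.
    domain_chars = {ch for term in domain_terms for ch in term}
    decided: Dict[str, bool] = {}
    kept = []
    for ch in sentence:
        # Each distinct character is judged once, so every occurrence of the
        # same character is kept or discarded together.
        if ch not in decided:
            decided[ch] = ch in domain_chars
        if decided[ch]:
            kept.append(ch)
    return "".join(kept)
```

Caching the decision per character mirrors the point that identical characters need not be compared against the dictionary again.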

Wherein, the step S3 includes:

Specifically, one candidate text sentence may contain zero or more word segments that correspond to vocabulary in the dictionary database and are semantically similar to one another.

Wherein, the step S4 includes:

Specifically, one candidate text sentence may correspond to a plurality of candidate word segments. In this step, the candidate word segments are screened according to their similarity values, so that each candidate text sentence finally outputs only one word segment, which reduces the word segmentation error rate.

Wherein, the step S6 includes:

Specifically, in this embodiment, the word segmentation results obtained by calculation are ranked by frequency of occurrence and output separated by spaces. The top ten ranked results are selected for highlighted display and the remaining results are hidden; when the user needs to view them, clicking the corresponding button displays the remaining results. All word segmentation results are also output to a display device in the form of a bar chart and shown to the user.

According to the embodiment of the invention, the word segmentation data are drawn from a word segmentation dictionary specific to the power field. Separating the extracted candidate text terms into a plurality of text sentences preprocesses the text terms, reduces the interference that punctuation marks and spaces cause during word segmentation, and improves the efficiency of processing the text terms. The extracted characters are compared one by one to judge whether they are professional word segments of the power field, until the last character in the text sentence has been extracted; because all occurrences of the same character are handled together, identical characters need not be compared again, which reduces the workload of character comparison and makes it more efficient. The segmented candidate text terms are then segmented into words, and ambiguity discrimination is applied to the resulting word segments until no ambiguity remains, which reduces the ambiguity that arises after the text terms are segmented. Finally, the similarity values are calculated and the results are sorted and displayed, so that the user can view the word segmentation results more clearly and intuitively.

As shown in Fig. 2, a second embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:

a text obtaining unit 1, configured to obtain candidate text terms, where the candidate text terms are phrases or paragraphs to be segmented;

a text segmentation unit 2, configured to perform segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;

the word segmentation unit 3 is used for segmenting each candidate text sentence to obtain one or more word segments;

the first word segmentation screening unit 4 is used for replacing the word segments in the candidate text terms one by one with words having the same meaning and carrying out semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if the text terms before and after replacement are not ambiguous;

the second word segmentation screening unit 5 is used for acquiring one or more power-field professional vocabulary items semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more power-field professional vocabulary items, and determining a final word segment according to the similarity;

and the output unit 6 is used for sorting and outputting the final word segmentation according to the occurrence frequency of the word segmentation in the candidate text terms.

Wherein the text segmentation unit 2 comprises:

The word segmentation unit 3 is specifically configured to extract a vocabulary corresponding to a vocabulary in a dictionary database in a candidate text sentence to obtain a segmented word; wherein, the vocabulary in the dictionary database is the vocabulary in the special word segmentation dictionary in the electric power field;

the output unit 6 includes:

Wherein the output unit 6 includes:

It should be noted that the system of the second embodiment corresponds to the method of the first embodiment, and is used for implementing the method of the first embodiment, so that other undescribed contents of the system of the second embodiment can be obtained by referring to the method of the first embodiment, and are not repeated herein.
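The correspondence between the six units and the method steps can be illustrated with a small Python sketch. The class `SegmentationSystem`, its field names, and the callable signatures are all hypothetical; each unit is modelled as a pluggable function so that any concrete implementation of a step can be supplied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SegmentationSystem:
    """Hypothetical wiring of units 1 to 6; each field is a pluggable callable."""
    acquire_text: Callable[[], str]                      # text obtaining unit 1
    segment_text: Callable[[str], List[str]]             # text segmentation unit 2
    segment_words: Callable[[str], List[str]]            # word segmentation unit 3
    first_screen: Callable[[str, List[str]], List[str]]  # first screening unit 4
    second_screen: Callable[[List[str]], List[str]]      # second screening unit 5
    output: Callable[[List[str], str], List[str]]        # output unit 6

    def run(self) -> List[str]:
        text = self.acquire_text()
        final_words: List[str] = []
        # Units 3 to 5 are applied to each candidate text sentence in turn.
        for sentence in self.segment_text(text):
            words = self.segment_words(sentence)
            candidates = self.first_screen(sentence, words)
            final_words.extend(self.second_screen(candidates))
        # Unit 6 orders the final word segments using the original text.
        return self.output(final_words, text)
```

Keeping the units as separate callables lets each one be replaced or tested independently of the others.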

It should also be appreciated that the method of embodiment one and the system of embodiment two may be implemented in numerous ways, including as a process, an apparatus, or a system. The methods described herein may be implemented in part by program instructions for instructing a processor to perform such methods, as well as such instructions recorded on a non-transitory computer-readable storage medium such as a hard disk drive, floppy disk, optical disk (such as a Compact Disc (CD) or Digital Versatile Disc (DVD)), flash memory, and the like. In some embodiments, the program instructions may be stored remotely and transmitted over a network via optical or electronic communication links.

An embodiment of the present invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the adaptive Chinese word segmentation method for the power industry of embodiment one.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. The self-adaptive Chinese word segmentation method for the power industry is characterized by comprising the following steps of:

S2, carrying out segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences; separating punctuations and spaces in the candidate text terms to obtain a plurality of text parts, and removing the punctuations and the spaces in the text parts to obtain a plurality of text sentences to be filtered; judging whether characters in each text sentence to be filtered are professional word segmentation in the electric power industry, if so, extracting all the same characters in the text sentence and segmenting the same characters into words, wherein the word segmentation is to segment the characters and the characters after the characters together to obtain a candidate text sentence; if not, extracting all the same characters in the text sentence and discarding the same characters;

S3, segmenting each candidate text sentence to obtain one or more segmented words; extracting vocabulary corresponding to vocabulary in a dictionary database in the candidate text sentence to obtain word segmentation; the vocabulary in the dictionary database is the vocabulary in the special word segmentation dictionary in the electric power field;

2. The power industry-oriented adaptive Chinese word segmentation method as in claim 1, wherein step S4 comprises:

3. The power industry-oriented adaptive Chinese word segmentation method as in claim 2, wherein step S6 comprises:

4. An adaptive Chinese word segmentation system for the power industry, comprising:

the word segmentation unit is used for segmenting each candidate text sentence to obtain one or more word segments; extracting vocabulary corresponding to vocabulary in a dictionary database in the candidate text sentence to obtain word segmentation; the vocabulary in the dictionary database is the vocabulary in the special word segmentation dictionary in the electric power field;

the second word screening unit is used for acquiring one or more electric power field professional vocabularies similar to the candidate word segmentation semanteme, calculating the similarity between the candidate word segmentation and the one or more electric power field professional vocabularies, and determining a final word segmentation according to the similarity; and

the output unit is used for sorting and outputting the final word segmentation according to the occurrence frequency of the word segmentation in the candidate text terms;

wherein the text segmentation unit comprises:

the first segmentation unit is used for separating punctuation and space in the candidate text terms to obtain a plurality of text parts, and removing the punctuation and space in the text parts to obtain a plurality of text sentences to be filtered; and

the second segmentation unit is used for judging whether characters in each text sentence to be filtered are professional segmentation words in the electric power industry, if so, extracting all the same characters in the text sentence and segmenting the same characters into words, wherein the segmentation into words is to segment the characters and the characters after the characters together to obtain candidate text sentences; if not, extracting all the same characters in the text sentence and discarding the same characters;

the word segmentation unit is specifically used for extracting words corresponding to vocabulary in the dictionary database in the candidate text sentence to obtain segmented words; the vocabulary in the dictionary database is the vocabulary in the special word segmentation dictionary in the electric power field.

5. The power industry oriented adaptive Chinese word segmentation system of claim 4,

the output unit includes:

6. The power industry oriented adaptive Chinese word segmentation system as recited in claim 5, wherein the output unit comprises:

7. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the power industry oriented adaptive Chinese word segmentation method of any one of claims 1-3.