CN110413998B - Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof - Google Patents

Publication number
CN110413998B
Authority
CN
China
Prior art keywords
word segmentation
candidate
word
text
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910638948.2A
Other languages
Chinese (zh)
Other versions
CN110413998A (en
Inventor
张云翔
饶竹一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Application filed by Shenzhen Power Supply Bureau Co Ltd
Priority to CN201910638948.2A
Publication of CN110413998A
Application granted
Publication of CN110413998B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a self-adaptive Chinese word segmentation method oriented to the power industry, and to a system and a medium thereof. The method comprises the following steps: S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented; S2, splitting the candidate text terms to obtain a plurality of candidate text sentences; S3, segmenting each candidate text sentence to obtain one or more word segments; S4, replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination, returning to S3 if ambiguity occurs, and retaining the word segment as a candidate word segment if no ambiguity occurs; S5, acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity; S6, sorting the final word segments by their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.

Description

Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
Technical Field
The invention relates to the technical field of data processing of power equipment, in particular to a self-adaptive Chinese word segmentation method and system for the power industry and a computer readable storage medium.
Background
In recent years, as networks have become increasingly widespread, the volume of text on the Internet has grown steadily and information resources have kept increasing. To retrieve and mine valuable information from this mass of resources, Internet companies have invested heavily in natural language processing. Chinese word segmentation is the basis and prerequisite of natural language processing technology; it plays an important role in information processing tasks such as information retrieval, machine translation, and information filtering, and is both a key technology and a key difficulty of information processing. To date, the national grid company has established a large number of data management systems, and the volume of business data is huge.
The following technical problem therefore exists: because each business department and each business system defines data by different rules, data from the same source often carries different names in different business systems. This multiple-source problem makes it difficult to keep data consistent across business systems.
Disclosure of Invention
The invention aims to provide a self-adaptive Chinese word segmentation method and system for the power industry, and a computer readable storage medium, so as to solve the above technical problem.
To achieve this object, according to a first aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation method for the power industry, including the following steps:
Step S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
Step S2, splitting the candidate text terms to obtain a plurality of candidate text sentences;
Step S3, segmenting each candidate text sentence to obtain one or more word segments;
Step S4, replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination; returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous;
Step S5, acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity;
Step S6, sorting the final word segments by their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.
Preferably, step S2 includes:
splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
determining whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words; if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.
Preferably, step S3 includes:
extracting, from the candidate text sentences, the words that correspond to vocabulary in a dictionary database to obtain word segments; the vocabulary in the dictionary database is the vocabulary of the word segmentation dictionary specific to the electric power field.
Preferably, step S4 includes:
when a candidate text sentence corresponds to a plurality of candidate word segments, calculating the similarity value between each candidate word segment in the candidate text sentence and one or more professional terms of the power field, and accumulating these values to obtain the similarity value of that candidate word segment;
and selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
Preferably, step S6 includes:
outputting the sorted final word segments separated by spaces, highlighting the top ten in the ranking, and hiding the remaining final word segments.
According to a second aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:
a text acquisition unit, used for acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
a text segmentation unit, used for splitting the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit, used for segmenting each candidate text sentence to obtain one or more word segments;
a first word segmentation screening unit, used for replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous;
a second word segmentation screening unit, used for acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity;
and an output unit, used for sorting the final word segments by their frequency of occurrence in the candidate text terms and outputting them.
Preferably, the text segmentation unit includes:
a first segmentation unit, used for splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
a second segmentation unit, used for determining whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words; if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.
Preferably, the word segmentation unit is specifically configured to extract, from the candidate text sentence, the words that correspond to vocabulary in the dictionary database to obtain word segments; the vocabulary in the dictionary database is the vocabulary of the word segmentation dictionary specific to the electric power field;
the output unit includes:
a similarity calculation unit, used for calculating, when a candidate text sentence corresponds to a plurality of candidate word segments, the similarity value between each candidate word segment in the candidate text sentence and one or more professional terms of the power field, and accumulating these values to obtain the similarity value of that candidate word segment;
and a final word segmentation determining unit, used for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
Preferably, the output unit includes:
a display unit, used for outputting the sorted final word segments separated by spaces, highlighting the top ten in the ranking, and hiding the remaining final word segments.
According to a third aspect of the present invention, an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the power-industry-oriented adaptive Chinese word segmentation method.
In the embodiments of the invention, a word segmentation dictionary specific to the electric power field is built on the characteristics of electric power data. Candidate word segments are obtained by splitting the candidate text sentences according to the words in this dictionary and checking them for ambiguity, and the final word segments are then determined from the similarity between the candidate word segments and similar words in the dictionary. This greatly improves the accuracy of word segmentation and supports data matching analysis across business systems, which can markedly improve working efficiency and the efficiency of data use.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings. Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a self-adaptive Chinese word segmentation method for the power industry according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a self-adaptive Chinese word segmentation system for the power industry in a second embodiment of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, numerous specific details are set forth in the following examples in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail in order to not obscure the present invention.
As shown in Fig. 1, the first embodiment of the invention provides a self-adaptive Chinese word segmentation method for the power industry, which comprises the following steps:
Step S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
Step S2, splitting the candidate text terms to obtain a plurality of candidate text sentences;
Step S3, segmenting each candidate text sentence to obtain one or more word segments;
Step S4, replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination; returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous;
Step S5, acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity;
Step S6, sorting the final word segments by their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.
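The control flow of step S4 can be sketched as follows. The description does not say how synonyms are obtained or how "semantic discrimination" is performed, so both are passed in by the caller here: `synonyms` (a plain mapping) and `is_ambiguous` (a predicate) are hypothetical placeholders, and `screen_segments` is an illustrative name rather than anything defined by the patent.

```python
def screen_segments(text, segments, synonyms, is_ambiguous):
    """Sketch of step S4: substitute each segment by a synonym and test the result.

    Returns (kept, rejected). Kept segments become candidate word segments;
    rejected ones would be sent back to step S3 for re-segmentation.
    """
    kept, rejected = [], []
    for seg in segments:
        synonym = synonyms.get(seg)
        if synonym is None:
            # Assumption: a segment with no known synonym is kept unchanged.
            kept.append(seg)
            continue
        replaced = text.replace(seg, synonym)
        if is_ambiguous(text, replaced):
            rejected.append(seg)
        else:
            kept.append(seg)
    return kept, rejected


# Toy usage: the predicate accepts only paraphrases listed in `accepted`.
accepted = {"主变压器电压异常"}
kept, rejected = screen_segments(
    "主变电压异常",
    ["主变", "电压"],
    {"主变": "主变压器", "电压": "电位差"},
    lambda before, after: after not in accepted,
)
```

In a real system the predicate would be replaced by whatever semantic check is available; the sketch only shows where the decision to keep a segment or return to step S3 is made.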
Step S2 specifically includes:
splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
determining whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words; if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.
Specifically, for a text sentence to be filtered, the first character is extracted first and it is determined whether that character belongs to a professional term of the power industry. If so, all occurrences of that character in the text sentence are extracted and segmented into words; if not, all occurrences of that character are extracted and discarded. The subsequent characters are then judged in the same way until the last character of the text sentence to be filtered has been taken out, which completes the filtering and yields the candidate text sentence. To judge whether a character belongs to a professional term of the power industry, the character taken from the text sentence is compared against a pre-built word segmentation dictionary of power industry professional vocabulary and everyday vocabulary.
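A minimal sketch of the splitting and filtering in step S2 is given below, under one reading of the description: a character counts as "professional" if it occurs in any dictionary term, each distinct character is judged once, and the decision applies to all of its occurrences. The dictionary contents, the punctuation set, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary of power-industry terms (not from the patent).
POWER_TERMS = {"变压器", "断路器", "电压"}
# Assumption: a character is "professional" if it occurs in any dictionary term.
POWER_CHARS = {ch for term in POWER_TERMS for ch in term}

# Split on whitespace and a small set of ASCII and CJK punctuation marks.
_SPLIT = re.compile(r"[\s,.;:!?，。；：！？、]+")


def split_text(text):
    """Split on punctuation and spaces, dropping empty parts."""
    return [part for part in _SPLIT.split(text) if part]


def filter_sentence(sentence):
    """Keep dictionary characters and discard the rest.

    Each distinct character is judged once and the decision is reused for
    every occurrence, so repeated characters are not compared again.
    """
    decisions = {}
    for ch in sentence:
        if ch not in decisions:
            decisions[ch] = ch in POWER_CHARS
    return "".join(ch for ch in sentence if decisions[ch])


def candidate_sentences(text):
    """Run both parts of step S2 and drop sentences that end up empty."""
    filtered = (filter_sentence(s) for s in split_text(text))
    return [s for s in filtered if s]


result = candidate_sentences("变压器的电压，断路器 正常。")
```

Retained characters stay in their original order, so a character remains adjacent to the characters that follow it; this is how the sketch approximates "segmenting the character together with the characters after it".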
Step S3 includes:
extracting, from the candidate text sentences, the words that correspond to vocabulary in a dictionary database to obtain word segments; the vocabulary in the dictionary database is the vocabulary of the word segmentation dictionary specific to the electric power field.
Specifically, one candidate text sentence may contain zero or more words that correspond to vocabulary in the dictionary database, and the word segments obtained in this way may be semantically similar to one another.
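The dictionary lookup of step S3 can be illustrated with forward maximum matching. The description only states that words corresponding to dictionary vocabulary are extracted, so the matching strategy is an assumption swapped in here, as are the dictionary contents and the name `extract_segments`.

```python
# Hypothetical dictionary of power-field vocabulary (not from the patent).
POWER_DICTIONARY = {"变压器", "电压", "断路器", "高压断路器"}


def extract_segments(sentence, dictionary=POWER_DICTIONARY):
    """Extract dictionary words from `sentence` by forward maximum matching.

    At each position the longest dictionary word starting there is taken;
    characters that start no dictionary word are skipped.
    """
    max_len = max(map(len, dictionary), default=0)
    segments = []
    i = 0
    while i < len(sentence):
        match = None
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                match = piece
                break
        if match is not None:
            segments.append(match)
            i += len(match)
        else:
            i += 1
    return segments


segments = extract_segments("高压断路器变压器电压")
```

As the description notes, a sentence may yield zero word segments; in this sketch that case simply returns an empty list.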
Step S4 includes:
when a candidate text sentence corresponds to a plurality of candidate word segments, calculating the similarity value between each candidate word segment in the candidate text sentence and one or more professional terms of the power field, and accumulating these values to obtain the similarity value of that candidate word segment;
and selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
Specifically, one candidate text sentence may correspond to a plurality of candidate word segments. In this step the candidate word segments are screened by similarity value so that each candidate text sentence finally outputs only one word segment, which reduces the segmentation error rate.
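The accumulate-and-select rule described above can be sketched as follows. The description does not name a similarity measure, so the character-level ratio from Python's standard `difflib.SequenceMatcher` is used purely as a stand-in; the function names and sample terms are likewise illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the patent does not specify the measure."""
    return SequenceMatcher(None, a, b).ratio()


def accumulated_score(candidate, domain_terms):
    """Sum the candidate's similarity to every related power-field term."""
    return sum(similarity(candidate, term) for term in domain_terms)


def pick_final_segment(candidates, domain_terms):
    """Return the candidate with the highest accumulated similarity.

    `candidates` must be non-empty.
    """
    return max(candidates, key=lambda c: accumulated_score(c, domain_terms))


final = pick_final_segment(["变压", "变压器"], ["变压器", "主变压器"])
```

Because each candidate is scored against the same set of terms, the accumulated values are directly comparable and a single `max` yields the one word segment output per candidate text sentence.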
Step S6 includes:
outputting the sorted final word segments separated by spaces, highlighting the top ten in the ranking, and hiding the remaining final word segments.
Specifically, in this embodiment, the word segmentation results obtained by calculation are ranked by frequency of occurrence and output separated by spaces. The top ten ranked results are highlighted and the remaining results are hidden; when the user needs to view them, clicking the corresponding button displays the remaining results. All word segmentation results are also output to a display device in the form of a bar graph and shown to the user.
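The ranking and display rule of step S6 can be sketched with a frequency counter. The function names and the `top_n` parameter are assumptions for illustration; the bar graph output mentioned above is not covered.

```python
from collections import Counter


def rank_segments(final_segments, top_n=10):
    """Rank distinct segments by frequency and split into shown and hidden."""
    counts = Counter(final_segments)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return ranked[:top_n], ranked[top_n:]


def render(final_segments, top_n=10):
    """Return the space-separated highlighted part and the hidden remainder."""
    shown, hidden = rank_segments(final_segments, top_n)
    return " ".join(shown), hidden


sample = ["电压", "变压器", "电压", "断路器", "变压器", "电压"]
line, hidden = render(sample, top_n=2)
```

With `top_n=2` the two most frequent segments are joined by spaces for display and the remaining segment is returned separately, mirroring the "show the top ten, hide the rest" behaviour at a smaller scale.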
In this embodiment of the invention, word segmentation data are selected from a word segmentation dictionary specific to the electric power field. Separating the acquired candidate text terms into a plurality of text sentences preprocesses the text: it reduces the interference that punctuation marks and spaces cause during segmentation and improves the efficiency of text preprocessing. Each extracted character is compared against the dictionary to judge whether it belongs to a professional term of the electric power field, and this continues until the last character of the text sentence has been extracted. Because all occurrences of the same character are extracted together, a character that has already been judged does not need to be compared again, which reduces the comparison workload and makes the character-by-character judgment more efficient. The filtered candidate text sentences are then segmented, and the resulting word segments are checked for ambiguity until no ambiguous word segment remains, which reduces the ambiguity that segmentation would otherwise produce and that could mislead the user. Finally, similarity is calculated for the candidate word segments and the results are ranked and displayed, so that the user can view the word segmentation results clearly and intuitively.
As shown in Fig. 2, a second embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:
a text acquisition unit 1, used for acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
a text segmentation unit 2, used for splitting the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit 3, used for segmenting each candidate text sentence to obtain one or more word segments;
a first word segmentation screening unit 4, used for replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous;
a second word segmentation screening unit 5, used for acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity;
and an output unit 6, used for sorting the final word segments by their frequency of occurrence in the candidate text terms and outputting them.
The text segmentation unit 2 includes:
a first segmentation unit, used for splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
a second segmentation unit, used for determining whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words; if not, extracting all occurrences of that character in the text sentence and discarding them; wherein segmenting into words means grouping the character together with the characters that follow it, so as to obtain the candidate text sentences.
The word segmentation unit 3 is specifically configured to extract, from the candidate text sentence, the words that correspond to vocabulary in the dictionary database to obtain word segments; the vocabulary in the dictionary database is the vocabulary of the word segmentation dictionary specific to the electric power field;
the output unit 6 includes:
a similarity calculation unit, used for calculating, when a candidate text sentence corresponds to a plurality of candidate word segments, the similarity value between each candidate word segment in the candidate text sentence and one or more professional terms of the power field, and accumulating these values to obtain the similarity value of that candidate word segment;
and a final word segmentation determining unit, used for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
The output unit 6 further includes:
a display unit, used for outputting the sorted final word segments separated by spaces, highlighting the top ten in the ranking, and hiding the remaining final word segments.
It should be noted that the system of the second embodiment corresponds to the method of the first embodiment, and is used for implementing the method of the first embodiment, so that other undescribed contents of the system of the second embodiment can be obtained by referring to the method of the first embodiment, and are not repeated herein.
It should also be appreciated that the method of embodiment one and the system of embodiment two may be implemented in numerous ways, including as a process, an apparatus, or a system. The methods described herein may be implemented in part by program instructions for instructing a processor to perform such methods, as well as such instructions recorded on a non-transitory computer-readable storage medium such as a hard disk drive, floppy disk, optical disk (such as a Compact Disc (CD) or Digital Versatile Disc (DVD)), flash memory, and the like. In some embodiments, the program instructions may be stored remotely and transmitted over a network via optical or electronic communication links.
A third embodiment of the present invention provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the program implements the adaptive Chinese word segmentation method for the power industry of the first embodiment.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. A self-adaptive Chinese word segmentation method for the power industry, characterized by comprising the following steps:
Step S1, acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
Step S2, splitting the candidate text terms to obtain a plurality of candidate text sentences; splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered; determining whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words, wherein segmenting into words means grouping the character together with the characters that follow it to obtain a candidate text sentence; if not, extracting all occurrences of that character in the text sentence and discarding them;
Step S3, segmenting each candidate text sentence to obtain one or more word segments; extracting, from the candidate text sentence, the words that correspond to vocabulary in a dictionary database to obtain word segments; the vocabulary in the dictionary database is the vocabulary of the word segmentation dictionary specific to the electric power field;
Step S4, replacing the word segments in the candidate text terms, one by one, with words having the same meaning and performing semantic discrimination; returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segment as a candidate word segment if they are not ambiguous;
Step S5, acquiring one or more professional terms of the power field that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more professional terms, and determining a final word segment according to the similarity;
Step S6, sorting the final word segments by their frequency of occurrence in the candidate text terms, and outputting the sorted word segments.
2. The power-industry-oriented adaptive Chinese word segmentation method as in claim 1, wherein step S4 comprises:
when a candidate text sentence corresponds to a plurality of candidate word segments, calculating the similarity value between each candidate word segment in the candidate text sentence and one or more professional terms of the power field, and accumulating these values to obtain the similarity value of that candidate word segment;
and selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
3. The power-industry-oriented adaptive Chinese word segmentation method as in claim 2, wherein step S6 comprises:
outputting the sorted final word segments separated by spaces, highlighting the top ten in the ranking, and hiding the remaining final word segments.
4. An adaptive Chinese word segmentation system for the power industry, comprising:
a text acquisition unit for acquiring candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
a text segmentation unit for segmenting the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit for segmenting each candidate text sentence to obtain one or more word segments, by extracting from the candidate text sentence the words that correspond to words in a dictionary database, the words in the dictionary database being those of a dedicated power-domain word segmentation dictionary;
a first word segment screening unit for replacing the word segments in the candidate text terms, one by one, with words having the same meaning as the word segments and performing semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the word segments as candidate word segments if the text terms before and after replacement are not ambiguous;
a second word segment screening unit for obtaining one or more power-domain professional terms that are semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and the one or more power-domain professional terms, and determining a final word segment according to the similarity; and
an output unit for sorting the final word segments according to how frequently each occurs in the candidate text terms and outputting them;
wherein the text segmentation unit comprises:
a first segmentation unit for splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered; and
a second segmentation unit for judging whether a character in each text sentence to be filtered belongs to a professional word of the power industry; if so, extracting all occurrences of that character in the text sentence and segmenting them into words, wherein segmenting into words means grouping the character together with the character that follows it, to obtain candidate text sentences; if not, extracting all occurrences of that character in the text sentence and discarding them;
and wherein the word segmentation unit is specifically used for extracting from the candidate text sentence the words that correspond to words in the dictionary database to obtain word segments, the words in the dictionary database being those of the dedicated power-domain word segmentation dictionary.
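Two of the units in this claim lend themselves to a short sketch: the first segmentation unit (split at punctuation and spaces, then drop them) and the dictionary lookup of the word segmentation unit. The delimiter set, the function names, and the substring-matching strategy are all assumptions made for illustration. The claim does not specify them, and the second segmentation unit is not modelled here.

```python
import re

# Hypothetical delimiter set: whitespace plus common Chinese and ASCII punctuation.
_DELIMS = re.compile(r"[\s,。、;:!?,.;:!?]+")


def split_text(text):
    """Split at punctuation/whitespace, discarding the delimiters and empty parts."""
    return [part for part in _DELIMS.split(text) if part]


def extract_dictionary_words(sentence, dictionary):
    """Return dictionary entries found in `sentence`, ordered by first position.

    Entries starting at the same position are ordered longest first.
    """
    hits = [word for word in dictionary if word in sentence]
    return sorted(hits, key=lambda w: (sentence.find(w), -len(w)))


power_dictionary = {"变压器", "断路器", "过载"}  # toy stand-in for the domain dictionary
for sentence in split_text("主变压器过载,断路器 跳闸。"):
    print(sentence, extract_dictionary_words(sentence, power_dictionary))
```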
5. The power industry-oriented adaptive Chinese word segmentation system of claim 4, wherein the output unit comprises:
a similarity calculation unit for, when a candidate text sentence corresponds to a plurality of candidate word segments, calculating a similarity value between each candidate word segment in the candidate text sentence and the one or more power-domain professional terms, and accumulating these values to obtain the similarity value corresponding to that candidate word segment;
and a final word segment determining unit for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
6. The power industry-oriented adaptive Chinese word segmentation system of claim 5, wherein the output unit comprises:
a display unit for outputting the sorted final word segments separated by spaces, selecting the top ten in the sorted order for highlighted display, and hiding the other final word segments.
7. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the power industry-oriented adaptive Chinese word segmentation method of any one of claims 1-3.
CN201910638948.2A 2019-07-16 2019-07-16 Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof Active CN110413998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910638948.2A CN110413998B (en) 2019-07-16 2019-07-16 Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910638948.2A CN110413998B (en) 2019-07-16 2019-07-16 Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof

Publications (2)

Publication Number Publication Date
CN110413998A (en) 2019-11-05
CN110413998B (en) 2023-04-21

Family

ID=68361553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910638948.2A Active CN110413998B (en) 2019-07-16 2019-07-16 Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof

Country Status (1)

Country Link
CN (1) CN110413998B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079428B (en) * 2019-12-27 2023-09-19 北京羽扇智信息科技有限公司 Word segmentation and industry dictionary construction method and device and readable storage medium
CN112257425A (en) * 2020-09-29 2021-01-22 国网天津市电力公司 Power data analysis method and system based on data classification model
CN112926320B (en) * 2021-03-24 2022-12-27 山东亿云信息技术有限公司 Text key content intelligent extraction method and system based on subject term optimization

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077275A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Method and device for performing word segmentation based on context
CN106844326A (en) * 2015-12-04 2017-06-13 北京国双科技有限公司 A kind of method and device for obtaining word
CN107608968A (en) * 2017-09-22 2018-01-19 深圳市易图资讯股份有限公司 Chinese word cutting method, the device of text-oriented big data
CN107918604A (en) * 2017-11-13 2018-04-17 彩讯科技股份有限公司 A kind of Chinese segmenting method and device
CN109828981A (en) * 2017-11-22 2019-05-31 阿里巴巴集团控股有限公司 A kind of data processing method and calculate equipment

Also Published As

Publication number Publication date
CN110413998A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN108920467B (en) Method and device for learning word meaning of polysemous word and search result display method
CN104881458B (en) A kind of mask method and device of Web page subject
CN107463548B (en) Phrase mining method and device
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN111783518A (en) Training sample generation method and device, electronic equipment and readable storage medium
WO2008098956A1 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
US11507746B2 (en) Method and apparatus for generating context information
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN110910175A (en) Tourist ticket product portrait generation method
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN111325019A (en) Word bank updating method and device and electronic equipment
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
CN114912425A (en) Presentation generation method and device
CN113806483A (en) Data processing method and device, electronic equipment and computer program product
CN107291952B (en) Method and device for extracting meaningful strings
CN114492425B (en) Method for communicating multi-dimensional data by adopting one set of field label system
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
CN106933797B (en) Target information generation method and device
CN111310457B (en) Word mismatching recognition method and device, electronic equipment and storage medium
CN114298048A (en) Named entity identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant