CN110413997A

CN110413997A - New word discovery method, system and readable storage medium for power industry

Info

Publication number: CN110413997A
Application number: CN201910638878.0A
Authority: CN
Inventors: 张云翔; 饶竹一
Original assignee: Shenzhen Power Supply Co ltd
Current assignee: Shenzhen Power Supply Co ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2019-11-05
Anticipated expiration: 2039-07-16
Also published as: CN110413997B

Abstract

The invention relates to a new word discovery method for the power industry, together with a corresponding system and storage medium. The method comprises the following steps: step S1, acquiring power-related data and preprocessing it to obtain text data; step S2, performing word segmentation on the text data to obtain a plurality of segmented words; step S3, processing the segmented words with a pre-trained deep learning neural network and retrieving the connection words that correspond to them in the network, wherein a segmented word that has corresponding connection words is determined to be a power industry feature word and the clause containing it is output; step S4, obtaining a plurality of candidate words from the segmented words and their corresponding connection words; step S5, determining whether the candidate words are new words according to the candidate words and their corresponding connection words. The method can avoid missing new words and reduce the time consumed by the calculation.

Description

New word discovery method and system for the power industry, and readable storage medium

Technical field

The present invention relates to the technical field of language processing for the power industry, and in particular to a new word discovery method and system for the power industry and a computer-readable storage medium.

Background art

The power industry is an industry that arose in modern times after the advent of electric power, and electric power plays an indispensable role in people's lives. As society, culture, science and technology progress and the power industry develops in each region, the language of the power industry changes as well, most visibly in the form of new words. Acquiring these new words makes it possible to integrate and understand power industry information in a timely manner and so accelerates the development of the industry. However, current new word discovery for the power industry has low completeness and is prone to omissions, so some new words are not found in time. In addition, existing approaches compute only the information entropy of the segmented words of the text data, which is a significant limitation, and the calculation takes a long time. A new word discovery method for the power industry is therefore needed to solve these problems.

Summary of the invention

An object of the invention is to propose a new word discovery method and system for the power industry and a computer-readable storage medium, so as to solve the technical problems of current new word discovery approaches for the power industry.

To achieve the object of the invention, according to a first aspect of the invention, an embodiment of the invention provides a new word discovery method for the power industry, comprising the following steps:

Step S1: acquire power-related data and preprocess the power-related data to obtain text data;

Step S2: perform word segmentation on the text data to obtain a plurality of segmented words;

Step S3: process the plurality of segmented words using a pre-trained deep learning neural network and retrieve the connection words corresponding to the plurality of segmented words in the deep learning neural network; if a segmented word has corresponding connection words, determine that the segmented word is a power industry feature word and output the clause in which it is located;

Step S4: obtain a plurality of candidate words according to the plurality of segmented words and the plurality of connection words corresponding to them;

Step S5: determine whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connection words corresponding to them.
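Step S3 can be pictured as a lookup from segmented words to connection words. The following is a minimal sketch only: it assumes the trained deep learning neural network is replaced by a plain dictionary, and the names `CONNECTION_WORDS` and `find_feature_clauses` as well as the sample vocabulary are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the trained network: feature word -> connection words.
CONNECTION_WORDS: Dict[str, List[str]] = {
    "transformer": ["substation", "voltage"],
    "feeder": ["distribution", "load"],
}


def find_feature_clauses(
    clauses: List[List[str]],
    connection_words: Dict[str, List[str]],
) -> List[Tuple[str, List[str]]]:
    """Return (feature_word, clause) for each segmented word that has connection words."""
    hits = []
    for clause in clauses:
        for word in clause:
            if connection_words.get(word):  # a non-empty list marks a feature word
                hits.append((word, clause))
    return hits


clauses = [
    ["the", "transformer", "tripped"],
    ["lunch", "was", "late"],
]
hits = find_feature_clauses(clauses, CONNECTION_WORDS)
print(hits)
```

Only the first clause is returned, because it is the only one containing a word with connection words; the remaining words of that clause are what steps S4 and S5 would go on to examine.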

Preferably, the power-related data includes data on persons, things and knowledge related to the power industry, and the data takes the form of images, text or speech;

The step S1 includes: filtering out obscene words, sensitive words and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data; if the power-related data is speech data, speech recognition is performed on the speech data to extract its text data.

Preferably, the step S1 includes:

Checking whether the text data remaining after the obscene words, sensitive words and stop words have been filtered out is text data related to the power industry; specifically, the text data is searched for power industry keywords; if any are present, the text data is related to the power industry; if none are present, the text data is not related to the power industry.
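The filtering and relevance check of step S1 can be sketched as two small functions. This is an illustrative sketch under stated assumptions: the word lists `BLOCKED` and `POWER_KEYWORDS` are hypothetical placeholders, whitespace splitting stands in for real tokenization, and the image and speech branches are omitted.

```python
from typing import Iterable, List, Set


def filter_words(tokens: Iterable[str], blocked: Set[str]) -> List[str]:
    """Drop obscene, sensitive and stop words (all collected in `blocked`)."""
    return [t for t in tokens if t not in blocked]


def is_power_related(tokens: Iterable[str], keywords: Set[str]) -> bool:
    """A text counts as power-related if it contains at least one keyword."""
    return any(t in keywords for t in tokens)


BLOCKED = {"the", "a", "of"}                            # hypothetical filter list
POWER_KEYWORDS = {"substation", "transformer", "grid"}  # hypothetical keywords

tokens = "the load of a substation".split()
kept = filter_words(tokens, BLOCKED)
related = is_power_related(kept, POWER_KEYWORDS)
print(kept, related)
```

Texts for which `is_power_related` returns `False` would simply be discarded before word segmentation.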

Preferably, the step S2 includes:

Splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmentation result.

Preferably, the step S4 includes:

For each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature word;

Obtaining the connection words corresponding to the power industry feature word in the clause;

Calculating the mutual information between each segmented word in the clause and the connection words;

Saving the segmented words whose mutual information falls within a first threshold range as candidate words.
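The candidate selection in step S4 can be illustrated with a co-occurrence score. The sketch below swaps in normalized pointwise mutual information (NPMI) computed over clause co-occurrence; this is an assumption made for illustration, not the mRMR-based calculation named in the embodiment. NPMI is used because it is bounded, so a threshold such as 0.5 is meaningful. The function names and sample clauses are hypothetical.

```python
import math
from typing import List


def npmi(clauses: List[List[str]], word: str, connection: str) -> float:
    """Normalized pointwise mutual information of two words over clause co-occurrence."""
    n = len(clauses)
    n_w = sum(1 for c in clauses if word in c)
    n_c = sum(1 for c in clauses if connection in c)
    n_wc = sum(1 for c in clauses if word in c and connection in c)
    if n_wc == 0:
        return -1.0  # the two words never co-occur
    if n_wc == n:
        return 1.0   # co-occur in every clause; avoids dividing by log(1) = 0
    p_w, p_c, p_wc = n_w / n, n_c / n, n_wc / n
    return math.log(p_wc / (p_w * p_c)) / -math.log(p_wc)


def select_candidates(
    clauses: List[List[str]], connection: str, threshold: float = 0.5
) -> List[str]:
    """Keep words from clauses containing the connection word whose score meets the threshold."""
    words = {w for c in clauses if connection in c for w in c if w != connection}
    return sorted(w for w in words if npmi(clauses, w, connection) >= threshold)


clauses = [
    ["smart", "meter", "substation"],
    ["smart", "meter", "substation", "fault"],
    ["weather", "fault"],
    ["weather", "report"],
]
candidates = select_candidates(clauses, "substation")
print(candidates)
```

Here "smart" and "meter" always appear together with the connection word and are kept, while "fault" occurs equally often without it and is dropped.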

Preferably, the step S5 includes:

For any candidate word, selecting the connection word that has the highest mutual information with the candidate word;

Obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all of the clauses that contain the candidate word;

Determining whether an ambiguity exists by comparing the cross entropy between the original clause and the substituted clause with a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.

Preferably, the new words determined in step S5 are set as connection words in the deep learning neural network.

According to a second aspect of the invention, an embodiment of the invention provides a new word discovery system for the power industry, comprising:

A text data acquiring unit, for acquiring power-related data and preprocessing the power-related data to obtain text data;

A word segmentation unit, for performing word segmentation on the text data to obtain a plurality of segmented words;

A neural network unit, for processing the plurality of segmented words using a pre-trained deep learning neural network and retrieving the connection words corresponding to the plurality of segmented words in the deep learning neural network; if a segmented word has corresponding connection words, the segmented word is determined to be a power industry feature word and the clause in which it is located is output;

A candidate word determination unit, for obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connection words corresponding to them;

A new word determination unit, for determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connection words corresponding to them.

Preferably, the candidate word determination unit includes:

A first computing unit, for calculating, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature word;

A first feature word acquiring unit, for obtaining the connection words corresponding to the power industry feature word in the clause;

A second computing unit, for calculating the mutual information between each segmented word in the clause and the connection words;

A candidate word judging unit, for saving the segmented words whose mutual information falls within a first threshold range as candidate words;

The new word determination unit includes:

A second feature word acquiring unit, for selecting, for any candidate word, the connection word that has the highest mutual information with the candidate word;

A clause acquiring unit, for obtaining all clauses in the text data that contain the candidate word and substituting the connection word into all of the clauses that contain the candidate word;

A new word judging unit, for determining whether an ambiguity exists by comparing the cross entropy between the original clause and the substituted clause with a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.

According to a third aspect of the invention, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the program implementing the new word discovery method for the power industry when executed by a processor.

In the embodiments of the invention, a deep learning neural network is established that connects low-level features to high-level features and so plays a role similar to a catalogue: each segmented word is quickly linked to its corresponding connection words, which makes retrieval faster. In addition, the mutual information between the feature words and each segmented word is calculated, the connection word is substituted into the original clause, and cross entropy is used to judge ambiguity, that is, whether the candidate word has an accurate relationship with the connection word. This reduces the interference of words from related industries with new words and makes new word discovery more accurate. The BERT model is used for word segmentation, which makes segmentation more accurate and reduces ambiguous, implied-meaning and erroneous segments.

Other features and advantages of the invention are set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention can be realized and attained by the structure particularly pointed out in the description, the claims and the drawings. Of course, a product or method implementing the invention need not achieve all of the above advantages at the same time.

Brief description of the drawings

In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a new word discovery method for the power industry according to Embodiment 1 of the invention.

Fig. 2 is a schematic diagram of a new word discovery system for the power industry according to Embodiment 2 of the invention.

Detailed description of the embodiments

Various exemplary embodiments, features and aspects of the disclosure are described in detail below with reference to the drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.

In addition, numerous specific details are given in the embodiments below in order to better illustrate the invention. Those skilled in the art will understand that the invention can be practiced without some of these details. In some instances, means well known to those skilled in the art are not described in detail, so as to highlight the gist of the invention.

As shown in Fig. 1, an embodiment of the invention provides a new word discovery method for the power industry, comprising the following steps:

Specifically, the deep learning neural network is trained in advance by importing text data from which stop words have been removed and building the network from it. Deep learning combines low-level features into more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of the data. Here the deep learning neural network links several single-layer neural networks according to the relationships between the feature words. Its structure is as follows: unsupervised learning is first used to pre-train each layer, training only one layer at a time and using its training result as the input of the next higher layer; a supervised algorithm is then applied from the top down to adjust all layers. The idea is that, in a stack of layers, the output of one layer serves as the input of the next, so that the features become connected to one another and related words and their connection words can be found quickly.

Wherein, the power-related data includes data on persons, things and knowledge related to the power industry, and the data takes the form of images, text or speech;

Wherein, the step S1 includes:

Specifically, whether power industry keywords are present in the text data can be determined by retrieval with the pre-trained deep learning neural network; the network is trained in advance by retrieving text data samples with all power industry keywords used as search keywords.

Wherein, when the deep learning neural network is used to retrieve text data, the retrieval time range runs from the time at which creation of the network finished to the time of the current retrieval. Whether a piece of text data is related to the power industry is determined by searching it for power industry keywords: if it is, the text is copied and imported to await processing; if it is not, it is discarded and the next piece of text data is searched for and imported. After the first new word has been added to the deep learning neural network, the retrieval time range is changed to run from the time at which the most recent new word was added to the network to the time of the current retrieval.
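The moving retrieval time range can be sketched as a small state object. This is an illustrative sketch only; the class `RetrievalWindow`, its method names and the sample timestamps are hypothetical, and documents are assumed to carry a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class RetrievalWindow:
    # Initially the time the network was created; later the time the last new word was added.
    start: datetime

    def select(self, documents: List[Tuple[datetime, str]], now: datetime) -> List[str]:
        """Return the texts whose timestamp lies between the window start and `now`."""
        return [text for ts, text in documents if self.start <= ts <= now]

    def on_new_word_added(self, added_at: datetime) -> None:
        """Move the window start forward once a new word enters the network."""
        self.start = added_at


docs = [
    (datetime(2019, 7, 1), "old report"),
    (datetime(2019, 7, 10), "new report"),
]
window = RetrievalWindow(start=datetime(2019, 6, 1))
before = window.select(docs, now=datetime(2019, 7, 16))
window.on_new_word_added(datetime(2019, 7, 5))
after = window.select(docs, now=datetime(2019, 7, 16))
print(before, after)
```

Moving the start forward means that text already examined before the last new word was added is not retrieved again.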

Wherein, the step S2 includes:

Specifically, in this embodiment the re.split() method is used to split a string into separate fields, and multiple patterns can be specified as the separator. The text fragments are then segmented by BERT, which uses the surrounding context to determine features and so improves the accuracy of segmentation. After segmentation, sensitive words and stop words are removed from the result.
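Clause splitting with several separators at once can be sketched with a single character class passed to `re.split()`. In the sketch below, the delimiter set `CLAUSE_DELIMITERS`, the function names and the blocked-word list are assumptions, and the `tokenize` argument is a hypothetical stand-in for a BERT tokenizer (plain `str.split` is used in the example so that the code runs without extra dependencies).

```python
import re
from typing import Callable, List, Set

# Chinese and ASCII clause punctuation plus newlines, all treated as separators.
CLAUSE_DELIMITERS = r"[，。！？；,.!?;\n]+"


def split_clauses(text: str) -> List[str]:
    """Split text into clauses on any run of delimiter characters."""
    return [c.strip() for c in re.split(CLAUSE_DELIMITERS, text) if c.strip()]


def segment(
    text: str, tokenize: Callable[[str], List[str]], blocked: Set[str]
) -> List[List[str]]:
    """Tokenize each clause and drop sensitive words and stop words afterwards."""
    return [[t for t in tokenize(c) if t not in blocked] for c in split_clauses(text)]


text = "The breaker tripped; the feeder was restored. Load is normal!"
clauses = split_clauses(text)
segmented = segment(text, str.split, {"The", "the", "was", "is"})
print(clauses)
print(segmented)
```

Filtering empty strings matters here because `re.split()` returns an empty trailing field when the text ends with a separator.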

Wherein, the step S4 includes:

Specifically, in this embodiment the information entropy is calculated by the C4.5 algorithm, and the segmented word with the lowest information entropy is removed. To obtain connection words from the deep learning neural network, the several single-layer neural networks associated with the power industry feature word are selected first, and the feature words in these single-layer neural networks are output one by one. In this embodiment, the mutual information between each segmented word in the clause and the connection words is calculated by the mRMR feature selection algorithm. The first threshold range is preferably, but not limited to, greater than or equal to 50%, i.e. segmented words whose mutual information is below 50% are removed.
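The entropy-based pruning can be illustrated with plain Shannon entropy. The embodiment does not say over which distribution the entropy of a segmented word is taken, so the sketch below assumes the distribution of right-hand neighbours of each word, a common choice in new word discovery; it is not the C4.5 procedure itself, and the function names and sample clauses are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def neighbor_entropy(clauses: List[List[str]]) -> Dict[str, float]:
    """Entropy of the right-neighbour distribution of every word that has a right neighbour."""
    neighbors = defaultdict(Counter)
    for clause in clauses:
        for left, right in zip(clause, clause[1:]):
            neighbors[left][right] += 1
    return {word: shannon_entropy(c) for word, c in neighbors.items()}


clauses = [
    ["smart", "meter", "fault"],
    ["smart", "meter", "reading"],
    ["smart", "grid", "load"],
]
entropy = neighbor_entropy(clauses)
lowest = min(entropy, key=entropy.get)  # the word that would be removed
print(entropy, lowest)
```

A word that is always followed by the same neighbour has zero entropy, which suggests it is part of a fixed expression rather than a free-standing word; that is the kind of segment this pruning step would discard.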

Wherein, the step S5 includes:

Specifically, in this embodiment the cross entropy is calculated with a TensorFlow classification function.

Wherein, the new words determined in step S5 are set as connection words in the deep learning neural network.

Specifically, the meaning of a new word is determined from the original text data in which it occurs, and the new word is brought into the deep learning neural network as a new connection word to improve the network; the improved network is then used to retrieve the next piece of text data and continue discovering new words.

As shown in Fig. 2, Embodiment 2 of the invention provides a new word discovery system for the power industry, comprising:

A text data acquiring unit 1, for acquiring power-related data and preprocessing the power-related data to obtain text data;

A word segmentation unit 2, for performing word segmentation on the text data to obtain a plurality of segmented words;

A neural network unit 3, for processing the plurality of segmented words using a pre-trained deep learning neural network and retrieving the connection words corresponding to the plurality of segmented words in the deep learning neural network; if a segmented word has corresponding connection words, the segmented word is determined to be a power industry feature word and the clause in which it is located is output;

A candidate word determination unit 4, for obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connection words corresponding to them;

A new word determination unit 5, for determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connection words corresponding to them.

Wherein, the candidate word determination unit includes:

The new word determination unit includes:

It should be noted that the system of Embodiment 2 corresponds to the method of Embodiment 1 and is used to implement that method; content of the system of Embodiment 2 that is not described here can therefore be found in the description of the method of Embodiment 1 and is not repeated.

It should also be understood that the method of Embodiment 1 and the system of Embodiment 2 can be implemented in many ways, including as a process, a device or a system. The method described herein can be implemented in part by program instructions that direct a processor to execute the method and that are recorded on a non-transitory computer-readable storage medium, such as a hard disk drive, a floppy disk, an optical disc (compact disc (CD) or digital versatile disc (DVD)) or flash memory. In some embodiments, the program instructions can be stored remotely and sent over a network via an optical or electronic communication link.

Embodiment 3 of the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the new word discovery method for the power industry described in Embodiment 1.

Various embodiments of the invention have been described above. The description is exemplary rather than exhaustive, and the invention is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A new word discovery method for the power industry, characterized by comprising the following steps:

Step S3: process the plurality of segmented words using a pre-trained deep learning neural network and retrieve the connection words corresponding to the plurality of segmented words in the deep learning neural network; if a segmented word has corresponding connection words, determine that the segmented word is a power industry feature word and output the clause in which it is located;

Step S4: obtain a plurality of candidate words according to the plurality of segmented words and the plurality of connection words corresponding to them;

Step S5: determine whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connection words corresponding to them.

2. The new word discovery method for the power industry according to claim 1, characterized in that the power-related data includes data on persons, things and knowledge related to the power industry, and the data takes the form of images, text or speech;

The step S1 includes: filtering out obscene words, sensitive words and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data; if the power-related data is speech data, speech recognition is performed on the speech data to extract its text data.

3. The new word discovery method for the power industry according to claim 2, characterized in that the step S1 includes:

Checking whether the text data remaining after the obscene words, sensitive words and stop words have been filtered out is text data related to the power industry; specifically, the text data is searched for power industry keywords; if any are present, the text data is related to the power industry; if none are present, the text data is not related to the power industry.

4. The new word discovery method for the power industry according to claim 1, characterized in that the step S2 includes:

Splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmentation result.

5. The new word discovery method for the power industry according to claim 1, characterized in that the step S4 includes:

For each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature word;

6. The new word discovery method for the power industry according to claim 1, characterized in that the step S5 includes:

Obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all of the clauses that contain the candidate word;

Determining whether an ambiguity exists by comparing the cross entropy between the original clause and the substituted clause with a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.

7. The new word discovery method for the power industry according to claim 6, characterized in that the new words determined in step S5 are set as connection words in the deep learning neural network.

8. A new word discovery system for the power industry, characterized by comprising:

A text data acquiring unit, for acquiring power-related data and preprocessing the power-related data to obtain text data;

A neural network unit, for processing the plurality of segmented words using a pre-trained deep learning neural network and retrieving the connection words corresponding to the plurality of segmented words in the deep learning neural network; if a segmented word has corresponding connection words, the segmented word is determined to be a power industry feature word and the clause in which it is located is output;

A candidate word determination unit, for obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connection words corresponding to them;

A new word determination unit, for determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connection words corresponding to them.

9. The new word discovery method for the power industry according to claim 8, characterized in that the candidate word determination unit includes:

A first computing unit, for calculating, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature word;

The new word determination unit includes:

A second feature word acquiring unit, for selecting, for any candidate word, the connection word that has the highest mutual information with the candidate word;

A clause acquiring unit, for obtaining all clauses in the text data that contain the candidate word and substituting the connection word into all of the clauses that contain the candidate word;

A new word judging unit, for determining whether an ambiguity exists by comparing the cross entropy between the original clause and the substituted clause with a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the new word discovery method for the power industry according to any one of claims 1 to 7.