CN110413997A - New word discovery method, system and readable storage medium for power industry - Google Patents
- Publication number
- CN110413997A (application number CN201910638878.0A)
- Authority
- CN
- China
- Prior art keywords
- words
- word
- participle
- power industry
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention relates to a new word discovery method for the power industry, and to a corresponding system and storage medium. The method comprises the following steps: step S1, acquiring power-related data and preprocessing it to obtain text data; step S2, performing word segmentation on the text data to obtain a plurality of segmented words; step S3, processing the segmented words with a pre-trained deep learning neural network and retrieving, in the deep learning neural network, the connected words corresponding to the segmented words, wherein a segmented word that has a corresponding connected word is determined to be a power industry feature word and the clause containing it is output; step S4, obtaining a plurality of candidate words from the segmented words and their corresponding connected words; step S5, determining whether each candidate word is a new word according to the candidate words and their corresponding connected words. The method avoids missing new words and reduces the time consumed by the calculation process.
Description
Technical field
The present invention relates to the field of language processing for the power industry, and in particular to a new word discovery method for the power industry, a corresponding system, and a computer readable storage medium.
Background art
The power industry arose in modern times with the advent of electric power, and electric power plays an indispensable role in people's lives. Advances in society, culture, science and technology, together with the development of the power industry in each region, continually change the language of the power industry, and this change is most clearly reflected in the industry's new words. Acquiring new words helps power industry information to be integrated and understood in a timely manner, and so accelerates the development of the industry. However, current new word discovery for the power industry has low completeness and is prone to omissions, so that some new words are not discovered in time. In addition, existing new word discovery only calculates the information entropy of the segmented words of the text data, which is a significant limitation and makes the calculation time-consuming. A new word discovery method for the power industry is therefore needed to solve these problems.
Summary of the invention
An object of the invention is to provide a new word discovery method for the power industry, a corresponding system, and a computer readable storage medium, so as to solve the technical problems of current new word discovery in the power industry.
To achieve the object of the invention, according to a first aspect, an embodiment of the invention provides a new word discovery method for the power industry, comprising the following steps:
Step S1: obtaining power-related data and preprocessing the power-related data to obtain text data;
Step S2: performing word segmentation on the text data to obtain a plurality of segmented words;
Step S3: processing the plurality of segmented words with a pre-trained deep learning neural network and retrieving, in the deep learning neural network, the connected words corresponding to the plurality of segmented words; if a segmented word has a corresponding connected word, determining that the segmented word is a power industry feature word and outputting the clause containing the segmented word;
Step S4: obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connected words corresponding to them;
Step S5: determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connected words corresponding to them.
Preferably, the power-related data comprises data on persons, things and knowledge related to the power industry, and takes the form of image, text or voice;
step S1 comprises: filtering out profanity, sensitive words and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data, and if the power-related data is voice data, speech recognition is performed on the voice data to extract its text data.
Preferably, step S1 comprises:
checking whether the text data, after profanity, sensitive words and stop words have been filtered out, is text data related to the power industry; specifically, the text data is searched for power-industry-related keywords; if such a keyword is present, the text data is related to the power industry; otherwise, it is not.
Preferably, step S2 comprises:
splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
Preferably, step S4 comprises:
for each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature words;
obtaining the connected words corresponding to the power industry feature words in the clause;
calculating the mutual information between each segmented word in the clause and the connected words;
saving the segmented words whose mutual information falls within a first threshold range as candidate words.
Preferably, step S5 comprises:
for any candidate word, selecting the connected word having the highest mutual information with the candidate word;
obtaining all clauses in the text data that contain the candidate word, and substituting the connected word into all of those clauses;
determining whether there is ambiguity by comparing the cross entropy between each original clause and the clause after substitution with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; otherwise, the candidate word is not a new word.
Preferably, the new words determined in step S5 are set as connected words in the deep learning neural network.
According to a second aspect, an embodiment of the invention provides a new word discovery system for the power industry, comprising:
a text data acquiring unit, configured to obtain power-related data and preprocess the power-related data to obtain text data;
a word segmentation unit, configured to perform word segmentation on the text data to obtain a plurality of segmented words;
a neural network unit, configured to process the plurality of segmented words with a pre-trained deep learning neural network and retrieve, in the deep learning neural network, the connected words corresponding to the plurality of segmented words, and, if a segmented word has a corresponding connected word, to determine that the segmented word is a power industry feature word and output the clause containing the segmented word;
a candidate word determination unit, configured to obtain a plurality of candidate words according to the plurality of segmented words and the plurality of connected words corresponding to them;
a new word determination unit, configured to determine whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connected words corresponding to them.
Preferably, the candidate word determination unit comprises:
a first computing unit, configured to calculate, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature words;
a first feature word acquiring unit, configured to obtain the connected words corresponding to the power industry feature words in the clause;
a second computing unit, configured to calculate the mutual information between each segmented word in the clause and the connected words;
a candidate word judging unit, configured to save the segmented words whose mutual information falls within a first threshold range as candidate words;
the new word determination unit comprises:
a second feature word acquiring unit, configured to select, for any candidate word, the connected word having the highest mutual information with the candidate word;
a clause acquiring unit, configured to obtain all clauses in the text data that contain the candidate word and substitute the connected word into all of those clauses;
a new word judging unit, configured to determine whether there is ambiguity by comparing the cross entropy between each original clause and the clause after substitution with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; otherwise, the candidate word is not a new word.
According to a third aspect, an embodiment of the invention provides a computer readable storage medium on which a computer program is stored, the program implementing the new word discovery method for the power industry when executed by a processor.
In the embodiments of the invention, a deep learning neural network is established. The deep learning neural network connects low-level features to high-level features and functions like a catalog, quickly linking each segmented word with its corresponding connected words and thereby speeding up retrieval. In addition, the mutual information between the feature words and each segmented word is calculated, the connected word is substituted into the original clause, and cross entropy is used to judge ambiguity, that is, whether the candidate word has an accurate relationship with the connected word. This reduces interference from words of neighboring industries and makes new word discovery more accurate. Word segmentation uses the BERT model, which makes segmentation more accurate and reduces ambiguous, misleading or erroneous segmented words.
Other features and advantages of the invention will be set forth in the description that follows, and in part will become apparent from the description or be learned by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the description, the claims and the drawings. Of course, a product or method implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below evidently show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a new word discovery method for the power industry in Embodiment 1 of the invention.
Fig. 2 is a schematic diagram of a new word discovery system for the power industry in Embodiment 2 of the invention.
Detailed description of the embodiments
Various exemplary embodiments, features and aspects of the disclosure are described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements that are functionally identical or similar. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
In addition, numerous specific details are given in the following embodiments in order to better illustrate the invention. A person skilled in the art will understand that the invention can be practiced without some of these details. In some instances, means well known to a person skilled in the art are not described in detail, so as to highlight the gist of the invention.
As shown in Fig. 1, an embodiment of the invention provides a new word discovery method for the power industry, comprising the following steps:
Step S1: obtaining power-related data and preprocessing the power-related data to obtain text data;
Step S2: performing word segmentation on the text data to obtain a plurality of segmented words;
Step S3: processing the plurality of segmented words with a pre-trained deep learning neural network and retrieving, in the deep learning neural network, the connected words corresponding to the plurality of segmented words; if a segmented word has a corresponding connected word, determining that the segmented word is a power industry feature word and outputting the clause containing the segmented word;
Step S4: obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connected words corresponding to them;
Step S5: determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connected words corresponding to them.
Specifically, pre-training the deep learning neural network comprises creating the network from imported text data from which stop words have been removed. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of the data. The deep learning neural network links several single-layer neural networks according to the relationships between the feature words. Its structure is built as follows: unsupervised learning is first used to pre-train each layer of the network, training only one layer at a time and using the training result of that layer as the input of the next higher layer; a top-down supervised algorithm is then used to adjust all layers. The idea is to stack multiple layers so that the output of one layer serves as the input of the next. The features thereby become connected to one another, so that related words, and the words connected to them, can be found quickly.
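The lookup role that the trained network plays in step S3 can be illustrated with a minimal sketch. It assumes a plain dictionary in place of the deep learning neural network; the index contents and the function name find_feature_clauses are hypothetical and do not come from the patent.

```python
# Hypothetical index: segmented word -> connected words. In the patent this
# lookup is performed by the trained deep learning neural network; a dict is
# used here only to show the control flow of step S3.
CONNECTED_WORDS = {
    "变压器": {"配电", "绕组"},  # transformer -> distribution, winding
    "电网": {"调度"},            # power grid -> dispatch
}


def find_feature_clauses(clauses, index=CONNECTED_WORDS):
    """Return (feature_words, clause) for each clause that contains at least
    one word with connected words in the index."""
    results = []
    for clause in clauses:
        # A segmented word counts as a feature word if it has connected words.
        feature_words = [word for word in clause if index.get(word)]
        if feature_words:
            results.append((feature_words, clause))
    return results
```

Each clause is given as a list of segmented words. Clauses without any feature word are dropped, which corresponds to outputting only the clauses in which a power industry feature word occurs.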
The power-related data comprises data on persons, things and knowledge related to the power industry, and takes the form of image, text or voice.
Step S1 comprises: filtering out profanity, sensitive words and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data, and if the power-related data is voice data, speech recognition is performed on the voice data to extract its text data.
Step S1 further comprises:
checking whether the text data, after profanity, sensitive words and stop words have been filtered out, is text data related to the power industry; specifically, the text data is searched for power-industry-related keywords; if such a keyword is present, the text data is related to the power industry; otherwise, it is not.
Specifically, the search for power-industry-related keywords in the text data may be performed by the pre-trained deep learning neural network, which is trained in advance to retrieve text data samples using all power-industry-related keywords as search keywords.
When text data is retrieved with the deep learning neural network, the retrieval time range runs from the time point at which the deep learning neural network was created to the time point of the current retrieval. The text data is searched with the power industry keywords to determine whether it is related to the power industry; if so, the text is copied and imported to await processing; if not, it is discarded and the next piece of text data is searched. Once the first new word obtained from the imported text data has been added to the deep learning neural network, the retrieval time range is changed to run from the time point at which the most recent new word was added to the deep learning neural network to the time point of the current retrieval.
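The filtering and relevance check of step S1 can be sketched as follows. This is a minimal illustration that assumes simple substring matching; the word lists, the keyword set and the function names are hypothetical, and the keyword search is not performed by a neural network here.

```python
# Hypothetical lists, chosen only for illustration.
BLOCKED_WORDS = {"的", "了", "是"}           # stands in for profanity, sensitive words and stop words
POWER_KEYWORDS = {"变压器", "电网", "配电"}  # transformer, power grid, power distribution


def remove_blocked(text):
    """Delete every blocked word from the text."""
    for word in BLOCKED_WORDS:
        text = text.replace(word, "")
    return text


def is_power_related(text):
    """A text is treated as power-related if it contains any keyword."""
    return any(keyword in text for keyword in POWER_KEYWORDS)


def preprocess(documents):
    """Filter each document, then keep only the power-related ones."""
    kept = []
    for doc in documents:
        cleaned = remove_blocked(doc)
        if is_power_related(cleaned):
            kept.append(cleaned)
    return kept
```

Texts that fail the keyword check are discarded, and the remaining texts are passed on to word segmentation.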
Step S2 comprises:
splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
Specifically, in this embodiment the re.split() method splits a character string into separate fields and allows multiple patterns to be specified as separators. The text fragments are then segmented into words by BERT, which uses the surrounding context to search for and determine features, improving segmentation accuracy. After segmentation, sensitive words and stop words are removed from the result.
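The clause splitting described above can be sketched with Python's re.split(). The delimiter set below is an assumption, since the patent does not list the separators it uses. The BERT-based segmenter is not reproduced: the tokenizer parameter is a hypothetical placeholder that defaults to splitting a clause into single characters.

```python
import re

# Assumed delimiter set: common Chinese and ASCII clause-ending punctuation.
CLAUSE_DELIMITERS = r"[，。！？；,.!?;]"


def split_clauses(text):
    """Split text into clauses on any delimiter and drop empty pieces."""
    pieces = re.split(CLAUSE_DELIMITERS, text)
    return [piece.strip() for piece in pieces if piece.strip()]


def segment_clause(clause, tokenizer=list, stop_words=frozenset()):
    """Segment one clause and remove stop words from the result.

    `tokenizer` is a placeholder for the BERT-based segmenter; the default
    `list` simply yields one token per character.
    """
    return [token for token in tokenizer(clause) if token not in stop_words]
```

A character class is one way of specifying several separators in a single pattern, which matches the remark that multiple patterns can be given as the separator.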
Step S4 comprises:
for each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature words;
obtaining the connected words corresponding to the power industry feature words in the clause;
calculating the mutual information between each segmented word in the clause and the connected words;
saving the segmented words whose mutual information falls within a first threshold range as candidate words.
Specifically, in this embodiment the information entropy is calculated with the C4.5 algorithm, and the segmented word with the lowest information entropy is removed. To obtain connected words from the deep learning neural network, several single-layer neural networks associated with the power industry feature word are first selected, and the feature words in these single-layer neural networks are output one by one. In this embodiment, the mutual information between each segmented word in the clause and the connected words is calculated with the mRMR feature selection algorithm. The first threshold range is preferably, but not limited to, greater than or equal to 50%, that is, segmented words whose mutual information is below 50% are removed.
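The embodiment names C4.5 and mRMR but gives no formulas, so the candidate selection of step S4 is illustrated here with substituted techniques: Shannon entropy of a frequency table, and normalized pointwise mutual information (NPMI) computed from clause co-occurrence. NPMI is chosen only because it is bounded, which makes a 0.5 cut-off comparable to the 50% threshold mentioned above. All function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of a frequency table {item: count}."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)


def normalized_pmi(word, connected, clauses):
    """NPMI of two words over clause co-occurrence, in the range [-1, 1]."""
    n = len(clauses)
    p_word = sum(word in c for c in clauses) / n
    p_conn = sum(connected in c for c in clauses) / n
    p_both = sum(word in c and connected in c for c in clauses) / n
    if p_both == 0:
        return -1.0  # the two words never share a clause
    if p_both == 1:
        return 1.0   # the two words share every clause
    return math.log2(p_both / (p_word * p_conn)) / -math.log2(p_both)


def select_candidates(words, connected, clauses, threshold=0.5):
    """Keep the words whose NPMI with the connected word reaches the threshold."""
    return [w for w in words
            if normalized_pmi(w, connected, clauses) >= threshold]
```

Each clause is a collection of segmented words. A word that always appears together with the connected word scores 1, while a word that never shares a clause with it scores -1 and is filtered out.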
Step S5 comprises:
for any candidate word, selecting the connected word having the highest mutual information with the candidate word;
obtaining all clauses in the text data that contain the candidate word, and substituting the connected word into all of those clauses;
determining whether there is ambiguity by comparing the cross entropy between each original clause and the clause after substitution with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; otherwise, the candidate word is not a new word.
Specifically, in this embodiment the cross entropy is calculated with a TensorFlow classification function.
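The decision of step S5 can be sketched in plain Python rather than TensorFlow. The sketch assumes that substitution means replacing the candidate word with the connected word, that each original and substituted clause has already been turned into a probability distribution by some model (not shown), and that the per-clause cross entropies are averaged before comparison. The threshold bounds low and high are hypothetical, since no values are given for the second threshold range.

```python
import math


def substitute(clause_tokens, candidate, connected):
    """Replace every occurrence of the candidate word with the connected word."""
    return [connected if token == candidate else token
            for token in clause_tokens]


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * ln(q_i); eps guards against log(0)."""
    return -sum(p_i * math.log(max(q_i, eps)) for p_i, q_i in zip(p, q))


def is_new_word(distribution_pairs, low, high):
    """Compare the mean cross entropy with an assumed [low, high] range.

    `distribution_pairs` holds one (original, substituted) pair of
    probability distributions per clause containing the candidate word.
    """
    mean_ce = sum(cross_entropy(p, q)
                  for p, q in distribution_pairs) / len(distribution_pairs)
    return low <= mean_ce <= high
```

A candidate is accepted only when the mean cross entropy lies inside the range, which corresponds to the cross entropy satisfying the second threshold range.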
The new words determined in step S5 are set as connected words in the deep learning neural network.
Specifically, the definition of a new word is determined from the original text data in which the new word appears, and the new word is brought into the deep learning neural network as a new connected word to improve the network; the improved deep learning neural network then retrieves the next piece of text data to continue discovering new words.
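The feedback step can be sketched as an update to a lookup table. The sketch assumes a dictionary from words to their connected words standing in for the deep learning neural network, so that "bringing a new word into the network" becomes adding an entry; the function name register_new_word is hypothetical.

```python
def register_new_word(index, new_word, connected_word):
    """Record a confirmed new word together with the connected word it was
    matched against, so that later texts can retrieve it."""
    index.setdefault(new_word, set()).add(connected_word)
    return index
```

After the update, the new word has a connected word of its own, so a subsequent lookup over the next piece of text data would treat it as a feature word.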
As shown in Fig. 2, Embodiment 2 of the invention provides a new word discovery system for the power industry, comprising:
a text data acquiring unit 1, configured to obtain power-related data and preprocess the power-related data to obtain text data;
a word segmentation unit 2, configured to perform word segmentation on the text data to obtain a plurality of segmented words;
a neural network unit 3, configured to process the plurality of segmented words with a pre-trained deep learning neural network and retrieve, in the deep learning neural network, the connected words corresponding to the plurality of segmented words, and, if a segmented word has a corresponding connected word, to determine that the segmented word is a power industry feature word and output the clause containing the segmented word;
a candidate word determination unit 4, configured to obtain a plurality of candidate words according to the plurality of segmented words and the plurality of connected words corresponding to them;
a new word determination unit 5, configured to determine whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connected words corresponding to them.
The candidate word determination unit comprises:
a first computing unit, configured to calculate, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature words;
a first feature word acquiring unit, configured to obtain the connected words corresponding to the power industry feature words in the clause;
a second computing unit, configured to calculate the mutual information between each segmented word in the clause and the connected words;
a candidate word judging unit, configured to save the segmented words whose mutual information falls within a first threshold range as candidate words.
The new word determination unit comprises:
a second feature word acquiring unit, configured to select, for any candidate word, the connected word having the highest mutual information with the candidate word;
a clause acquiring unit, configured to obtain all clauses in the text data that contain the candidate word and substitute the connected word into all of those clauses;
a new word judging unit, configured to determine whether there is ambiguity by comparing the cross entropy between each original clause and the clause after substitution with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; otherwise, the candidate word is not a new word.
It should be noted that the system of Embodiment 2 corresponds to the method of Embodiment 1 and is used to implement that method. Other aspects of the system of Embodiment 2 that are not described here can therefore be found in the description of the method of Embodiment 1 and are not repeated.
It should also be understood that the method of Embodiment 1 and the system of Embodiment 2 can be implemented in many ways, including as a process, an apparatus or a system. The method described herein may be implemented in part by program instructions that direct a processor to execute the method, the instructions being recorded on a non-transitory computer readable storage medium such as a hard disk drive, a floppy disk, an optical disc (for example a compact disc (CD) or a digital versatile disc (DVD)), or flash memory. In some embodiments, the program instructions may be stored remotely and sent over a network via an optical or electronic communication link.
Embodiment 3 of the invention provides a computer readable storage medium on which a computer program is stored, the program implementing the new word discovery method for the power industry of Embodiment 1 when executed by a processor.
Various embodiments of the invention have been described above. The description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to a person of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A new word discovery method for the power industry, characterized by comprising the following steps:
Step S1: obtaining power-related data and preprocessing the power-related data to obtain text data;
Step S2: performing word segmentation on the text data to obtain a plurality of segmented words;
Step S3: processing the plurality of segmented words with a pre-trained deep learning neural network and retrieving, in the deep learning neural network, the connected words corresponding to the plurality of segmented words; if a segmented word has a corresponding connected word, determining that the segmented word is a power industry feature word and outputting the clause containing the segmented word;
Step S4: obtaining a plurality of candidate words according to the plurality of segmented words and the plurality of connected words corresponding to them;
Step S5: determining whether the plurality of candidate words are new words according to the plurality of candidate words and the plurality of connected words corresponding to them.
2. The new word discovery method for the power industry according to claim 1, characterized in that the power-related data comprises data on persons, things and knowledge related to the power industry, and takes the form of image, text or voice;
step S1 comprises: filtering out profanity, sensitive words and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data, and if the power-related data is voice data, speech recognition is performed on the voice data to extract its text data.
3. The new word discovery method for the power industry according to claim 2, characterized in that step S1 comprises:
checking whether the text data, after profanity, sensitive words and stop words have been filtered out, is text data related to the power industry; wherein the text data is searched for power-industry-related keywords; if such a keyword is present, the text data is related to the power industry; otherwise, it is not.
4. The new word discovery method for the power industry according to claim 1, characterized in that step S2 comprises:
splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
5. The new word discovery method for the power industry according to claim 1, characterized in that step S4 includes:
for each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature words;
obtaining the connection words corresponding to the power industry feature words in the clause;
calculating the mutual information between each segmented word in the clause and the connection words;
saving the segmented words whose mutual information falls within a first threshold range as candidate words.
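The claim does not give formulas for the entropy or the mutual information. The sketch below assumes Shannon entropy over the empirical word distribution and pointwise mutual information estimated from clause-level co-occurrence. Both choices, the toy data and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(words):
    """Shannon entropy (in bits) of the empirical word distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def pmi(word, connection_word, clauses):
    """Pointwise mutual information from clause-level co-occurrence."""
    n = len(clauses)
    p_w = sum(word in c for c in clauses) / n
    p_c = sum(connection_word in c for c in clauses) / n
    p_wc = sum(word in c and connection_word in c for c in clauses) / n
    if p_wc == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(p_wc / (p_w * p_c))


def candidate_words(clauses, feature_words, connection_word, low, high):
    """Keep non-feature words whose PMI lies in the range [low, high]."""
    words = {w for c in clauses for w in c}
    words -= set(feature_words) | {connection_word}
    return sorted(w for w in words
                  if low <= pmi(w, connection_word, clauses) <= high)


# Toy clauses: "grid" plays the feature word, "load" its connection word.
clauses = [["grid", "load", "peak"], ["grid", "load", "valley"],
           ["grid", "weather"], ["market", "price"]]
remaining = [w for c in clauses for w in c if w != "grid"]
entropy = shannon_entropy(remaining)
candidates = candidate_words(clauses, {"grid"}, "load", 0.5, 2.0)
```

Here "peak" and "valley" each co-occur with "load" in one of four clauses, which gives a PMI of 1 bit and places them inside the assumed range.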
6. The new word discovery method for the power industry according to claim 1, characterized in that step S5 includes:
for any candidate word, selecting the connection word that has the highest mutual information with the candidate word;
obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all clauses that contain the candidate word;
determining whether ambiguity exists according to the result of comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.
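The claim leaves open how the cross entropy between the original and substituted clauses is computed. The sketch below assumes add-one smoothed unigram distributions over the joint vocabulary of the two clauses, and it assumes that "substituting" means replacing the candidate word with the connection word. Requiring every clause to pass the threshold test is a further assumption, as are all function names.

```python
import math
from collections import Counter


def distribution(tokens, vocab):
    """Add-one smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(p, q):
    """H(p, q) = -sum over w of p(w) * log2 q(w)."""
    return -sum(p[w] * math.log2(q[w]) for w in p)


def substitute(clause, candidate, connection_word):
    """Replace every occurrence of the candidate with the connection word."""
    return [connection_word if t == candidate else t for t in clause]


def is_new_word(candidate, connection_word, clauses, low, high):
    """Apply the threshold test to every clause containing the candidate."""
    hits = [c for c in clauses if candidate in c]
    if not hits:
        return False
    for clause in hits:
        replaced = substitute(clause, candidate, connection_word)
        vocab = sorted(set(clause) | set(replaced))
        h = cross_entropy(distribution(clause, vocab),
                          distribution(replaced, vocab))
        if not (low <= h <= high):
            return False
    return True


clauses = [["grid", "peak", "load"]]
accepted = is_new_word("peak", "load", clauses, 0.0, 2.0)
rejected = is_new_word("peak", "load", clauses, 0.0, 1.0)
```

For this toy clause the cross entropy is about 1.72 bits, so the candidate passes with the range [0, 2] and fails with the range [0, 1].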
7. The new word discovery method for the power industry according to claim 6, characterized in that the new word determined in step S5 is set as a connection word in the deep learning neural network.
8. A new word discovery system for the power industry, characterized by comprising:
a text data acquisition unit, configured to obtain power-related data and to preprocess the power-related data to obtain text data;
a word segmentation unit, configured to perform word segmentation on the text data to obtain multiple segmented words;
a neural network unit, configured to process the multiple segmented words using a pre-trained deep learning neural network, and to retrieve the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, the segmented word is determined to be a power industry feature word, and the clause containing the segmented word is output;
a candidate word determination unit, configured to obtain multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to the multiple segmented words;
a new word determination unit, configured to determine, according to the multiple candidate words and the multiple connection words corresponding to the multiple candidate words, whether the multiple candidate words are new words.
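One way to picture how the five units of claim 8 fit together is a small pipeline object whose fields are the units. This is a structural sketch only. The class name, the field names and the toy callables are hypothetical, and `connection_lookup` is a plain function standing in for the trained deep learning neural network.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class NewWordDiscoverySystem:
    acquire_text: Callable[[str], str]              # text data acquisition unit
    segment: Callable[[str], List[List[str]]]       # word segmentation unit
    connection_lookup: Callable[[str], Set[str]]    # stand-in for the neural network unit
    pick_candidates: Callable[[List[List[str]], Set[str]], List[str]]
    judge_new_word: Callable[[str, List[List[str]]], bool]

    def run(self, raw: str) -> List[str]:
        text = self.acquire_text(raw)
        clauses = self.segment(text)
        # A word with at least one connection word counts as a feature word.
        feature_words = {w for c in clauses for w in c
                         if self.connection_lookup(w)}
        # Only clauses containing a feature word are passed on.
        kept = [c for c in clauses if feature_words & set(c)]
        candidates = self.pick_candidates(kept, feature_words)
        return [w for w in candidates if self.judge_new_word(w, kept)]


# Toy units, for illustration only.
system = NewWordDiscoverySystem(
    acquire_text=str.strip,
    segment=lambda t: [s.split() for s in t.split(".") if s.strip()],
    connection_lookup=lambda w: {"load"} if w == "grid" else set(),
    pick_candidates=lambda clauses, feats: sorted(
        {w for c in clauses for w in c} - feats),
    judge_new_word=lambda w, clauses: w.startswith("micro"),
)
result = system.run(" grid microgrid load. market price. ")
```

Keeping each unit behind a callable lets the candidate and new word tests be swapped without touching the control flow.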
9. The new word discovery method for the power industry according to claim 8, characterized in that the candidate word determination unit includes:
a first computing unit, configured to calculate, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature words;
a first feature word acquisition unit, configured to obtain the connection words corresponding to the power industry feature words in the clause;
a second computing unit, configured to calculate the mutual information between each segmented word in the clause and the connection words;
a candidate word judging unit, configured to save the segmented words whose mutual information falls within a first threshold range as candidate words;
the new word determination unit includes:
a second feature word acquisition unit, configured to select, for any candidate word, the connection word that has the highest mutual information with the candidate word;
a clause acquisition unit, configured to obtain all clauses in the text data that contain the candidate word, and to substitute the connection word into all clauses that contain the candidate word;
a new word judging unit, configured to determine whether ambiguity exists according to the result of comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.
10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the new word discovery method for the power industry according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638878.0A CN110413997B (en) | 2019-07-16 | 2019-07-16 | New word discovery method, system and readable storage medium for power industry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110413997A (en) | 2019-11-05 |
CN110413997B (en) | 2023-04-07 |
Family
ID=68361554
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638878.0A Active CN110413997B (en) | 2019-07-16 | 2019-07-16 | New word discovery method, system and readable storage medium for power industry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110413997B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107577763A (en) * | 2017-09-04 | 2018-01-12 | 北京京东尚科信息技术有限公司 | Search method and device |
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN108595433A (en) * | 2018-05-02 | 2018-09-28 | 北京中电普华信息技术有限公司 | A kind of new word discovery method and device |
CN108763213A (en) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Theme feature text key word extracting method |
CN109710947A (en) * | 2019-01-22 | 2019-05-03 | 福建亿榕信息技术有限公司 | Power specialty word stock generating method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112966501A (en) * | 2021-02-22 | 2021-06-15 | 广州寄锦教育科技有限公司 | New word discovery method, system, terminal and medium |
CN112966501B (en) * | 2021-02-22 | 2023-04-11 | 广州寄锦教育科技有限公司 | New word discovery method, system, terminal and medium |
CN114974228A (en) * | 2022-05-24 | 2022-08-30 | 名日之梦(北京)科技有限公司 | Rapid voice recognition method based on hierarchical recognition |
CN117951246A (en) * | 2024-03-26 | 2024-04-30 | 中国电子科技集团公司第三十研究所 | New word discovery and application field prediction method and system for network technology |
CN117951246B (en) * | 2024-03-26 | 2024-05-28 | 中国电子科技集团公司第三十研究所 | New word discovery and application field prediction method and system for network technology |
Also Published As
Publication number | Publication date |
---|---|
CN110413997B (en) | 2023-04-07 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20240028837A1 (en) | Device and method for machine reading comprehension question and answer | |
CN110096570B (en) | Intention identification method and device applied to intelligent customer service robot | |
JP2019125343A (en) | Text processing method and apparatus based on ambiguous entity words | |
JP6335898B2 (en) | Information classification based on product recognition | |
CN106777013A (en) | Dialogue management method and apparatus | |
CN112735383A (en) | Voice signal processing method, device, equipment and storage medium | |
CN106446109A (en) | Acquiring method and device for audio file abstract | |
CN110795532A (en) | Voice information processing method and device, intelligent terminal and storage medium | |
US9852125B2 (en) | Dynamic portmanteau word semantic identification | |
CN110413997A (en) | New word discovery method, system and readable storage medium for power industry | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity | |
CN112633011B (en) | Research front edge identification method and device for fusing word semantics and word co-occurrence information | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN112784009A (en) | Subject term mining method and device, electronic equipment and storage medium | |
CN111209373A (en) | Sensitive text recognition method and device based on natural semantics | |
CN112883182A (en) | Question-answer matching method and device based on machine reading | |
US20190095525A1 (en) | Extraction of expression for natural language processing | |
CN114265931A (en) | Big data text mining-based consumer policy perception analysis method and system | |
CN107622047B (en) | Design decision knowledge extraction and expression method | |
EP3822816A1 (en) | Device and method for machine reading comprehension question and answer | |
CN113553410B (en) | Long document processing method, processing device, electronic equipment and storage medium | |
CN113590774B (en) | Event query method, device and storage medium | |
CN109783797A (en) | Abstracting method, device, equipment and the storage medium of semantic relation | |
CN116737520B (en) | Data braiding method, device and equipment for log data and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||