CN110413997A - New word discovery method and system for the power industry, and readable storage medium

Publication number
CN110413997A
Authority
CN
China
Prior art keywords
words
word
word segmentation
power industry
data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910638878.0A
Other languages
Chinese (zh)
Other versions
CN110413997B (en)
Inventor
张云翔
饶竹一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Application filed by Shenzhen Power Supply Bureau Co Ltd
Priority to CN201910638878.0A
Publication of CN110413997A
Application granted
Publication of CN110413997B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a new word discovery method and system for the power industry, and a storage medium. The method comprises: step S1, obtaining power-related data and preprocessing it to obtain text data; step S2, performing word segmentation on the text data to obtain multiple segmented words; step S3, processing the segmented words with a pre-trained deep learning neural network and retrieving, in the network, the connection words corresponding to the segmented words, wherein if a segmented word has a corresponding connection word, it is determined to be a power industry feature word and the clause containing it is output; step S4, obtaining multiple candidate words according to the segmented words and their corresponding connection words; and step S5, determining whether the candidate words are new words according to the candidate words and their corresponding connection words. The invention avoids missing new words and reduces the time consumed by the computation.

Description

New word discovery method and system for the power industry, and readable storage medium
Technical field
The present invention relates to the technical field of language processing for the power industry, and in particular to a new word discovery method and system for the power industry, and a computer-readable storage medium.
Background art
The power industry is an industry that arose after the emergence of electric power in modern times, and electricity plays an indispensable role in people's lives. As society, culture, and technology progress and the power industry develops in each region, the language of the power industry changes as well, most visibly in its new words. Acquiring new words effectively improves the integration and timely understanding of power industry information and accelerates the development of the industry. However, current new word discovery in the power industry has low completeness and is prone to omissions, so some new words are not found in time. In addition, existing new word discovery computes only the information entropy of the segmented words of the text data, which is a significant limitation, and the computation is time-consuming. A new word discovery method for the power industry is therefore needed to solve these problems.
Summary of the invention
The purpose of the present invention is to provide a new word discovery method and system for the power industry, and a computer-readable storage medium, to solve the technical problems of current new word discovery approaches in the power industry.
To achieve the purpose of the invention, according to a first aspect of the invention, an embodiment of the invention provides a new word discovery method for the power industry, comprising the following steps:
Step S1: obtain power-related data and preprocess the power-related data to obtain text data;
Step S2: perform word segmentation on the text data to obtain multiple segmented words;
Step S3: process the multiple segmented words with a pre-trained deep learning neural network, and retrieve the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, determine that the segmented word is a power industry feature word and output the clause containing it;
Step S4: obtain multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to them;
Step S5: determine whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to them.
Preferably, the power-related data includes data on people, things, and knowledge related to the power industry, in the form of images, text, or speech;
Step S1 includes: filtering out profanity, sensitive words, and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data; if the power-related data is speech data, speech recognition is performed on the speech data to extract its text data.
Preferably, step S1 includes:
Checking whether the text data, after profanity, sensitive words, and stop words have been filtered out, is text data related to the power industry; wherein the text data is searched for power-industry-related keywords; if any are present, the text data is related to the power industry; if none are present, it is not.
Preferably, step S2 includes:
Splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
Preferably, step S4 includes:
For each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature word;
Obtaining the connection words corresponding to the power industry feature word in the clause;
Calculating the mutual information between each segmented word in the clause and the connection words;
Saving the segmented words whose mutual information falls within a first threshold range as candidate words.
Preferably, step S5 includes:
For any candidate word, selecting the connection word that has the highest mutual information with the candidate word;
Obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all clauses containing the candidate word;
Determining whether there is ambiguity by comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if it does not, the candidate word is not a new word.
Preferably, the new words determined in step S5 are set as connection words in the deep learning neural network.
According to a second aspect of the invention, an embodiment of the invention provides a new word discovery system for the power industry, comprising:
A text data acquisition unit, configured to obtain power-related data and preprocess the power-related data to obtain text data;
A word segmentation unit, configured to perform word segmentation on the text data to obtain multiple segmented words;
A neural network unit, configured to process the multiple segmented words with a pre-trained deep learning neural network and retrieve the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, the segmented word is determined to be a power industry feature word and the clause containing it is output;
A candidate word determination unit, configured to obtain multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to them;
A new word determination unit, configured to determine whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to them.
Preferably, the candidate word determination unit includes:
A first calculation unit, configured to calculate, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature word;
A first feature word acquisition unit, configured to obtain the connection words corresponding to the power industry feature word in the clause;
A second calculation unit, configured to calculate the mutual information between each segmented word in the clause and the connection words;
A candidate word judging unit, configured to save the segmented words whose mutual information falls within a first threshold range as candidate words;
The new word determination unit includes:
A second feature word acquisition unit, configured to select, for any candidate word, the connection word that has the highest mutual information with the candidate word;
A clause acquisition unit, configured to obtain all clauses in the text data that contain the candidate word and substitute the connection word into all clauses containing the candidate word;
A new word judging unit, configured to determine whether there is ambiguity by comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if it does not, the candidate word is not a new word.
According to a third aspect of the invention, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the new word discovery method for the power industry.
In the embodiments of the invention, a deep learning neural network is established. The network connects low-level features to high-level features and functions like a catalog, quickly linking each segmented word to its corresponding connection words, which makes retrieval faster. The mutual information between feature words and each segmented word is calculated, the connection word is substituted into the original clause, and cross entropy is used to judge ambiguity, that is, whether the candidate word has an accurate relationship with the connection word. This reduces the interference of near-industry words with new words and makes new word discovery more accurate. In addition, the latest BERT model is used for word segmentation, which makes segmentation more accurate and reduces ambiguous, double-meaning, and erroneous segmentations.
Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the description, the claims, and the drawings. Of course, any product or method implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a new word discovery method for the power industry in Embodiment 1 of the invention.
Fig. 2 is a schematic diagram of a new word discovery system for the power industry in Embodiment 2 of the invention.
Detailed description
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
In addition, numerous specific details are given in the following embodiments to better illustrate the invention. Those skilled in the art will understand that the invention can be practiced without certain of these details. In some instances, means well known to those skilled in the art are not described in detail, so as to highlight the gist of the invention.
As shown in Fig. 1, Embodiment 1 of the invention provides a new word discovery method for the power industry, comprising the following steps:
Step S1: obtain power-related data and preprocess the power-related data to obtain text data;
Step S2: perform word segmentation on the text data to obtain multiple segmented words;
Step S3: process the multiple segmented words with a pre-trained deep learning neural network, and retrieve the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, determine that the segmented word is a power industry feature word and output the clause containing it;
Step S4: obtain multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to them;
Step S5: determine whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to them.
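Step S3 can be pictured as a catalog lookup. The following is only a minimal sketch under a strong assumption: the trained network is replaced by a plain dictionary from feature words to connection words. The names CONNECTION_INDEX and find_feature_clauses, and the sample entries, are hypothetical and do not come from the patent.

```python
# Hypothetical stand-in for the trained network: feature word -> connection words.
CONNECTION_INDEX = {
    "电表": {"智能电表", "抄表"},
    "变压器": {"配电变压器"},
}


def find_feature_clauses(clauses, index=CONNECTION_INDEX):
    """Return (feature_word, clause) pairs for segmented words that have connection words.

    `clauses` is a list of clauses, each given as a list of segmented words.
    """
    hits = []
    for clause in clauses:
        for word in clause:
            # A word with at least one connection word counts as a feature word.
            if index.get(word):
                hits.append((word, clause))
    return hits


if __name__ == "__main__":
    sample = [["新型", "电表", "上线"], ["天气", "晴"]]
    for feature_word, clause in find_feature_clauses(sample):
        print(feature_word, clause)
```

A dictionary gives the same "word in, connection words out" behaviour the description attributes to the network, which is enough to exercise steps S4 and S5 in isolation.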
Specifically, pre-training the deep learning neural network includes creating the network from imported text data from which stop words have been removed. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of the data. The deep learning neural network links several single-layer neural networks according to the relationships between feature words. Its structure is built as follows: first, unsupervised learning is used to pre-train each layer, training only one layer at a time and using its training result as the input to the next higher layer; finally, a top-down supervised algorithm is used to adjust all layers. The idea is to stack multiple layers so that the output of one layer serves as the input to the next, so that the features are interconnected and associated words and their connection words can be found quickly.
Wherein, the power-related data includes data on people, things, and knowledge related to the power industry, in the form of images, text, or speech;
Step S1 includes: filtering out profanity, sensitive words, and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data; if the power-related data is speech data, speech recognition is performed on the speech data to extract its text data.
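The filtering part of step S1 might look like the sketch below. It assumes the text has already been extracted (image processing and speech recognition are out of scope here) and split into words; the three word lists and the function name filter_tokens are hypothetical placeholders, and a real deployment would load curated lexicons.

```python
# Hypothetical word lists for illustration only.
PROFANITY = {"damn"}
SENSITIVE = {"password"}
STOP_WORDS = {"the", "a", "of", "is"}


def filter_tokens(tokens):
    """Drop profanity, sensitive words, and stop words from a list of words."""
    blocked = PROFANITY | SENSITIVE | STOP_WORDS
    # Compare case-insensitively, but keep the original spelling of kept words.
    return [t for t in tokens if t.lower() not in blocked]


if __name__ == "__main__":
    words = "the transformer of the substation is overloaded".split()
    print(filter_tokens(words))
```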
Wherein, step S1 includes:
Checking whether the text data, after profanity, sensitive words, and stop words have been filtered out, is text data related to the power industry; wherein the text data is searched for power-industry-related keywords; if any are present, the text data is related to the power industry; if none are present, it is not.
Specifically, whether power-industry-related keywords are present in the text data can be determined by retrieval with the pre-trained deep learning neural network, which is trained in advance by retrieving text data samples using all power-industry-related keywords as search keywords.
Wherein, when text data is retrieved using the deep learning neural network, the retrieval time range runs from the time point at which creation of the deep learning neural network was completed to the time point of the current retrieval. Power keywords are searched to determine whether the text data is related to the power industry; if so, the text is copied and imported to await processing; if not, it is discarded and the next text data is sought. After the first new word has been added to the deep learning neural network, the retrieval time range is changed to run from the time point at which the previous new word was added to the deep learning neural network to the time point of the current retrieval.
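The relevance check above reduces to "keep a text only if it contains at least one power industry keyword". A minimal sketch follows, assuming a simple substring test in place of the network-based retrieval; the keyword tuple and both function names are hypothetical, and the time-range bookkeeping is left out.

```python
# Hypothetical keyword list; the patent does not enumerate the keywords.
POWER_KEYWORDS = ("电网", "变电站", "配电", "输电")


def is_power_related(text, keywords=POWER_KEYWORDS):
    """True if the text contains at least one power industry keyword."""
    return any(keyword in text for keyword in keywords)


def select_power_texts(texts, keywords=POWER_KEYWORDS):
    """Keep related texts for further processing and discard the rest."""
    return [t for t in texts if is_power_related(t, keywords)]


if __name__ == "__main__":
    print(select_power_texts(["变电站巡检完成", "今天天气很好"]))
```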
Wherein, step S2 includes:
Splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
Specifically, in this embodiment, the re.split() method splits a string into separate fields and allows multiple patterns to be specified as separators. The text passages are then segmented with BERT, which determines features from the surrounding context and thereby improves segmentation accuracy. After segmentation, sensitive words and stop words are removed from the result.
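The clause-splitting half of step S2 can be sketched with Python's standard re.split(), which accepts a character class so that several separators are handled in one call. The separator set and the function name split_clauses are assumptions; the patent does not list the separators, and the BERT segmentation step is not shown.

```python
import re


def split_clauses(text):
    """Split text into clauses on common Chinese and Western punctuation."""
    # The ASCII period is deliberately left out so decimals such as 10.5 stay intact.
    parts = re.split(r"[。！？；;!?\n]+", text)
    # re.split leaves an empty string after trailing punctuation; drop it.
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    sample = "变电站设备已完成巡检。配电自动化系统运行正常；无异常告警！"
    for clause in split_clauses(sample):
        print(clause)
```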
Wherein, step S4 includes:
For each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature word;
Obtaining the connection words corresponding to the power industry feature word in the clause;
Calculating the mutual information between each segmented word in the clause and the connection words;
Saving the segmented words whose mutual information falls within a first threshold range as candidate words.
Specifically, in this embodiment, the information entropy is calculated with the C4.5 algorithm, and the segmented words with the lowest information entropy are removed. When the connection words are obtained from the deep learning neural network, several single-layer neural networks associated with the power industry feature word are selected first, and the feature words in those single-layer networks are output one by one. In this embodiment, the mutual information between each segmented word in the clause and the connection words is calculated with the mRMR feature selection algorithm. The first threshold range is preferably, but not limited to, greater than or equal to 50%; that is, segmented words whose mutual information is below 50% are removed.
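The two quantities used in step S4 can be illustrated with their textbook definitions. This is a sketch under stated assumptions, not the C4.5 or mRMR implementations named above: it computes Shannon entropy over a count table and the mutual information between the presence of a word and the presence of a connection word across clauses. The function names and the sample clauses are hypothetical, and how the patent normalises mutual information to a percentage is not specified, so no threshold is applied here.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by a count mapping."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def mutual_information(word, conn, clauses):
    """Mutual information in bits between 'word present' and 'conn present'.

    `clauses` is a list of clauses, each given as a list of segmented words.
    """
    n = len(clauses)
    joint = Counter((word in c, conn in c) for c in clauses)
    mi = 0.0
    for (x, y), cnt in joint.items():
        p_xy = cnt / n
        # Marginals are obtained by summing the joint counts.
        p_x = sum(v for (a, _), v in joint.items() if a == x) / n
        p_y = sum(v for (_, b), v in joint.items() if b == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


if __name__ == "__main__":
    clauses = [
        ["智能", "电表", "抄表"],
        ["智能", "电表", "故障"],
        ["线路", "检修"],
        ["负荷", "预测"],
    ]
    print(shannon_entropy(Counter(w for c in clauses for w in c)))
    print(mutual_information("智能", "电表", clauses))
```

Words that always appear together with a connection word score high, while words whose presence is unrelated to it score near zero, which is the property the first threshold range relies on.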
Wherein, step S5 includes:
For any candidate word, selecting the connection word that has the highest mutual information with the candidate word;
Obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all clauses containing the candidate word;
Determining whether there is ambiguity by comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if it does not, the candidate word is not a new word.
Specifically, in this embodiment, the cross entropy is calculated with a TensorFlow classification function.
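The substitution test of step S5 can be illustrated without TensorFlow. The sketch below rests on assumptions that are not in the patent: each clause is represented as an add-one-smoothed unigram distribution over the shared vocabulary, and the cross entropy H(p, q) = -Σ p·log q is computed in plain Python. All function names are hypothetical, and the second threshold range is not specified in the source, so the comparison against it is left to the caller.

```python
import math
from collections import Counter


def clause_distribution(tokens, vocab, alpha=1.0):
    """Smoothed unigram distribution of a clause over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return [(counts[w] + alpha) / total for w in vocab]


def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats; q must be strictly positive."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def substitution_cross_entropy(clause, candidate, connection):
    """Cross entropy between a clause and the same clause with the
    candidate word replaced by the connection word."""
    substituted = [connection if t == candidate else t for t in clause]
    vocab = sorted(set(clause) | set(substituted))
    p = clause_distribution(clause, vocab)
    q = clause_distribution(substituted, vocab)
    return cross_entropy(p, q)


if __name__ == "__main__":
    clause = ["安装", "智慧电表", "设备"]
    print(substitution_cross_entropy(clause, "智慧电表", "智能电表"))
```

Smoothing keeps every probability positive, so the logarithm is always defined even when the substituted clause no longer contains the candidate word.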
Wherein, the new words determined in step S5 are set as connection words in the deep learning neural network.
Specifically, the definition of a new word is determined from the original text data in which it appears, and the new word is added to the deep learning neural network as a new connection word to improve the network; after the network has been improved, the next text data is retrieved and new word discovery continues.
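The feedback step can be sketched as an update to the same kind of lookup structure. This assumes, purely for illustration, that the network's connection words are modelled as a dictionary from feature word to a set of connection words; the function name update_connection_words and the sample entries are hypothetical.

```python
def update_connection_words(index, feature_word, new_words):
    """Register confirmed new words as connection words of a feature word.

    `index` maps each feature word to a set of connection words and is
    updated in place; it is also returned for convenience.
    """
    index.setdefault(feature_word, set()).update(new_words)
    return index


if __name__ == "__main__":
    index = {"电表": {"智能电表"}}
    update_connection_words(index, "电表", ["智慧电表"])
    print(sorted(index["电表"]))
```

Because the index is extended after each confirmed new word, later texts can match against it, which mirrors the description's loop of improving the network and then retrieving the next text.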
As shown in Fig. 2, Embodiment 2 of the invention provides a new word discovery system for the power industry, comprising:
A text data acquisition unit 1, configured to obtain power-related data and preprocess the power-related data to obtain text data;
A word segmentation unit 2, configured to perform word segmentation on the text data to obtain multiple segmented words;
A neural network unit 3, configured to process the multiple segmented words with a pre-trained deep learning neural network and retrieve the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, the segmented word is determined to be a power industry feature word and the clause containing it is output;
A candidate word determination unit 4, configured to obtain multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to them;
A new word determination unit 5, configured to determine whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to them.
Wherein, the candidate word determination unit includes:
A first calculation unit, configured to calculate, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature word;
A first feature word acquisition unit, configured to obtain the connection words corresponding to the power industry feature word in the clause;
A second calculation unit, configured to calculate the mutual information between each segmented word in the clause and the connection words;
A candidate word judging unit, configured to save the segmented words whose mutual information falls within a first threshold range as candidate words;
The new word determination unit includes:
A second feature word acquisition unit, configured to select, for any candidate word, the connection word that has the highest mutual information with the candidate word;
A clause acquisition unit, configured to obtain all clauses in the text data that contain the candidate word and substitute the connection word into all clauses containing the candidate word;
A new word judging unit, configured to determine whether there is ambiguity by comparing the cross entropy between the original clause and the substituted clause with a second threshold range; if the cross entropy falls within the second threshold range, the candidate word is a new word; if it does not, the candidate word is not a new word.
It should be noted that the system of Embodiment 2 corresponds to the method of Embodiment 1 and is used to implement that method. Therefore, for aspects of the system of Embodiment 2 that are not described here, refer to the method of Embodiment 1; they are not repeated here.
It should also be understood that the method of Embodiment 1 and the system of Embodiment 2 can be implemented in many ways, including as a process, an apparatus, or a system. The method described herein may be implemented in part by program instructions that direct a processor to perform the method and that are recorded on a non-transitory computer-readable storage medium, such as a hard disk drive, floppy disk, optical disc (compact disc (CD) or digital versatile disc (DVD)), or flash memory. In some embodiments, the program instructions may be stored remotely and transmitted over a network via an optical or electronic communication link.
Embodiment 3 of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the new word discovery method for the power industry described in Embodiment 1.
Various embodiments of the invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A new word discovery method for the power industry, characterized by comprising the following steps:
Step S1: obtaining power-related data and preprocessing the power-related data to obtain text data;
Step S2: performing word segmentation on the text data to obtain multiple segmented words;
Step S3: processing the multiple segmented words with a pre-trained deep learning neural network, and retrieving the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, determining that the segmented word is a power industry feature word and outputting the clause containing it;
Step S4: obtaining multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to them;
Step S5: determining whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to them.
2. The new word discovery method for the power industry according to claim 1, characterized in that the power-related data includes data on people, things, and knowledge related to the power industry, in the form of images, text, or speech;
Step S1 includes: filtering out profanity, sensitive words, and stop words from the text data; wherein, if the power-related data is image data, image processing is performed on the image data to extract its text data; if the power-related data is speech data, speech recognition is performed on the speech data to extract its text data.
3. The new word discovery method for the power industry according to claim 2, characterized in that step S1 includes:
Checking whether the text data, after profanity, sensitive words, and stop words have been filtered out, is text data related to the power industry; wherein the text data is searched for power-industry-related keywords; if any are present, the text data is related to the power industry; if none are present, it is not.
4. The new word discovery method for the power industry according to claim 1, characterized in that step S2 includes:
Splitting the text data into clauses with the re.split() method, then segmenting the text data into words with BERT, and removing sensitive words and stop words from the segmented result.
5. The new word discovery method for the power industry according to claim 1, characterized in that step S4 includes:
For each clause output by the deep learning neural network, calculating the information entropy of the remaining segmented words other than the power industry feature word;
Obtaining the connection words corresponding to the power industry feature word in the clause;
Calculating the mutual information between each segmented word in the clause and the connection words;
Saving the segmented words whose mutual information falls within a first threshold range as candidate words.
6. The new word discovery method for the power industry as claimed in claim 1, characterized in that step S5 includes:
For any candidate word, selecting the connection word that has the highest mutual information with the candidate word;
Obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all clauses that contain the candidate word;
Determining whether ambiguity exists by comparing the cross entropy between the original clause and the substituted clause against a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.
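One possible reading of the substitution test is sketched below for a single clause. The claim does not say how the cross entropy is computed, so this example assumes it is taken between add-alpha smoothed unigram distributions of the original and substituted clause, H(p, q) = -Σ p(w) log2 q(w). The function names, the smoothing constant `alpha`, and the bounds `low` and `high` are all hypothetical.

```python
import math
from collections import Counter


def cross_entropy(p_tokens, q_tokens, alpha=1.0):
    """H(p, q) in bits between add-alpha smoothed unigram distributions."""
    vocab = set(p_tokens) | set(q_tokens)
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    h = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        h -= p * math.log2(q)
    return h


def is_new_word(clause, candidate, connection_word, low, high):
    """Substitute the connection word for the candidate and threshold H(p, q)."""
    substituted = [connection_word if t == candidate else t for t in clause]
    return low <= cross_entropy(clause, substituted) <= high
```

Smoothing is needed because the substituted clause no longer contains the candidate word, which would otherwise give it zero probability and an infinite cross entropy. Applying the test across all clauses that contain the candidate word, as the claim requires, would call `is_new_word` once per clause and combine the results under a rule the claim leaves unspecified.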
7. The new word discovery method for the power industry as claimed in claim 6, characterized in that the new word determined in step S5 is set as a connection word in the deep learning neural network.
8. A new word discovery system for the power industry, characterized by comprising:
A text data acquisition unit, for obtaining power-related data and preprocessing the power-related data to obtain text data;
A word segmentation unit, for performing word segmentation on the text data to obtain multiple segmented words;
A neural network unit, for processing the multiple segmented words with a pre-trained deep learning neural network and retrieving the connection words in the deep learning neural network that correspond to the multiple segmented words; if a segmented word has a corresponding connection word, the segmented word is determined to be a power industry feature word, and the clause containing the segmented word is output;
A candidate word determination unit, for obtaining multiple candidate words according to the multiple segmented words and the multiple connection words corresponding to the multiple segmented words;
A new word determination unit, for determining whether the multiple candidate words are new words according to the multiple candidate words and the multiple connection words corresponding to the multiple candidate words.
9. The new word discovery method for the power industry as claimed in claim 8, characterized in that the candidate word determination unit includes:
A first computing unit, for calculating, for each clause output by the deep learning neural network, the information entropy of the remaining segmented words other than the power industry feature words;
A first feature word acquisition unit, for obtaining the connection words corresponding to the power industry feature words in the clause;
A second computing unit, for calculating the mutual information between each segmented word in the clause and the connection words;
A candidate word judging unit, for saving the segmented words whose mutual information falls within a first threshold range as candidate words;
The new word determination unit includes:
A second feature word acquisition unit, for selecting, for any candidate word, the connection word that has the highest mutual information with the candidate word;
A clause acquisition unit, for obtaining all clauses in the text data that contain the candidate word, and substituting the connection word into all clauses that contain the candidate word;
A new word judging unit, for determining whether ambiguity exists by comparing the cross entropy between the original clause and the substituted clause against a second threshold range: if the cross entropy falls within the second threshold range, the candidate word is a new word; if the cross entropy does not fall within the second threshold range, the candidate word is not a new word.
10. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the new word discovery method for the power industry according to any one of claims 1 to 7.
CN201910638878.0A 2019-07-16 2019-07-16 New word discovery method, system and readable storage medium for power industry Active CN110413997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910638878.0A CN110413997B (en) 2019-07-16 2019-07-16 New word discovery method, system and readable storage medium for power industry

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910638878.0A CN110413997B (en) 2019-07-16 2019-07-16 New word discovery method, system and readable storage medium for power industry

Publications (2)

Publication Number Publication Date
CN110413997A true CN110413997A (en) 2019-11-05
CN110413997B CN110413997B (en) 2023-04-07

Family

ID=68361554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910638878.0A Active CN110413997B (en) 2019-07-16 2019-07-16 New word discovery method, system and readable storage medium for power industry

Country Status (1)

Country Link
CN (1) CN110413997B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966501A (en) * 2021-02-22 2021-06-15 广州寄锦教育科技有限公司 New word discovery method, system, terminal and medium
CN114974228A (en) * 2022-05-24 2022-08-30 名日之梦(北京)科技有限公司 Rapid voice recognition method based on hierarchical recognition
CN117951246A (en) * 2024-03-26 2024-04-30 中国电子科技集团公司第三十研究所 New word discovery and application field prediction method and system for network technology

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577763A (en) * 2017-09-04 2018-01-12 北京京东尚科信息技术有限公司 Search method and device
CN107683469A (en) * 2015-12-30 2018-02-09 中国科学院深圳先进技术研究院 A kind of product classification method and device based on deep learning
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Power specialty word stock generating method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107683469A (en) * 2015-12-30 2018-02-09 中国科学院深圳先进技术研究院 A kind of product classification method and device based on deep learning
CN107577763A (en) * 2017-09-04 2018-01-12 北京京东尚科信息技术有限公司 Search method and device
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Power specialty word stock generating method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966501A (en) * 2021-02-22 2021-06-15 广州寄锦教育科技有限公司 New word discovery method, system, terminal and medium
CN112966501B (en) * 2021-02-22 2023-04-11 广州寄锦教育科技有限公司 New word discovery method, system, terminal and medium
CN114974228A (en) * 2022-05-24 2022-08-30 名日之梦(北京)科技有限公司 Rapid voice recognition method based on hierarchical recognition
CN117951246A (en) * 2024-03-26 2024-04-30 中国电子科技集团公司第三十研究所 New word discovery and application field prediction method and system for network technology
CN117951246B (en) * 2024-03-26 2024-05-28 中国电子科技集团公司第三十研究所 New word discovery and application field prediction method and system for network technology

Also Published As

Publication number Publication date
CN110413997B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US11816441B2 (en) Device and method for machine reading comprehension question and answer
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
JP2019125343A (en) Text processing method and apparatus based on ambiguous entity words
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
JP6335898B2 (en) Information classification based on product recognition
CN110795532A (en) Voice information processing method and device, intelligent terminal and storage medium
US10108602B2 (en) Dynamic portmanteau word semantic identification
CN110413997A (en) For the new word discovery method and its system of power industry, readable storage medium storing program for executing
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN115713072A (en) Relation category inference system and method based on prompt learning and context awareness
CN112883182A (en) Question-answer matching method and device based on machine reading
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN114491034B (en) Text classification method and intelligent device
CN107622047B (en) Design decision knowledge extraction and expression method
US20190095525A1 (en) Extraction of expression for natural language processing
EP3822816A1 (en) Device and method for machine reading comprehension question and answer
CN113590774B (en) Event query method, device and storage medium
CN114265931A (en) Big data text mining-based consumer policy perception analysis method and system
CN110427613B (en) Method and system for finding similar meaning words and computer readable storage medium
CN116737520B (en) Data braiding method, device and equipment for log data and storage medium
CN113553410B (en) Long document processing method, processing device, electronic equipment and storage medium
CN114036946B (en) Text feature extraction and auxiliary retrieval system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant