CN110413779A - Word vector training method and system for the power industry, and storage medium - Google Patents
- Publication number
- CN110413779A (application number CN201910638876.1A)
- Authority
- CN
- China
- Prior art keywords
- term vector
- phrase
- node
- vocabulary
- power industry
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The present invention relates to a word vector training method and system for the power industry, and a storage medium. The method includes: obtaining power-industry-related words and generating a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word; obtaining a training corpus, performing word segmentation on the training corpus to obtain a plurality of words, and assigning semantics to the plurality of words, where the power-industry-related words include the plurality of words; generating a corresponding guide word vector from the plurality of words; finding the guide node in the Huffman tree that corresponds to the guide word vector; inputting an initial word vector at the guide node and connecting the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector; and determining a target word vector from the connection path between the target leaf node and the guide node.
Description
Technical field
The present invention relates to the technical field of language processing for the power industry, and in particular to a word vector training method and system for the power industry, and a computer-readable storage medium.
Background art
With the rapid development of the national economy, the power industry has developed rapidly, and the phrase information it contains keeps growing. This phrase information needs to be trained on so that the natural language in use can be associated with the phrase information it contains. Word vectors have become a popular tool for this in natural language processing. Current word vector training methods generally take the word as the basic feature and represent each word as a binary-coded word vector.
Current word vector training methods have the following technical problems: they are prone to feature sparsity, and any two words are mutually independent, so the semantic and morphological associations between words cannot be captured correctly; moreover, when several words are trained, the entire parameter matrix has to be operated on, which increases the amount of computation and reduces training efficiency. A word vector training method for the power industry is therefore urgently needed to solve these problems.
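The sparsity and independence problem described above can be illustrated with a minimal sketch. It assumes that "binary-coded" refers to a one-hot style encoding, which the text does not state explicitly; the toy vocabulary and the helper names `one_hot` and `dot` are hypothetical.

```python
def one_hot(word, vocab):
    """Return the one-hot vector of `word` over the ordered list `vocab`."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary; a real power industry vocabulary would be far larger.
vocab = ["voltmeter", "ammeter", "transformer", "insulator"]

v1 = one_hot("voltmeter", vocab)
v2 = one_hot("ammeter", vocab)

# Each vector has a single non-zero entry (sparsity), and any two distinct
# words have a zero dot product, so no similarity between them is captured.
print(v1, v2, dot(v1, v2))
```

Under this encoding "voltmeter" and "ammeter" are exactly as unrelated as "voltmeter" and "insulator", which is the limitation the background section points to.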
Summary of the invention
The object of the present invention is to provide a word vector training method and system for the power industry, and a computer-readable storage medium, to solve the above problems of current word vector training methods.
To achieve the object of the present invention, according to a first aspect, an embodiment of the present invention provides a word vector training method for the power industry, including the following steps:
Step S1: obtain power-industry-related words and generate a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word;
Step S2: obtain a training corpus, perform word segmentation on it to obtain a plurality of words, and assign semantics to the plurality of words, where the power-industry-related words include the plurality of words;
Step S3: generate a corresponding guide word vector from the plurality of words;
Step S4: find the guide node in the Huffman tree that corresponds to the guide word vector;
Step S5: input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector;
Step S6: determine a target word vector from the connection path between the target leaf node and the guide node.
Preferably, step S2 includes: filtering out punctuation marks and stop words in the training corpus and uniformly converting the case format; then decomposing the corpus into a training corpus phrase set composed of a plurality of phrases; and finally expanding the training corpus phrase set and assigning semantics to the expanded corpus phrases;
Expanding the training corpus includes: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order, and combining them into a x-tuples (a denotes their number); then treating some of the N phrases as skippable by default, reordering the N phrases with skips, and combining them into b new x-tuples, so that the training corpus is expanded from a tuples to a+b tuples; here N is the total number of phrases obtained by decomposing the training corpus, and each x-tuple consists of x of the N phrases.
Preferably, assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to its hypernym-hyponym relations, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, clustering phrases with the same part-of-speech semantics.
Preferably, the guide word vector is the word vector corresponding to the superordinate concept (hypernym) of the plurality of words.
Preferably, step S4 includes:
when the generated guide word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, recording the first path produced by the connection and the nodes on the first path, and determining the guide node from the first path and the nodes on it.
Preferably, step S6 includes: inputting the word vectors corresponding to the plurality of words into the Huffman tree with the guide node as the origin and connecting them to leaf nodes of the Huffman tree; recording the second path produced by the connection and the nodes on the second path; and determining the target word vector from the second path, the nodes on the second path, and the guide node.
According to a second aspect, an embodiment of the present invention provides a word vector training system for the power industry, for implementing the method, including:
a tree construction unit, configured to obtain power-industry-related words and generate a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word;
a corpus processing unit, configured to obtain a training corpus, perform word segmentation on it to obtain a plurality of words, and assign semantics to the plurality of words, where the power-industry-related words include the plurality of words;
a guide word vector unit, configured to generate a corresponding guide word vector from the plurality of words;
a first node connection unit, configured to find the guide node in the Huffman tree that corresponds to the guide word vector;
a second node connection unit, configured to input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector;
a target word vector determination unit, configured to determine a target word vector from the connection path between the target leaf node and the guide node.
Preferably, the corpus processing unit includes:
a first processing subunit, configured to filter out punctuation marks and stop words in the training corpus and uniformly convert the case format, and then decompose the corpus into a training corpus phrase set composed of a plurality of phrases;
a second processing subunit, configured to expand the training corpus phrase set, where expanding the training corpus includes: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order, and combining them into a x-tuples (a denotes their number); then treating some of the N phrases as skippable by default, reordering the N phrases with skips, and combining them into b new x-tuples, so that the training corpus is expanded from a tuples to a+b tuples; here N is the total number of phrases obtained by decomposing the training corpus, and each x-tuple consists of x of the N phrases;
a third processing subunit, configured to assign semantics to the expanded corpus phrases, where assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to its hypernym-hyponym relations, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, clustering phrases with the same part-of-speech semantics.
Preferably, the guide word vector is the word vector corresponding to the superordinate concept (hypernym) of the plurality of words;
the first node connection unit is specifically configured to, when the generated guide word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, record the first path produced by the connection and the nodes on the first path, and determine the guide node from the first path and the nodes on it;
the second node connection unit is specifically configured to input the word vectors corresponding to the plurality of words into the Huffman tree with the guide node as the origin and connect them to leaf nodes of the Huffman tree, record the second path produced by the connection and the nodes on the second path, and determine the target word vector from the second path, the nodes on the second path, and the guide node.
According to a third aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the word vector training method for the power industry is implemented.
In the embodiments of the present invention, a dictionary is built by selecting and recording power-industry-related text and turning it into a vocabulary organized as a Huffman tree, so that word vector training is targeted at the power industry. The training corpus is preprocessed by filtering out punctuation marks and stop words and uniformly converting the case formats that occur in it, which reduces interference in the corpus; the training corpus is also expanded, which improves training accuracy so that the resulting word vectors better reflect the true meaning of the text. Semantics are assigned by judging the meaning of each phrase in the initial corpus from the hypernym-hyponym relations of the specified phrases, and comparing phrases with similar semantics raises the degree of association between words. A corresponding guide word vector is generated by analyzing the processed corpus phrases: a guide word vector containing part-of-speech semantics is derived in reverse from the hypernym-hyponym concept relations of the phrases' part-of-speech semantics, so that the phrases are partitioned and phrases with similar semantics are grouped under the same guide word vector. By importing the guide word vector and finding its node, phrases with similar semantics avoid repeating the same computation along the same route when nodes are searched. An initial word vector (that is, the word vector of a word to be trained) is input at the node; taking the guide node of the corresponding guide word vector as the origin, it is connected to the target leaf node of the Huffman tree, the path produced by the connection and the nodes on that path are recorded, and computation is performed only on the recorded nodes, which reduces the amount of computation and improves training efficiency.
Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description or be learned by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the description, the claims, and the drawings. Of course, a product or method implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a word vector training method for the power industry in embodiment one of the present invention.
Fig. 2 is a schematic diagram of a word vector training system for the power industry in embodiment two of the present invention.
Detailed description of the embodiments
Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
In addition, numerous specific details are given in the following embodiments to better illustrate the invention. Those skilled in the art will understand that the invention can be practiced without some of these details. In some instances, means well known to those skilled in the art are not described in detail so as to highlight the gist of the invention.
As shown in Fig. 1, an embodiment of the present invention provides a word vector training method for the power industry, including the following steps:
Step S1: obtain power-industry-related words and generate a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word;
Step S2: obtain a training corpus, perform word segmentation on it to obtain a plurality of words, and assign semantics to the plurality of words, where the power-industry-related words include the plurality of words;
Step S3: generate a corresponding guide word vector from the plurality of words;
Step S4: find the guide node in the Huffman tree that corresponds to the guide word vector;
Step S5: input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector;
Step S6: determine a target word vector from the connection path between the target leaf node and the guide node.
Specifically, a Huffman tree is defined as follows: given n weights as n leaf nodes, construct a binary tree; if the weighted path length of the tree is minimal, the binary tree is called an optimal binary tree, also known as a Huffman tree. A Huffman tree is the tree with the shortest weighted path length, and nodes with larger weights lie closer to the root. To build it, the input nodes with their weights are treated as a forest of n trees. The two nodes with the smallest weights are selected and merged into a new tree whose left and right subtrees are the original two nodes and whose weight is the sum of their weights. The two original trees are removed and the new tree is added to the forest; the two smallest trees are then selected and merged again, and so on until all trees have been merged into one. The weighted path length of the resulting tree is minimal, and its leaf nodes together represent all the words in the power industry vocabulary. The first line of the input is a number n, the number of leaf nodes; the Huffman tree is generated from these leaf nodes, each of which is assigned a corresponding weight in keeping with the definition of a Huffman tree.
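The construction just described can be sketched with a priority queue. This is a minimal sketch, not the patented implementation: the class `Node`, the functions `build_huffman` and `leaf_codes`, and the example weights are all hypothetical, and the weights are simply assumed to be given (for instance as word frequencies). The input is assumed to be non-empty.

```python
import heapq
import itertools


class Node:
    """A Huffman tree node; `word` is set only on leaf nodes."""

    def __init__(self, weight, word=None, left=None, right=None):
        self.weight = weight
        self.word = word
        self.left = left
        self.right = right


def build_huffman(weights):
    """Build a Huffman tree from a non-empty dict {word: weight}; return the root."""
    tie = itertools.count()  # tie-breaker so heapq never has to compare Node objects
    heap = [(w, next(tie), Node(w, word)) for word, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two trees with the smallest weights into a new tree.
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        merged = Node(w1 + w2, left=n1, right=n2)
        heapq.heappush(heap, (merged.weight, next(tie), merged))
    return heap[0][2]


def leaf_codes(node, prefix=""):
    """Map each leaf word to its root-to-leaf path ('0' = left, '1' = right)."""
    if node.word is not None:
        return {node.word: prefix}
    codes = {}
    codes.update(leaf_codes(node.left, prefix + "0"))
    codes.update(leaf_codes(node.right, prefix + "1"))
    return codes


# Hypothetical weights for four power industry words.
weights = {"transformer": 9, "voltmeter": 5, "ammeter": 4, "insulator": 2}
root = build_huffman(weights)
codes = leaf_codes(root)
weighted_path_length = sum(weights[w] * len(c) for w, c in codes.items())
print(codes, weighted_path_length)
```

The heaviest word ends up closest to the root, which matches the statement that nodes with larger weights lie closer to the root.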
Step S2 includes: filtering out punctuation marks and stop words in the training corpus and uniformly converting the case format; then decomposing the corpus into a training corpus phrase set composed of a plurality of phrases; and finally expanding the training corpus phrase set and assigning semantics to the expanded corpus phrases.
Expanding the training corpus includes: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order, and combining them into a x-tuples (a denotes their number); for example, if N covers phrases 1 to 20 and is divided into two x-tuples, the x-tuples are phrases 1 to 10 and phrases 11 to 20. Some of the N phrases are then treated as skippable by default and the N phrases are reordered with skips; for example, 123456 is split into 135 and 246. This yields b new x-tuples, so that the training corpus is expanded from a tuples to a+b tuples. Here N is the total number of phrases obtained by decomposing the training corpus, and each x-tuple consists of x of the N phrases.
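The preprocessing and expansion in step S2 can be sketched as follows. This is a minimal sketch under several assumptions: whitespace splitting stands in for real word segmentation (a Chinese corpus would need a proper segmenter), the stop-word list is hypothetical, "skipping" is read as taking every second phrase, and the names `preprocess`, `contiguous_tuples`, and `skip_tuples` are illustrative only.

```python
import string

STOP_WORDS = {"the", "of", "a", "is"}  # hypothetical stop-word list


def preprocess(text):
    """Lowercase the text, strip punctuation, split on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]


def contiguous_tuples(phrases, x):
    """Split the ordered phrases into consecutive, non-overlapping x-tuples."""
    return [tuple(phrases[i:i + x]) for i in range(0, len(phrases) - x + 1, x)]


def skip_tuples(phrases, x, step=2):
    """Form x-tuples from every `step`-th phrase, skipping the phrases in between."""
    result = []
    for start in range(step):
        strided = phrases[start::step]
        result.extend(contiguous_tuples(strided, x))
    return result


print(preprocess("The voltage of the Transformer is 110 kV."))

phrases = [1, 2, 3, 4, 5, 6]
original = contiguous_tuples(phrases, 3)   # the a original x-tuples
expanded = skip_tuples(phrases, 3)         # the b new x-tuples
print(original, expanded, len(original) + len(expanded))
```

With the six phrases 1 to 6 and x = 3, the skip step reproduces the 135 and 246 split given in the text, so the corpus grows from a = 2 tuples to a+b = 4 tuples.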
Assigning semantics to a phrase means assigning a relatively weighted semantic to the phrase obtained by word segmentation. Assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to its hypernym-hyponym relations, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, clustering phrases with the same part-of-speech semantics; the semantic weight is determined from the clustering result.
In a hypernym-hyponym relation, the more general word is called the hypernym of the more specific word, and the more specific word is called the hyponym of the more general word.
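Clustering phrases by their hypernym can be sketched as a simple grouping step. This is only an illustration: the hypernym table `HYPERNYM` and the function `cluster_by_hypernym` are hypothetical, and the text does not specify where the hypernym-hyponym relations come from or how semantic weights are derived from the clusters.

```python
from collections import defaultdict

# Hypothetical hypernym table: hyponym -> hypernym.
HYPERNYM = {
    "voltmeter": "electric meter",
    "ammeter": "electric meter",
    "circuit breaker": "switchgear",
    "disconnector": "switchgear",
}


def cluster_by_hypernym(phrases, hypernym_of):
    """Group phrases that share a hypernym; unknown phrases form their own cluster."""
    clusters = defaultdict(list)
    for phrase in phrases:
        label = hypernym_of.get(phrase, phrase)
        clusters[label].append(phrase)
    return dict(clusters)


clusters = cluster_by_hypernym(
    ["voltmeter", "ammeter", "disconnector", "busbar"], HYPERNYM
)
print(clusters)
```

Each cluster label then plays the role of the shared semantic that later steps use to pick a common guide word vector.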
The guide word vector is the word vector corresponding to the superordinate concept (hypernym) of the plurality of words. For example, if the two initial phrases are "voltmeter" and "ammeter", the guide vector can be "electric meter", since an electric meter covers both voltmeters and ammeters, and in the Huffman tree the node of "electric meter" sits higher than the nodes of "voltmeter" and "ammeter".
Step S4 includes:
when the generated guide word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, recording the first path produced by the connection and the nodes on the first path, and determining the guide node from the first path and the nodes on it.
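One possible reading of step S4 is that the guide node is the deepest tree node shared by the root-to-leaf paths of all words grouped under the same guide word vector. The sketch below follows that assumption only; the text does not give an explicit rule. Nodes are identified by their Huffman code prefix (the empty string is the root), and the code table `CODES` and the functions `path_nodes` and `guide_node` are hypothetical.

```python
# Hypothetical Huffman codes (root-to-leaf paths, '0' = left, '1' = right).
CODES = {"transformer": "0", "voltmeter": "10", "insulator": "110", "ammeter": "111"}


def path_nodes(code):
    """Nodes visited from the root to a leaf, each named by its code prefix."""
    return [code[:i] for i in range(len(code) + 1)]


def guide_node(codes, members):
    """Deepest node that lies on the root-to-leaf path of every member word."""
    paths = [path_nodes(codes[m]) for m in members]
    shared = set(paths[0]).intersection(*paths[1:])
    return max(shared, key=len)  # shared prefixes form a chain, so the longest is unique


print(path_nodes(CODES["voltmeter"]))
print(guide_node(CODES, ["voltmeter", "ammeter"]))
```

Recording `path_nodes` for each member corresponds to recording the first path and the nodes on it; the guide node is then read off from what those paths have in common.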
Step S6 includes: inputting the word vectors corresponding to the plurality of words into the Huffman tree with the guide node as the origin and connecting them to leaf nodes of the Huffman tree; recording the second path produced by the connection and the nodes on the second path; and determining the target word vector from the second path, the nodes on the second path, and the guide node.
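The text does not give the formulas used to turn the second path into a target word vector. As one plausible illustration, the sketch below borrows the hierarchical softmax update known from word2vec-style training and restricts it to the nodes between the guide node and the target leaf, so that nodes above the guide node are never touched. Every name here (`log_prob`, `train_step`, `theta`, the learning rate) is an assumption, not the patented procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_prob(vec, leaf_code, guide_code, theta):
    """Log-probability of reaching the leaf from the guide node given `vec`."""
    total = 0.0
    for depth in range(len(guide_code), len(leaf_code)):
        p = sigmoid(dot(theta[leaf_code[:depth]], vec))  # P(branch '1') at this node
        total += math.log(p if leaf_code[depth] == "1" else 1.0 - p)
    return total


def train_step(vec, leaf_code, guide_code, theta, lr=0.1):
    """One gradient step that only visits nodes from the guide node downwards."""
    assert leaf_code.startswith(guide_code)
    grad_vec = [0.0] * len(vec)
    visited = 0
    for depth in range(len(guide_code), len(leaf_code)):
        w = theta[leaf_code[:depth]]          # parameters of the internal node
        g = int(leaf_code[depth]) - sigmoid(dot(w, vec))
        for i in range(len(vec)):
            grad_vec[i] += g * w[i]           # accumulate gradient for the word vector
            w[i] += lr * g * vec[i]           # update the node parameters in place
        visited += 1
    new_vec = [v + lr * gv for v, gv in zip(vec, grad_vec)]
    return new_vec, visited


# Internal nodes named by code prefix; the root '' lies above the guide node '1'.
theta = {"": [0.0, 0.0], "1": [0.0, 0.0], "11": [0.0, 0.0]}
vec = [0.5, -0.5]
before = log_prob(vec, "111", "1", theta)
vec, visited = train_step(vec, "111", "1", theta)
after = log_prob(vec, "111", "1", theta)
print(visited, before, after)
```

In this toy run only the two internal nodes below the guide node are updated and the root's parameters stay untouched, which is the kind of saving the embodiment attributes to starting from the guide node.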
In this embodiment of the present invention, a dictionary is built by selecting and recording power-industry-related text and turning it into a vocabulary organized as a Huffman tree, so that word vector training is targeted at the power industry. The training corpus is preprocessed by filtering out punctuation marks and stop words and uniformly converting the case formats that occur in it, which reduces interference in the corpus; the training corpus is also expanded, which improves training accuracy so that the resulting word vectors better reflect the true meaning of the text. Semantics are assigned by judging the meaning of each phrase in the initial corpus from the hypernym-hyponym relations of the specified phrases, and comparing phrases with similar semantics raises the degree of association between words. A corresponding guide word vector is generated by analyzing the processed corpus phrases: a guide word vector containing part-of-speech semantics is derived in reverse from the hypernym-hyponym concept relations of the phrases' part-of-speech semantics, so that the phrases are partitioned and phrases with similar semantics are grouped under the same guide word vector. By importing the guide word vector and finding its node, phrases with similar semantics avoid repeating the same computation along the same route when nodes are searched. An initial word vector (that is, the word vector of a word to be trained) is input at the node; taking the guide node of the corresponding guide word vector as the origin, it is connected to the target leaf node of the Huffman tree, the path produced by the connection and the nodes on that path are recorded, and computation is performed only on the recorded nodes, which reduces the amount of computation and improves training efficiency.
As shown in Fig. 2, embodiment two of the present invention provides a word vector training system for the power industry, for implementing the method, including:
a tree construction unit 1, configured to obtain power-industry-related words and generate a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word;
a corpus processing unit 2, configured to obtain a training corpus, perform word segmentation on it to obtain a plurality of words, and assign semantics to the plurality of words, where the power-industry-related words include the plurality of words;
a guide word vector unit 3, configured to generate a corresponding guide word vector from the plurality of words;
a first node connection unit 4, configured to find the guide node in the Huffman tree that corresponds to the guide word vector;
a second node connection unit 5, configured to input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector;
a target word vector determination unit 6, configured to determine a target word vector from the connection path between the target leaf node and the guide node.
Wherein, the corpus processing unit includes:
a first processing subunit, configured to filter out punctuation marks and stop words in the training corpus and uniformly convert the case format, and then decompose the corpus into a training corpus phrase set composed of a plurality of phrases;
a second processing subunit, configured to expand the training corpus phrase set, where expanding the training corpus includes: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order, and combining them into a x-tuples (a denotes their number); then treating some of the N phrases as skippable by default, reordering the N phrases with skips, and combining them into b new x-tuples, so that the training corpus is expanded from a tuples to a+b tuples; here N is the total number of phrases obtained by decomposing the training corpus, and each x-tuple consists of x of the N phrases;
a third processing subunit, configured to assign semantics to the expanded corpus phrases, where assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to its hypernym-hyponym relations, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, clustering phrases with the same part-of-speech semantics.
The guide word vector is the word vector corresponding to the superordinate concept (hypernym) of the plurality of words;
the first node connection unit is specifically configured to, when the generated guide word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, record the first path produced by the connection and the nodes on the first path, and determine the guide node from the first path and the nodes on it;
the second node connection unit is specifically configured to input the word vectors corresponding to the plurality of words into the Huffman tree with the guide node as the origin and connect them to leaf nodes of the Huffman tree, record the second path produced by the connection and the nodes on the second path, and determine the target word vector from the second path, the nodes on the second path, and the guide node.
It should be noted that the system of embodiment two corresponds to the method of embodiment one and is used to implement it; therefore, any content of the system of embodiment two that is not described here can be found in the description of the method of embodiment one and is not repeated here.
It should also be understood that the method of embodiment one and the system of embodiment two can be implemented in many ways, including as a process, a device, or a system. The method described herein may be implemented in part by program instructions that instruct a processor to perform the method, recorded on a non-transitory computer-readable storage medium such as a hard disk drive, a floppy disk, an optical disc (for example a compact disc (CD) or a digital versatile disc (DVD)), or flash memory. In some embodiments, the program instructions may be stored remotely and sent over a network via an optical or electronic communication link.
Embodiment three of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the word vector training method for the power industry described in embodiment one is implemented.
Embodiments of the present invention have been described above. The description is illustrative, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A word vector training method for the power industry, characterized by including the following steps:
Step S1: obtain power-industry-related words and generate a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word;
Step S2: obtain a training corpus, perform word segmentation on it to obtain a plurality of words, and assign semantics to the plurality of words, where the power-industry-related words include the plurality of words;
Step S3: generate a corresponding guide word vector from the plurality of words;
Step S4: find the guide node in the Huffman tree that corresponds to the guide word vector;
Step S5: input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guide word vector;
Step S6: determine a target word vector from the connection path between the target leaf node and the guide node.
2. The word vector training method for the power industry according to claim 1, characterized in that step S2 includes: filtering out punctuation marks and stop words in the training corpus and uniformly converting the case format; then decomposing the corpus into a training corpus phrase set composed of a plurality of phrases; and finally expanding the training corpus phrase set and assigning semantics to the expanded corpus phrases;
where expanding the training corpus includes: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order, and combining them into a x-tuples (a denotes their number); then treating some of the N phrases as skippable by default, reordering the N phrases with skips, and combining them into b new x-tuples, so that the training corpus is expanded from a tuples to a+b tuples; here N is the total number of phrases obtained by decomposing the training corpus, and each x-tuple consists of x of the N phrases.
3. The word vector training method for the power industry according to claim 2, characterized in that assigning semantics to the expanded corpus phrases comprises: assigning part-of-speech semantics to the phrases according to the hypernym-hyponym relations of the phrases, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, so that phrases with the same part-of-speech semantics are clustered together.
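A minimal sketch of this labeling and clustering step is given below. The hypernym table, its labels, and the `unlabeled` fallback are hypothetical; in practice the hypernym-hyponym relations would come from a domain lexicon that the claim does not specify.

```python
from collections import defaultdict

# Hypothetical hypernym table: phrase -> superordinate concept.
HYPERNYMS = {
    "transformer": "equipment",
    "circuit breaker": "equipment",
    "overvoltage": "fault",
    "short circuit": "fault",
}


def cluster_by_label(phrases, hypernyms, unknown="unlabeled"):
    """Traverse the phrase set once, label each phrase with its hypernym,
    and group phrases that share the same label."""
    clusters = defaultdict(list)
    for phrase in phrases:
        clusters[hypernyms.get(phrase, unknown)].append(phrase)
    return dict(clusters)


clusters = cluster_by_label(
    ["transformer", "overvoltage", "circuit breaker", "dispatch"], HYPERNYMS
)
```

Phrases without an entry in the table fall into a separate group instead of being dropped, which keeps the traversal lossless.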
4. The word vector training method for the power industry according to claim 3, characterized in that the guidance word vector is the word vector corresponding to the hypernym (superordinate concept) of the plurality of words.
5. The word vector training method for the power industry according to claim 4, characterized in that step S4 comprises:
when the generated guidance word vector is input into the Huffman tree structure and connected to a leaf node of the Huffman tree structure, recording the first path generated by the connection and recording the nodes present on the first path, and determining the guiding node according to the first path and the nodes on the first path.
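The claim does not state the rule by which the guiding node is chosen from the recorded path. One possible reading, offered here purely as an assumption, is that the guiding node is the deepest internal node shared by the root-to-leaf paths of all words under the same hypernym. The sketch below identifies each node by its path prefix ('0' for left, '1' for right); the `codes` table and function names are hypothetical.

```python
def common_prefix(paths):
    """Longest prefix shared by all path strings."""
    if not paths:
        return ""
    shortest = min(paths, key=len)
    for i, branch in enumerate(shortest):
        if any(p[i] != branch for p in paths):
            return shortest[:i]
    return shortest


def guiding_node(codes, words):
    """Assumed rule: the deepest node lying on every recorded path."""
    return common_prefix([codes[w] for w in words])


# Hypothetical root-to-leaf paths in a Huffman tree.
codes = {
    "transformer": "0",
    "circuit breaker": "10",
    "insulator": "110",
    "relay": "111",
}
node_a = guiding_node(codes, ["insulator", "relay"])        # "11"
node_b = guiding_node(codes, ["circuit breaker", "relay"])  # "1"
node_c = guiding_node(codes, list(codes))                   # "" (the root)
```

Under this reading, the more closely related the words are in the tree, the deeper the guiding node sits and the shorter the remaining path to each leaf.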
6. The word vector training method for the power industry according to claim 5, characterized in that step S6 comprises: inputting the word vectors corresponding to the plurality of words into the Huffman tree structure with the guiding node as the origin and connecting them to leaf nodes of the Huffman tree structure, recording the second path generated by the connection and recording the nodes present on the second path, and determining the target word vector according to the second path, the nodes on the second path, and the guiding node.
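How the second path contributes to the target word vector is not spelled out in the claim. As a hedged illustration, the sketch below borrows the hierarchical softmax scoring used by word2vec and applies it only to the part of the path that starts at the guiding node. That choice, the node vectors, and all names are assumptions for this example, not the patented procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def path_probability(vec, guide, leaf_code, node_vectors):
    """Probability of walking from the guiding node to the leaf, scored in
    the style of word2vec hierarchical softmax.

    Nodes are identified by their path prefix; node_vectors maps the prefix
    of each internal node to its parameter vector.
    """
    if not leaf_code.startswith(guide):
        raise ValueError("leaf is not below the guiding node")
    prob = 1.0
    for depth in range(len(guide), len(leaf_code)):
        node = leaf_code[:depth]  # internal node on the second path
        p_left = sigmoid(dot(node_vectors[node], vec))
        prob *= p_left if leaf_code[depth] == "0" else 1.0 - p_left
    return prob


# Hypothetical parameters for the internal nodes below guiding node "1".
node_vectors = {"1": [0.3, -0.2], "11": [-0.5, 0.1]}
word_vec = [1.0, 2.0]
probs = {leaf: path_probability(word_vec, "1", leaf, node_vectors)
         for leaf in ("10", "110", "111")}
```

Because each internal node splits its probability between its two children, the values in `probs` sum to 1 over the leaves below the guiding node. Starting at the guiding node rather than the root means only the nodes on the second path need to be evaluated or updated.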
7. A word vector training system for the power industry, for implementing the method according to any one of claims 1-6, characterized by comprising:
a tree construction unit, for obtaining power-industry-related words and generating a vocabulary table according to the power-industry-related words, the vocabulary table being a Huffman tree structure in which each leaf node represents one word;
a corpus processing unit, for obtaining a training corpus and performing word segmentation on the training corpus to obtain a plurality of words, and assigning semantics to the plurality of words; wherein the power-industry-related words include the plurality of words;
a guidance word vector unit, for generating a corresponding guidance word vector according to the plurality of words;
a first node connection unit, for finding, in the Huffman tree structure, the guiding node corresponding to the guidance word vector;
a second node connection unit, for inputting an initial word vector to the guiding node and connecting the initial word vector to a target leaf node of the Huffman tree structure, wherein the initial word vector has the same part of speech as the guidance word vector;
a target word vector determination unit, for determining a target word vector according to the connection path between the target leaf node and the guiding node.
8. The word vector training system for the power industry according to claim 7, characterized in that the corpus processing unit comprises:
a first processing subunit, for filtering out punctuation marks and stop words in the training corpus and uniformly converting its format and size, and then further decomposing the training corpus to form a training corpus phrase set composed of a plurality of phrases;
a second processing subunit, for expanding the training corpus phrase set, wherein expanding the training corpus comprises: decomposing the training corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples; then treating some of the N phrases as skippable by default, performing skip-style ordering on the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b; wherein N is the total number of phrases into which the training corpus is decomposed, and an x-tuple is composed of x phrases among the N phrases;
a third processing subunit, for assigning semantics to the expanded corpus phrases; wherein assigning semantics to the expanded corpus phrases comprises: assigning part-of-speech semantics to the phrases according to the hypernym-hyponym relations of the phrases, then traversing the entire phrase set, and finally classifying all phrases that have been assigned semantics, so that phrases with the same part-of-speech semantics are clustered together.
9. The word vector training method for the power industry according to claim 8, characterized in that the guidance word vector is the word vector corresponding to the hypernym (superordinate concept) of the plurality of words;
wherein the first node connection unit is specifically configured to, when the generated guidance word vector is input into the Huffman tree structure and connected to a leaf node of the Huffman tree structure, record the first path generated by the connection, record the nodes present on the first path, and determine the guiding node according to the first path and the nodes on the first path;
wherein the second node connection unit is specifically configured to input the word vectors corresponding to the plurality of words into the Huffman tree structure with the guiding node as the origin and connect them to leaf nodes of the Huffman tree structure, record the second path generated by the connection, record the nodes present on the second path, and determine the target word vector according to the second path, the nodes on the second path, and the guiding node.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the word vector training method for the power industry according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638876.1A CN110413779B (en) | 2019-07-16 | 2019-07-16 | Word vector training method, system and medium for power industry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110413779A (en) | 2019-11-05 |
CN110413779B (en) | 2022-05-03 |
Family
ID=68361538
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638876.1A Active CN110413779B (en) | 2019-07-16 | 2019-07-16 | Word vector training method, system and medium for power industry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110413779B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930318A (en) * | 2016-04-11 | 2016-09-07 | 深圳大学 | Word vector training method and system |
WO2017090051A1 (en) * | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
CN108319666A (en) * | 2018-01-19 | 2018-07-24 | 国网浙江省电力有限公司电力科学研究院 | A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion |
Non-Patent Citations (1)
Title |
---|
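Ma Cun (马存): "Research and Application of a Chinese Short-Text Clustering Algorithm Based on Word2Vec" (基于Word2Vec的中文短文本聚类算法研究与应用), China Master's Theses Full-text Database * |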
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325026A (en) * | 2020-02-18 | 2020-06-23 | 北京声智科技有限公司 | Training method and system for word vector model |
CN111325026B (en) * | 2020-02-18 | 2023-10-10 | 北京声智科技有限公司 | Training method and system for word vector model |
Also Published As
Publication number | Publication date |
---|---|
CN110413779B (en) | 2022-05-03 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |