CN110413779A - Word vector training method, system and medium for the power industry - Google Patents

Word vector training method, system and medium for the power industry

Info

Publication number
CN110413779A
CN110413779A (application number CN201910638876.1A)
Authority
CN
China
Prior art keywords
word vector
phrase
node
vocabulary
power industry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910638876.1A
Other languages
Chinese (zh)
Other versions
CN110413779B (en)
Inventor
张云翔
饶竹一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Bureau Co Ltd filed Critical Shenzhen Power Supply Bureau Co Ltd
Priority to CN201910638876.1A priority Critical patent/CN110413779B/en
Publication of CN110413779A publication Critical patent/CN110413779A/en
Application granted granted Critical
Publication of CN110413779B publication Critical patent/CN110413779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The present invention relates to a word vector training method, system and storage medium for the power industry. The method includes: obtaining power-industry-related words and generating a vocabulary from them, where the vocabulary is organized as a Huffman tree and each leaf node of the Huffman tree represents one word; obtaining a training corpus, segmenting it into multiple words, and assigning semantics to those words, where the power-industry-related words include the multiple words; generating a corresponding guidance word vector from the multiple words; finding the guide node in the Huffman tree that corresponds to the guidance word vector; inputting an initial word vector at the guide node and connecting the initial word vector to a target leaf node of the Huffman tree, where the initial word vector has the same part of speech as the guidance word vector; and determining the target word vector from the connection path between the target leaf node and the guide node.

Description

Word vector training method, system and medium for the power industry
Technical field
The present invention relates to the field of language processing for the power industry, and in particular to a word vector training method and system for the power industry, and a computer-readable storage medium.
Background technique
With the rapid development of the national economy, the power industry has grown quickly and the phrase information it contains is constantly enriched. The phrases used in the power industry therefore need to be trained so that the natural language in use can be associated with the phrase information it contains. Word vectors have become a popular tool in natural language processing for this purpose. Current word vector training methods generally take the word as the basic feature and represent each word as a binary-coded word vector.
Current word vector training methods suffer from the following technical problems: they tend to produce sparse features, and because any two words are treated as mutually independent, the semantic and morphological associations hidden between words cannot be captured correctly. Moreover, when several words are trained, the entire parameter matrix must be operated on, which increases the amount of computation and reduces training efficiency. A word vector training method tailored to the power industry is therefore urgently needed to solve these problems.
Summary of the invention
The object of the present invention is to provide a word vector training method and system for the power industry, and a computer-readable storage medium, to solve the problems of current word vector training methods.
To achieve this object, according to a first aspect of the present invention, an embodiment of the present invention provides a word vector training method for the power industry, comprising the following steps:
Step S1: obtain power-industry-related words and generate a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
Step S2: obtain a training corpus, segment it into multiple words, and assign semantics to the multiple words; wherein the power-industry-related words include the multiple words;
Step S3: generate a corresponding guidance word vector from the multiple words;
Step S4: find the guide node in the Huffman tree that corresponds to the guidance word vector;
Step S5: input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
Step S6: determine the target word vector from the connection path between the target leaf node and the guide node.
Preferably, step S2 includes: filtering punctuation marks and stop words out of the training corpus and converting letter case to a unified format; further decomposing the corpus to form a training-corpus phrase set composed of multiple phrases; and finally expanding the training-corpus phrase set and assigning semantics to the expanded corpus phrases;
wherein expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples; then, treating intermediate phrases as skippable by default, performing skip-style ordering over the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b tuples; where N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases.
Preferably, assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics.
Preferably, the guidance word vector is the word vector corresponding to the superordinate concept of the multiple words.
Preferably, step S4 includes:
when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, recording the first path produced by the connection and the nodes on that first path, and determining the guide node from the first path and the nodes on it.
Preferably, step S6 includes: inputting the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin and connecting them to a leaf node of the Huffman tree, recording the second path produced by the connection and the nodes on that second path, and determining the target word vector from the second path, the nodes on the second path, and the guide node.
According to a second aspect of the present invention, an embodiment of the present invention provides a word vector training system for the power industry, for implementing the method, comprising:
a tree construction unit, configured to obtain power-industry-related words and generate a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
a corpus processing unit, configured to obtain a training corpus, segment it into multiple words, and assign semantics to the multiple words, wherein the power-industry-related words include the multiple words;
a guidance word vector unit, configured to generate a corresponding guidance word vector from the multiple words;
a first node connection unit, configured to find the guide node in the Huffman tree that corresponds to the guidance word vector;
a second node connection unit, configured to input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
a target word vector determination unit, configured to determine the target word vector from the connection path between the target leaf node and the guide node.
Preferably, the corpus processing unit includes:
a first processing subunit, configured to filter punctuation marks and stop words out of the training corpus, convert letter case to a unified format, and further decompose the corpus to form a training-corpus phrase set composed of multiple phrases;
a second processing subunit, configured to expand the training-corpus phrase set, wherein expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples, then treating intermediate phrases as skippable by default, performing skip-style ordering over the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b tuples; wherein N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases;
a third processing subunit, configured to assign semantics to the expanded corpus phrases; wherein assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics.
Preferably, the guidance word vector is the word vector corresponding to the superordinate concept of the multiple words;
wherein the first node connection unit is specifically configured to, when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, record the first path produced by the connection and the nodes on that first path, and determine the guide node from the first path and the nodes on it;
wherein the second node connection unit is specifically configured to input the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin, connect them to a leaf node of the Huffman tree, record the second path produced by the connection and the nodes on that second path, and determine the target word vector from the second path, the nodes on the second path, and the guide node.
According to a third aspect of the present invention, an embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the word vector training method for the power industry is implemented.
In the embodiments of the present invention, a dictionary is constructed by selecting and recording texts related to the power industry and building the vocabulary in the form of a Huffman tree, so that word vector training is carried out specifically for the power industry. The training corpus is preprocessed: punctuation marks and stop words are filtered out and the letter case appearing in the corpus is unified, which reduces interference in the corpus, and the training corpus is expanded, which improves training accuracy, so that the obtained word vectors better reflect the true meaning of the text. Semantics are assigned according to the hypernym-hyponym relations between phrases, the semantics of each phrase in the initial corpus are judged, and phrases with similar semantics are compared, which improves the degree of association between words. The processed corpus phrases are analyzed to generate corresponding guidance word vectors: from the superordinate-concept relations of the part-of-speech semantics of the corpus phrases, a guidance word vector containing the part-of-speech semantics is derived in reverse, so that the phrases are partitioned and phrases with similar semantics are grouped under the same guidance word vector. By importing the guidance word vector and finding its node, phrases with similar semantics avoid repeated computation along the same route when nodes are searched. An initial word vector (the word vector corresponding to the word to be trained) is input at that node: with the guide node corresponding to the guidance word vector as the origin, the initial word vector is connected to the target leaf node of the Huffman tree, the path produced by the connection is recorded together with the nodes on it, and computation is performed only for the recorded nodes, which reduces the amount of computation and improves training efficiency.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be learned by practicing the present invention. The objects and other advantages of the present invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings. Of course, any product or method implementing the present invention does not necessarily need to achieve all of the advantages described above at the same time.
Detailed description of the invention
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a word vector training method for the power industry according to Embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of a word vector training system for the power industry according to Embodiment 2 of the present invention.
Specific embodiment
Various exemplary embodiments, features and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise specified.
In addition, numerous specific details are given in the following embodiments to better illustrate the present invention. Those skilled in the art will understand that the present invention can also be implemented without certain of these details. In some instances, means well known to those skilled in the art are not described in detail in order to highlight the gist of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a word vector training method for the power industry, comprising the following steps:
Step S1: obtain power-industry-related words and generate a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
Step S2: obtain a training corpus, segment it into multiple words, and assign semantics to the multiple words; wherein the power-industry-related words include the multiple words;
Step S3: generate a corresponding guidance word vector from the multiple words;
Step S4: find the guide node in the Huffman tree that corresponds to the guidance word vector;
Step S5: input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
Step S6: determine the target word vector from the connection path between the target leaf node and the guide node.
Specifically, a Huffman tree takes n weights as n leaf nodes and constructs a binary tree whose weighted path length is minimal; such a binary tree is called an optimal binary tree, also known as a Huffman tree. In a Huffman tree the weighted path length is the shortest, and nodes with larger weights are closer to the root. The construction steps are: input the nodes with their different weights and regard them as a forest of n trees; first select the two nodes with the smallest weights among them and merge them into a new tree whose left and right subtrees are the original nodes and whose weight is the sum of the two nodes' weights; treat the new tree as a newly added tree and delete the original ones; again select the two smallest trees and merge them, and so on until all trees are merged, at which point the weighted path length of the tree is minimal. Each leaf node represents one word of the power-industry vocabulary. The first input is a number n indicating the number of leaf nodes from which the Huffman tree is to be generated; following the concept of a Huffman tree, each of these nodes is assigned a corresponding weight.
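As a concrete illustration only (not part of the claimed method), the following Python sketch builds such a Huffman tree from a weighted vocabulary; the example words and frequency weights are hypothetical.

import heapq
import itertools

class Node:
    def __init__(self, weight, word=None, left=None, right=None):
        self.weight = weight   # word frequency for a leaf, sum of the children otherwise
        self.word = word       # None for internal nodes
        self.left = left
        self.right = right

def build_huffman_tree(word_weights):
    # Each vocabulary word becomes one leaf; merge the two lightest trees until one tree remains.
    counter = itertools.count()          # tie-breaker so the heap never compares Node objects
    heap = [(w, next(counter), Node(w, word)) for word, w in word_weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)  # two smallest trees
        w2, _, n2 = heapq.heappop(heap)
        merged = Node(w1 + w2, left=n1, right=n2)
        heapq.heappush(heap, (w1 + w2, next(counter), merged))
    return heap[0][2]                    # root of the Huffman tree

# Hypothetical power-industry vocabulary with frequency weights
vocab = {"transformer": 40, "voltmeter": 15, "ammeter": 12, "busbar": 8, "relay": 25}
root = build_huffman_tree(vocab)

Frequent words end up closer to the root, which matches the property that nodes with larger weights have shorter paths.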
Step S2 includes: filtering punctuation marks and stop words out of the training corpus and converting letter case to a unified format; further decomposing the corpus to form a training-corpus phrase set composed of multiple phrases; and finally expanding the training-corpus phrase set and assigning semantics to the expanded corpus phrases;
Expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples; for example, if N covers phrases 1-20 and the tuples are of size 10, then a = 2 and the x-tuples are phrases 1-10 and 11-20. Then the intermediate phrases are treated as skippable by default and the N phrases are ordered in a skipping fashion, for example 1 2 3 4 5 6 is split into 1 3 5 and 2 4 6, yielding b new x-tuples, so that the training corpus is expanded from a to a+b tuples; where N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases.
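The following Python sketch is one way to read the expansion described above; it assumes that x divides N evenly and that the skip-style grouping uses a stride of 2 (as in the 1 3 5 / 2 4 6 example), neither of which is fixed by the text.

def expand_corpus(phrases, x):
    # a ordered x-tuples: consecutive blocks, e.g. phrases 1-10 and 11-20 for N=20, x=10
    n = len(phrases)
    ordered = [tuple(phrases[i:i + x]) for i in range(0, n, x) if i + x <= n]
    # b skip-style x-tuples: take every other phrase, e.g. 1 3 5 and 2 4 6 (stride of 2 assumed)
    stride = 2
    skipped = []
    for offset in range(stride):
        sub = phrases[offset::stride]
        skipped += [tuple(sub[i:i + x]) for i in range(0, len(sub), x) if i + x <= len(sub)]
    return ordered + skipped             # a + b tuples in total

tuples = expand_corpus(["phrase%d" % i for i in range(1, 21)], x=5)
print(len(tuples))                       # 4 ordered tuples + 4 skip-style tuples = 8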
Assigning semantics to phrases means assigning corresponding weighted semantics to the phrases obtained from word segmentation. Assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics; the semantic weight is determined from the clustering result.
In a hypernym-hyponym relation, the word with stronger generality is called the hypernym of the word with stronger specificity, and the word with stronger specificity is called the hyponym of the word with stronger generality.
The guidance word vector is the word vector corresponding to the superordinate concept of the multiple words. For example, if two groups of initial phrases are "voltmeter" and "ammeter", the guidance vector can correspond to "electric meter", since "electric meter" subsumes both voltmeter and ammeter, and in the Huffman tree the node height of "electric meter" is greater than the node heights of voltmeter and ammeter.
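As a non-authoritative sketch of this grouping (the hypernym table, the vector dimension and the random initialization below are hypothetical), phrases that share a superordinate concept can be mapped to a single guidance word vector:

import numpy as np

# Hypothetical hypernym table derived from hypernym-hyponym relations
hypernym_of = {"voltmeter": "electric meter", "ammeter": "electric meter",
               "busbar": "switchgear", "relay": "switchgear"}

# Hypothetical vectors for the superordinate concepts (dimension chosen arbitrarily)
rng = np.random.default_rng(0)
concept_vectors = {c: rng.normal(size=50) for c in set(hypernym_of.values())}

def guidance_vector(phrase):
    # The guidance word vector is the vector of the phrase's superordinate concept.
    return concept_vectors[hypernym_of[phrase]]

# Phrases sharing a hypernym share one guidance vector, so they are grouped together
assert guidance_vector("voltmeter") is guidance_vector("ammeter")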
Step S4 includes:
when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, recording the first path produced by the connection and the nodes on that first path, and determining the guide node from the first path and the nodes on it.
Step S6 includes: inputting the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin and connecting them to a leaf node of the Huffman tree, recording the second path produced by the connection and the nodes on that second path, and determining the target word vector from the second path, the nodes on the second path, and the guide node.
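The patent does not spell out how the vectors on the recorded path are combined into the target word vector. The sketch below, which reuses the Node class and build_huffman_tree from the Huffman-tree sketch above, assumes a hierarchical-softmax-style update in which only the parameters of the nodes on the recorded path (from the guide node down to the target leaf) are touched; the function names and the update rule are illustrative assumptions, chosen because they match the stated goal of computing only over the recorded nodes.

import numpy as np

def path_to_leaf(root, word):
    # Return the internal nodes and branching directions from root down to the leaf holding word.
    path, dirs = [], []
    def dfs(node):
        if node.word == word:
            return True
        for d, child in enumerate((node.left, node.right)):
            if child is not None:
                path.append(node); dirs.append(d)
                if dfs(child):
                    return True
                path.pop(); dirs.pop()
        return False
    dfs(root)
    return path, dirs

def train_along_path(root, guide, target_word, init_vec, lr=0.025):
    # Record the path to the target leaf and update only from the guide node onward.
    path, dirs = path_to_leaf(root, target_word)
    if guide in path:                          # restrict computation to the recorded sub-path
        k = path.index(guide)
        path, dirs = path[k:], dirs[k:]
    vec = init_vec.copy()
    for node, d in zip(path, dirs):
        if not hasattr(node, "theta"):
            node.theta = np.zeros_like(vec)    # per-node parameter vector (assumed)
        p = 1.0 / (1.0 + np.exp(-vec @ node.theta))
        g = lr * (d - p)                       # gradient of the binary decision at this node
        vec_update = g * node.theta            # use theta before it is modified
        node.theta += g * vec
        vec += vec_update
    return vec                                 # the resulting target word vector

# Usage sketch: take the guide node from the guidance word's recorded path (step S4),
# then train the initial vector of a word with the same part of speech (steps S5 and S6).
guide_path, _ = path_to_leaf(root, "voltmeter")
guide = guide_path[-1]
target_vec = train_along_path(root, guide, "ammeter",
                              np.random.default_rng(1).normal(size=50))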
By constructing a dictionary, the embodiment of the present invention selects and records texts related to the power industry and builds the vocabulary in the form of a Huffman tree, so that word vector training is carried out specifically for the power industry. By preprocessing the training corpus, punctuation marks and stop words are filtered out and the letter case present in the corpus is unified, which reduces interference in the corpus; the training corpus is also expanded, which improves training accuracy, so that the obtained word vectors better reflect the true meaning of the text. By assigning semantics according to the hypernym-hyponym relations between the specified phrases, the semantics of each phrase in the initial corpus are judged, and phrases with similar semantics are compared, which improves the degree of association between words. By analyzing the processed corpus phrases, corresponding guidance word vectors are generated: from the superordinate-concept relations of the part-of-speech semantics of the corpus phrases, a guidance word vector containing the part-of-speech semantics is derived in reverse, so that the phrases are partitioned and phrases with similar semantics are grouped under the same guidance word vector. By importing the guidance word vector and finding its node, phrases with similar semantics avoid repeated computation along the same route when nodes are searched. By inputting an initial word vector (the word vector corresponding to the word to be trained) at that node, with the guide node corresponding to the guidance word vector as the origin, the initial word vector is connected to the target leaf node of the Huffman tree, the path produced by the connection is recorded together with the nodes on it, and computation is performed only for the recorded nodes, which reduces the amount of computation and improves training efficiency.
As shown in Fig. 2, Embodiment 2 of the present invention provides a word vector training system for the power industry, for implementing the method, comprising:
a tree construction unit 1, configured to obtain power-industry-related words and generate a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
a corpus processing unit 2, configured to obtain a training corpus, segment it into multiple words, and assign semantics to the multiple words, wherein the power-industry-related words include the multiple words;
a guidance word vector unit 3, configured to generate a corresponding guidance word vector from the multiple words;
a first node connection unit 4, configured to find the guide node in the Huffman tree that corresponds to the guidance word vector;
a second node connection unit 5, configured to input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
a target word vector determination unit 6, configured to determine the target word vector from the connection path between the target leaf node and the guide node.
The corpus processing unit includes:
a first processing subunit, configured to filter punctuation marks and stop words out of the training corpus, convert letter case to a unified format, and further decompose the corpus to form a training-corpus phrase set composed of multiple phrases;
a second processing subunit, configured to expand the training-corpus phrase set, wherein expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples, then treating intermediate phrases as skippable by default, performing skip-style ordering over the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b tuples; wherein N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases;
a third processing subunit, configured to assign semantics to the expanded corpus phrases; wherein assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics.
The guidance word vector is the word vector corresponding to the superordinate concept of the multiple words;
wherein the first node connection unit is specifically configured to, when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, record the first path produced by the connection and the nodes on that first path, and determine the guide node from the first path and the nodes on it;
wherein the second node connection unit is specifically configured to input the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin, connect them to a leaf node of the Huffman tree, record the second path produced by the connection and the nodes on that second path, and determine the target word vector from the second path, the nodes on the second path, and the guide node.
It should be noted that the system described in Embodiment 2 corresponds to and is used to implement the method of Embodiment 1; therefore, other contents of the system not described in Embodiment 2 can be obtained by referring to the method of Embodiment 1 and are not repeated here.
It should also be understood that the method of Embodiment 1 and the system of Embodiment 2 can be implemented in many ways, including as a process, an apparatus or a system. The methods described herein can be implemented in part by program instructions that direct a processor to perform the methods and that are recorded on a non-transitory computer-readable storage medium, such as a hard drive, a floppy disk, an optical disc (a compact disc (CD) or a digital versatile disc (DVD)) or a flash memory. In some embodiments, the program instructions can be stored remotely and sent over a network via an optical or electronic communication link.
Embodiment 3 of the present invention provides a computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the word vector training method for the power industry described in Embodiment 1 is implemented.
The embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein are chosen to best explain the principles of the embodiments, their practical application or improvements over technologies on the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A word vector training method for the power industry, characterized by comprising the following steps:
Step S1: obtaining power-industry-related words and generating a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
Step S2: obtaining a training corpus, segmenting it into multiple words, and assigning semantics to the multiple words; wherein the power-industry-related words include the multiple words;
Step S3: generating a corresponding guidance word vector from the multiple words;
Step S4: finding the guide node in the Huffman tree that corresponds to the guidance word vector;
Step S5: inputting an initial word vector at the guide node and connecting the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
Step S6: determining the target word vector from the connection path between the target leaf node and the guide node.
2. The word vector training method for the power industry according to claim 1, characterized in that step S2 includes: filtering punctuation marks and stop words out of the training corpus and converting letter case to a unified format; further decomposing the corpus to form a training-corpus phrase set composed of multiple phrases; and finally expanding the training-corpus phrase set and assigning semantics to the expanded corpus phrases;
wherein expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples; then treating intermediate phrases as skippable by default, performing skip-style ordering over the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b tuples; wherein N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases.
3. The word vector training method for the power industry according to claim 2, characterized in that assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics.
4. The word vector training method for the power industry according to claim 3, characterized in that the guidance word vector is the word vector corresponding to the superordinate concept of the multiple words.
5. The word vector training method for the power industry according to claim 4, characterized in that step S4 includes:
when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, recording the first path produced by the connection and the nodes on that first path, and determining the guide node from the first path and the nodes on it.
6. The word vector training method for the power industry according to claim 5, characterized in that step S6 includes: inputting the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin, connecting them to a leaf node of the Huffman tree, recording the second path produced by the connection and the nodes on that second path, and determining the target word vector from the second path, the nodes on the second path, and the guide node.
7. A word vector training system for the power industry, for implementing the method of any one of claims 1 to 6, characterized by comprising:
a tree construction unit, configured to obtain power-industry-related words and generate a vocabulary from them, the vocabulary being organized as a Huffman tree, each leaf node of the Huffman tree representing one word;
a corpus processing unit, configured to obtain a training corpus, segment it into multiple words, and assign semantics to the multiple words, wherein the power-industry-related words include the multiple words;
a guidance word vector unit, configured to generate a corresponding guidance word vector from the multiple words;
a first node connection unit, configured to find the guide node in the Huffman tree that corresponds to the guidance word vector;
a second node connection unit, configured to input an initial word vector at the guide node and connect the initial word vector to a target leaf node of the Huffman tree, wherein the initial word vector has the same part of speech as the guidance word vector;
a target word vector determination unit, configured to determine the target word vector from the connection path between the target leaf node and the guide node.
8. The word vector training system for the power industry according to claim 7, characterized in that the corpus processing unit includes:
a first processing subunit, configured to filter punctuation marks and stop words out of the training corpus, convert letter case to a unified format, and further decompose the corpus to form a training-corpus phrase set composed of multiple phrases;
a second processing subunit, configured to expand the training-corpus phrase set, wherein expanding the training corpus includes: decomposing the training-corpus phrase set into a set of N phrases, sorting the N phrases in order and combining them into a x-tuples, then treating intermediate phrases as skippable by default, performing skip-style ordering over the N phrases and combining them into b new x-tuples, so that the training corpus is expanded from a to a+b tuples; wherein N is the total number of phrases obtained by decomposing the training corpus and an x-tuple is composed of x of the N phrases;
a third processing subunit, configured to assign semantics to the expanded corpus phrases; wherein assigning semantics to the expanded corpus phrases includes: assigning part-of-speech semantics to each phrase according to the hypernym-hyponym relations between phrases, then traversing the whole phrase set, and finally classifying all phrases that have been assigned semantics, clustering the phrases that share the same part-of-speech semantics.
9. The word vector training system for the power industry according to claim 8, characterized in that the guidance word vector is the word vector corresponding to the superordinate concept of the multiple words;
wherein the first node connection unit is specifically configured to, when the generated guidance word vector is input into the Huffman tree and connected to a leaf node of the Huffman tree, record the first path produced by the connection and the nodes on that first path, and determine the guide node from the first path and the nodes on it;
wherein the second node connection unit is specifically configured to input the word vectors corresponding to the multiple words into the Huffman tree with the guide node as the origin, connect them to a leaf node of the Huffman tree, record the second path produced by the connection and the nodes on that second path, and determine the target word vector from the second path, the nodes on the second path, and the guide node.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, the word vector training method for the power industry according to any one of claims 1 to 6 is implemented.
CN201910638876.1A 2019-07-16 2019-07-16 Word vector training method, system and medium for power industry Active CN110413779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910638876.1A CN110413779B (en) 2019-07-16 2019-07-16 Word vector training method, system and medium for power industry

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910638876.1A CN110413779B (en) 2019-07-16 2019-07-16 Word vector training method, system and medium for power industry

Publications (2)

Publication Number Publication Date
CN110413779A true CN110413779A (en) 2019-11-05
CN110413779B CN110413779B (en) 2022-05-03

Family

ID=68361538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910638876.1A Active CN110413779B (en) 2019-07-16 2019-07-16 Word vector training method, system and medium for power industry

Country Status (1)

Country Link
CN (1) CN110413779B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325026A (en) * 2020-02-18 2020-06-23 北京声智科技有限公司 Training method and system for word vector model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN108319666A (en) * 2018-01-19 2018-07-24 国网浙江省电力有限公司电力科学研究院 A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN108319666A (en) * 2018-01-19 2018-07-24 国网浙江省电力有限公司电力科学研究院 A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马存 (Ma Cun): "Research and Application of a Chinese Short-Text Clustering Algorithm Based on Word2Vec", China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325026A (en) * 2020-02-18 2020-06-23 北京声智科技有限公司 Training method and system for word vector model
CN111325026B (en) * 2020-02-18 2023-10-10 北京声智科技有限公司 Training method and system for word vector model

Also Published As

Publication number Publication date
CN110413779B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
JP6842167B2 (en) Summary generator, summary generation method and computer program
CN111090461B (en) Code annotation generation method based on machine translation model
CN109918657A (en) A method of extracting target keyword from text
WO2021073298A1 (en) Speech information processing method and apparatus, and intelligent terminal and storage medium
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
JP2014229122A (en) Translation device, method, and program
JP7337770B2 (en) Method and system for training a document-level natural language processing model
CN116701431A (en) Data retrieval method and system based on large language model
CN109933602A (en) A kind of conversion method and device of natural language and structured query language
CN110019779B (en) Text classification method, model training method and device
CN106202395A (en) Text clustering method and device
CN109117474A (en) Calculation method, device and the storage medium of statement similarity
US20220189468A1 (en) Abstract generation device, method, program, and recording medium
CN110781297B (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN107506345A (en) The construction method and device of language model
JP5975938B2 (en) Speech recognition apparatus, speech recognition method and program
CN107977368B (en) Information extraction method and system
CN114912425A (en) Presentation generation method and device
JP5355483B2 (en) Abbreviation Complete Word Restoration Device, Method and Program
CN113343692B (en) Search intention recognition method, model training method, device, medium and equipment
JP2018084627A (en) Language model learning device and program thereof
CN110413779A (en) It is a kind of for the term vector training method and its system of power industry, medium
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN110019875A (en) The generation method and device of index file
CN110580280A (en) Method, device and storage medium for discovering new words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant