CN107239444B - Word vector training method and system fusing part-of-speech and position information - Google Patents

Word vector training method and system fusing part-of-speech and position information

Info

Publication number
CN107239444B
CN107239444B
Authority
CN
China
Prior art keywords
speech
word
matrix
word vector
target
Prior art date
2017-05-26
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710384135.6A
Other languages
Chinese (zh)
Other versions
CN107239444A (en)
Inventor
文坤梅
李瑞轩
刘其磊
李玉华
辜希武
昝杰
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2017-05-26
Publication date
2019-10-08
Application filed by Huazhong University of Science and Technology
Priority to CN201710384135.6A
Publication of CN107239444A
Application granted
Publication of CN107239444B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a word vector training method and system fusing part-of-speech and position information. The method comprises: preprocessing data to obtain a target text; performing word segmentation and part-of-speech tagging on the target text; modeling the part-of-speech information and the position information; and fusing the part-of-speech and position information into the skip-gram model based on the negative-sampling strategy to learn word vectors, yielding target word vectors that can be evaluated on word analogy tasks and word similarity tasks. The present invention takes into account both the part-of-speech information and the position information of words and, having modeled them, makes full use of the part-of-speech information and the positional relations between parts of speech to aid word vector training; the parameter updates during training are also more reasonable.

Description

Word vector training method and system fusing part-of-speech and position information
Technical field
The invention belongs to the field of natural language processing, and more particularly relates to a word vector training method and system fusing part-of-speech and position information.
Background art
In recent years, the rapid development of mobile Internet technology has made the scale of data on the Internet grow rapidly and the complexity of that data increase sharply. Processing and analyzing these massive amounts of unstructured, unlabeled data has therefore become a major challenge.
Traditional machine learning methods use feature engineering to represent data symbolically so that models can be built and solved. However, with the bag-of-words representations commonly used in feature engineering, such as one-hot vectors, the feature dimensionality grows with the complexity of the data, leading to the curse of dimensionality; methods based on one-hot representations also suffer from the semantic gap phenomenon. With the proposal of the distributional hypothesis, "if two words have similar contexts, their meanings are also similar," distributional word representations based on this hypothesis have been proposed continuously, the most important being matrix-based, cluster-based, and word-vector-based distributional representations. Matrix-based and cluster-based distributional representations can express simple contextual information when the feature dimensionality is low, but when the dimensionality is high they are powerless to express context, especially complex context. Representations based on word vectors, in contrast, avoid the curse of dimensionality both when representing each word and when representing a word's context through linear combination. Moreover, because the distance between words can be measured by the cosine or Euclidean distance between their word vectors, the semantic gap of traditional bag-of-words models is also largely eliminated.
However, most existing research on word vectors concentrates on reducing model complexity by simplifying the neural network structure. Some work fuses information such as sentiment and topic, but research that fuses part-of-speech information is scarce; moreover, the part-of-speech granularity used in this scarce work is coarse, the part-of-speech information is exploited very insufficiently, and the way it is updated is also not very reasonable.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the object of the present invention is to provide a word vector training method and system fusing part-of-speech and position information, thereby solving the technical problems of the prior art that research fusing part-of-speech information uses a coarse part-of-speech granularity, exploits the part-of-speech information very insufficiently, and updates it in a less than reasonable way.
To achieve the above object, according to one aspect of the present invention, a word vector training method fusing part-of-speech and position information is provided, comprising the following steps:
S1, preprocessing the original text to obtain a target text;
S2, tagging the words in the target text with the parts of speech in a part-of-speech tag set, according to the contextual information of each word;
S3, building a part-of-speech association weight matrix M by modeling the tagged part-of-speech information, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set, each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column, the dimensions of M'_i are the same as those of M, and each element of M'_i is the co-occurrence probability of the two corresponding parts of speech at relative position i;
S4, fusing the modeled matrices M and M'_i into the skip-gram word vector model to build a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, step S2 specifically includes the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
S2.2, for each sentence in the target text, tagging each word with a part of speech from the part-of-speech tag set according to the word's contextual information in the sentence.
Preferably, step S3 specifically includes the following sub-steps:
S3.1, for each word in the target text, generating the word–part-of-speech pair formed by the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from these word–part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set and each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column;
S3.2, modeling the relative position i of the word pair corresponding to each part-of-speech pair, and building the position-specific part-of-speech association weight matrix M'_i for each position, wherein the dimensions of M'_i are the same as those of M and each element of M'_i is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column at relative position i.
Preferably, step S4 specifically includes the following sub-steps:
S4.1, building the initial objective function
$$\mathcal{L}_0 = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \log p(w_{t+i} \mid w_t),$$
wherein C denotes the vocabulary of the entire training corpus, Context(w_t) denotes the context word set consisting of the c words before and the c words after the target word w_t, and c denotes the window size;
S4.2, fusing the modeled matrices M and M'_i into the negative-sampling-based skip-gram word vector model to build the target model, and building the new objective function of the target model from the initial objective function:
$$\mathcal{L} = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \sum_{u \in \{w_t\} \cup \mathrm{NEG}(w_t)} \Big\{ L^{w_t}(u)\,\log \sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big) + \big(1-L^{w_t}(u)\big)\,\log\big(1-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big)\big) \Big\},$$
wherein NEG(w_t) is the set of negative samples drawn for the target word w_t; L^{w_t}(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w_{t+i})^⊤ is the transpose of the word vector v(w_{t+i}) of the context word w_{t+i}; and M'_i(T_u, T_{w_{t+i}}) is the co-occurrence probability of the two parts of speech T_u and T_{w_{t+i}} when their relative position is i;
S4.3, optimizing the new objective function so as to maximize its value, computing gradients and updating the parameters θ^u and v(w_{t+i}), and obtaining the target word vectors once the traversal of the entire training corpus is complete.
According to another aspect of the present invention, a word vector training system fusing part-of-speech and position information is provided, comprising:
a preprocessing module, for preprocessing the original text to obtain a target text;
a part-of-speech tagging module, for tagging the words in the target text with the parts of speech in a part-of-speech tag set according to the contextual information of each word;
a position and part-of-speech fusion module, for building a part-of-speech association weight matrix M by modeling the tagged part-of-speech information, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set, each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column, the dimensions of M'_i are the same as those of M, and each element of M'_i is the co-occurrence probability of the two corresponding parts of speech at relative position i;
a word vector learning module, for fusing the modeled matrices M and M'_i into the skip-gram word vector model to build a target model and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, the part-of-speech tagging module comprises:
a word segmentation module, for segmenting the target text into words, so as to distinguish all the words in the target text;
a part-of-speech tagging submodule, for tagging, for each sentence in the target text, each word with a part of speech from the part-of-speech tag set according to the word's contextual information in the sentence.
Preferably, the position and part-of-speech fusion module comprises:
a part-of-speech information modeling module, for generating, for each word in the target text, the word–part-of-speech pair formed by the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from these word–part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set and each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column;
a position information modeling module, for modeling the relative position i of the word pair corresponding to each part-of-speech pair and building the position-specific part-of-speech association weight matrix M'_i for each position, wherein the dimensions of M'_i are the same as those of M and each element of M'_i is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column at relative position i.
Preferably, the word vector learning module comprises:
an initial objective function building module, for building the initial objective function
$$\mathcal{L}_0 = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \log p(w_{t+i} \mid w_t),$$
wherein C denotes the vocabulary of the entire training corpus, Context(w_t) denotes the context word set consisting of the c words before and the c words after the target word w_t, and c denotes the window size;
a new objective function building module, for fusing the modeled matrices M and M'_i into the negative-sampling-based skip-gram word vector model to build the target model and building the new objective function of the target model from the initial objective function:
$$\mathcal{L} = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \sum_{u \in \{w_t\} \cup \mathrm{NEG}(w_t)} \Big\{ L^{w_t}(u)\,\log \sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big) + \big(1-L^{w_t}(u)\big)\,\log\big(1-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big)\big) \Big\},$$
wherein NEG(w_t) is the set of negative samples drawn for the target word w_t; L^{w_t}(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w_{t+i})^⊤ is the transpose of the word vector v(w_{t+i}) of the context word w_{t+i}; and M'_i(T_u, T_{w_{t+i}}) is the co-occurrence probability of the two parts of speech T_u and T_{w_{t+i}} when their relative position is i;
a word vector learning submodule, for optimizing the new objective function so as to maximize its value, computing gradients and updating the parameters θ^u and v(w_{t+i}), and obtaining the target word vectors once the traversal of the entire training corpus is complete.
In general, compared with the prior art, the method of the present invention can achieve the following beneficial effects:
(1) By building association matrices based on the part-of-speech association relations and the position association relations, the part-of-speech and position information between words can be modeled well.
(2) By fusing the modeled association matrices based on part-of-speech and position information into the negative-sampling-based skip-gram word vector learning model, better word vector results can be obtained on the one hand, and on the other hand the association weights between parts of speech in the corpus used for model training can also be obtained.
(3) Since the model uses the negative-sampling optimization strategy, its training speed is also fairly fast.
Brief description of the drawings
Fig. 1 is a flow diagram of a word vector training method fusing part-of-speech and position information disclosed by an embodiment of the present invention;
Fig. 2 is a modeling diagram of part-of-speech and position information disclosed by an embodiment of the present invention;
Fig. 3 is a simplified overall flow diagram disclosed by an embodiment of the present invention;
Fig. 4 is a flow diagram of another word vector training method fusing part-of-speech and position information disclosed by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Because existing word vector learning methods ignore parts of speech and their importance in natural language, the present invention provides a word vector learning method fusing part-of-speech and position information. The method is intended to take the part-of-speech associations and positional relations between words into account on top of the original skip-gram model, so that the model can train word vectors that fuse more information, and to use the learned word vectors to better complete word analogy tasks and word similarity tasks.
Fig. 1 is a flow diagram of the word vector learning method fusing part-of-speech and position information disclosed by an embodiment of the present invention; the method shown in Fig. 1 comprises the following steps:
S1, preprocessing the original text to obtain a target text;
The acquired original text contains a great deal of useless content, such as XML tags, web links, image links, and characters like "[", "@", "&", and "#". This content not only contributes nothing to word vector training but also becomes noise data that interferes with word vector learning, so it needs to be filtered out, for example with a perl script.
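As a minimal sketch of this filtering step, assuming a few illustrative regular expressions (the embodiment itself uses a perl script whose exact rules are not given):

```python
import re

def clean_text(raw: str) -> str:
    """Filter XML tags, web/image links, and stray noise characters out of raw corpus text."""
    text = re.sub(r"<[^>]+>", " ", raw)                 # XML/HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # web page and image links
    text = re.sub(r"[\[\]@&#]", " ", text)              # noise characters such as [ @ & #
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace
```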
S2, tagging the words in the target text with the parts of speech in a part-of-speech tag set, according to the contextual information of each word;
Since the method proposed by the present invention uses the part-of-speech information of words, a part-of-speech tagging tool is needed to tag the text. A word may have multiple parts of speech depending on the context in which it occurs; to handle this, the text is tagged in advance, each word receiving its part of speech according to its contextual information. Step S2 specifically includes the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
The text can be segmented, for example, with the tokenize tool in openNLP. In the sentence "I buy an apple.", without segmentation the common word "apple" would become "apple.", a word that does not exist, which would harm word vector learning.
S2.2, for each sentence in the target text, tagging each word with a part of speech from the part-of-speech tag set according to the word's context in the sentence.
Here an entire sentence is tagged at once, so that the multiple parts of speech a word can take depending on its context are distinguished. The parts of speech assigned to words belong to the Penn Treebank POS tag set.
For example, after tagging, the two sentences "i love you." and "she give her son too much love." become:
I_PRP (pronoun) love_VBP (verb) you_PRP (pronoun) ._.;
She_PRP (pronoun) give_VB (verb) her_PRP$ (pronoun) son_NN (noun) too_RB (adverb) much_JJ (adjective) love_NN (noun) ._.
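A short sketch of this tagging step. The embodiment uses openNLP; NLTK is substituted here purely for illustration, since its pos_tag function also emits Penn Treebank tags (the tags it actually assigns may differ slightly from the example above):

```python
import nltk

# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def tag_sentence(sentence: str) -> list[tuple[str, str]]:
    """Tokenize one whole sentence and tag every token with a Penn Treebank POS tag."""
    tokens = nltk.word_tokenize(sentence)  # separates "love." into "love" and "."
    return nltk.pos_tag(tokens)

print(tag_sentence("she give her son too much love."))
# e.g. [('she', 'PRP'), ('give', 'VB'), ('her', 'PRP$'), ('son', 'NN'),
#       ('too', 'RB'), ('much', 'JJ'), ('love', 'NN'), ('.', '.')]
```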
S3, building a part-of-speech association weight matrix M by modeling the tagged part-of-speech information, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set, each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column, the dimensions of M'_i are the same as those of M, and each element of M'_i is the co-occurrence probability of the two corresponding parts of speech at relative position i. Fig. 2 is a modeling diagram of part-of-speech and position information disclosed by an embodiment of the present invention, in which T_0 to T_N on the rows and columns denote parts of speech and M'_i(T_t, T_{t-2}) denotes the co-occurrence probability of parts of speech T_t and T_{t-2} at relative position i.
After the parts of speech of the words have been obtained, the part-of-speech information must first be modeled before it can participate in the word vector learning model and the new model can be solved. The goal of the modeling is to establish a part-of-speech association matrix whose row and column dimensions both equal the number of part-of-speech types in the tag set, each element of which is the co-occurrence probability of two parts of speech. In addition, the positional relation must also be modeled, because the positional relation between two co-occurring parts of speech is also very important. Step S3 specifically includes the following sub-steps:
S3.1, for each word in the target text, generating the word–part-of-speech pair formed by the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from these word–part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set and each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column;
For example, for the word son in "she give her son too much love.", whose part of speech is NN, and the word her, whose part of speech is PRP, the element at the row corresponding to PRP and the column corresponding to NN is the co-occurrence probability (i.e., the weight) of the two parts of speech.
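A sketch of how M could be estimated, assuming the co-occurrence probability is a window co-occurrence count normalized over all part-of-speech pairs (the patent does not spell out the estimator); the sparse dictionary stands in for the dense matrix:

```python
from collections import Counter

def build_pos_matrix(tagged_sentences, window=3):
    """Estimate the part-of-speech association weight matrix M from tagged
    sentences, each given as a list of (word, pos) pairs from the tagger."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [pos for _, pos in sent]
        for t in range(len(tags)):
            # count every ordered POS pair inside the window around position t
            for j in range(max(0, t - window), min(len(tags), t + window + 1)):
                if j != t:
                    counts[(tags[t], tags[j])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}  # counts -> probabilities
```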
S3.2, modeling the relative position i of the word pair corresponding to each part-of-speech pair, and building the position-specific part-of-speech association weight matrix M'_i for each position, wherein the dimensions of M'_i are the same as those of M and each element of M'_i is the co-occurrence probability (weight) of the part of speech indexed by its row and the part of speech indexed by its column at relative position i.
For example, if the window size is 2c, then i ∈ [-c, c]. When the window size is 6, the six matrices M'_{-3}, M'_{-2}, M'_{-1}, M'_1, M'_2, M'_3 are established.
For example, for son and her in "she give her son too much love.", when son is the target word, the part-of-speech and position association weight for the parts of speech of these two words is M'_{-1}(PRP, NN).
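A companion sketch for the position-specific matrices M'_i, under the same normalization assumption and following the index convention of the example above (the part of speech of the context word first, that of the target word second):

```python
from collections import Counter

def build_positional_pos_matrices(tagged_sentences, c=3):
    """Estimate one matrix M'_i per relative position i in [-c, c], i != 0."""
    counts = {i: Counter() for i in range(-c, c + 1) if i != 0}
    for sent in tagged_sentences:
        tags = [pos for _, pos in sent]
        for t, target_tag in enumerate(tags):       # t indexes the target word
            for i in counts:                        # i is the relative position
                j = t + i
                if 0 <= j < len(tags):
                    counts[i][(tags[j], target_tag)] += 1
    return {i: {pair: n / sum(ctr.values()) for pair, n in ctr.items()}
            for i, ctr in counts.items() if ctr}
```

With the example sentence and son as the target word, her at offset -1 contributes one count to the entry of M'_{-1} indexed by her's part of speech and NN.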
S4, fusing the modeled matrices M and M'_i into the skip-gram word vector model to build a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Step S4 specifically includes the following sub-steps:
S4.1, building the initial objective function
$$\mathcal{L}_0 = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \log p(w_{t+i} \mid w_t),$$
wherein C denotes the vocabulary of the entire training corpus, Context(w_t) denotes the context word set consisting of the c words before and the c words after the target word w_t, and c denotes the window size;
The idea is the same as in the skip-gram model: each word w_{t+i} in the context is predicted from the target word w_t, where i denotes the positional relation between w_{t+i} and w_t. Take the sample (Context(w_t), w_t) as an example, where |Context(w_t)| = 2c and Context(w_t) consists of the c words before and the c words after w_t. The final optimization goal of the target model is still, over the entire training corpus, to maximize the prediction of the context words from each target word w_t, i.e., to optimize the initial objective function.
For example, for the sample "she give her son too much love." with target word w_t = son and c = 3, Context(w_t) = {she, give, her, too, much, love}.
S4.2, fusing the modeled matrices M and M'_i into the negative-sampling-based skip-gram word vector model to build the target model, and building the new objective function of the target model from the initial objective function:
$$\mathcal{L} = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \sum_{u \in \{w_t\} \cup \mathrm{NEG}(w_t)} \Big\{ L^{w_t}(u)\,\log \sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big) + \big(1-L^{w_t}(u)\big)\,\log\big(1-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big)\big) \Big\},$$
wherein NEG(w_t) is the set of negative samples drawn for the target word w_t; L^{w_t}(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w_{t+i})^⊤ is the transpose of the word vector v(w_{t+i}) of the context word w_{t+i}; and M'_i(T_u, T_{w_{t+i}}) is the co-occurrence probability of the two parts of speech T_u and T_{w_{t+i}} when their relative position is i;
For example, in the sample "she give her son too much love.", the word son is the positive sample, so its label is 1, while other words such as dog or flower are negative samples with label 0.
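A sketch of the per-pair term of the new objective as reconstructed above; placing the weight M'_i(T_u, T_{w_{t+i}}) inside the sigmoid is our reading of the in-text definitions rather than a formula quoted verbatim from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_term(v_ctx, theta_u, label, pos_weight, eps=1e-10):
    """Contribution of one (context word, sample word) pair to the objective.
    v_ctx:      word vector v(w_{t+i}) of the context word
    theta_u:    auxiliary vector theta^u of the sample word u
    label:      L^{w_t}(u), 1 for the positive sample, 0 for a negative one
    pos_weight: M'_i(T_u, T_{w_{t+i}}), POS co-occurrence probability at offset i"""
    score = sigmoid(pos_weight * np.dot(v_ctx, theta_u))
    return label * np.log(score + eps) + (1 - label) * np.log(1 - score + eps)
```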
Fig. 3 is a simplified overall flow diagram disclosed by an embodiment of the present invention. The constructed target model has three layers, an input layer, a projection layer, and an output layer, wherein:
the input of the input layer is the center word w(t), and its output is the word vector corresponding to the center word w(t);
the projection layer mainly projects the output of the input layer; in this model both the input and the output of the projection layer are the word vector of the center word w(t);
the output layer mainly uses the center word w(t) to predict the word vectors of the context words such as w(t-2), w(t-1), w(t+1), and w(t+2).
The main intent of the present invention is to take the parts of speech and positional relations of the center word and its context words into account when predicting the context words from the center word w(t).
S4.3, optimizing the new objective function so as to maximize its value, computing gradients and updating the parameters θ^u and v(w_{t+i}), and obtaining the target word vectors once the traversal of the entire training corpus is complete.
For example, the new objective function can be optimized with stochastic gradient ascent (Stochastic Gradient Ascent, SGA) so as to maximize its value, the parameters θ^u and v(w_{t+i}) being updated from the computed gradients; the target word vectors are then obtained once the entire training corpus has been traversed.
Optionally, the gradients can be computed and the parameters updated, yielding the target word vectors, as follows:
$$\theta^{u} \leftarrow \theta^{u} + \eta\,\big[L^{w_t}(u)-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\,v(w_{t+i})^{\top}\theta^{u}\big)\big]\,M'_{i}(T_{u},T_{w_{t+i}})\,v(w_{t+i}),$$
$$v(w_{t+i}) \leftarrow v(w_{t+i}) + \eta \sum_{u \in \{w_t\}\cup \mathrm{NEG}(w_t)} \big[L^{w_t}(u)-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\,v(w_{t+i})^{\top}\theta^{u}\big)\big]\,M'_{i}(T_{u},T_{w_{t+i}})\,\theta^{u},$$
wherein η is the learning rate.
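A sketch of one such update step for a single context word, following the word2vec convention of accumulating the gradient for the context word vector before applying it; the function name and the learning rate value are illustrative assumptions:

```python
import numpy as np

def sga_step(v_ctx, samples, lr=0.025):
    """One stochastic-gradient-ascent update for one context word.
    v_ctx:   word vector v(w_{t+i}) of the context word, updated in place
    samples: (theta_u, label, pos_weight) triples for u in {w_t} union NEG(w_t),
             where pos_weight is M'_i(T_u, T_{w_{t+i}})"""
    e = np.zeros_like(v_ctx)                 # accumulated gradient for v_ctx
    for theta_u, label, pos_weight in samples:
        score = 1.0 / (1.0 + np.exp(-pos_weight * np.dot(v_ctx, theta_u)))
        g = lr * (label - score) * pos_weight
        e += g * theta_u                     # accumulate before theta_u changes
        theta_u += g * v_ctx                 # update the auxiliary vector
    v_ctx += e                               # update the context word vector last
    return v_ctx
```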
Fig. 4 is a flow diagram of another word vector training method fusing part-of-speech and position information provided by an embodiment of the present invention. The method shown in Fig. 4 comprises five steps: data preprocessing, word segmentation and part-of-speech tagging, modeling of part-of-speech and position information, word vector training, and task evaluation. Data preprocessing, word segmentation and part-of-speech tagging, the modeling of part-of-speech and position information, and word vector training proceed as described in the first embodiment. For task evaluation, once the target word vectors fusing part-of-speech and position information have been learned as above, they can be used in tasks such as word analogy and word similarity, which mainly comprise the following two steps:
Performing the word analogy task with the learned target word vectors. For example, for the two word pairs <king, queen> and <man, woman>, computing the corresponding word vectors of these words reveals a relation of the form v(king) - v(queen) = v(man) - v(woman).
Performing the word similarity task with the learned target word vectors. For example, given a word such as "dog", computing the cosine or Euclidean distance between other words and "dog" can yield the top N words most closely related to "dog", such as "puppy" and "cat".
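A sketch of both evaluation steps over a trained vector table, assuming vectors is a dict mapping each word to its numpy target word vector; most_similar and analogy are hypothetical helper names, not part of the patent:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(vectors, query, n=5):
    """Top-n words closest to `query` by cosine similarity (word similarity task)."""
    q = vectors[query]
    ranked = sorted((w for w in vectors if w != query),
                    key=lambda w: cosine(vectors[w], q), reverse=True)
    return ranked[:n]

def analogy(vectors, a, b, c, n=1):
    """Word analogy task: find w such that v(a) - v(b) = v(c) - v(w),
    e.g. analogy(vectors, "king", "queen", "man") should yield "woman"."""
    target = vectors[b] - vectors[a] + vectors[c]
    ranked = sorted((w for w in vectors if w not in (a, b, c)),
                    key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:n]
```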
As those skilled in the art will readily appreciate, the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (8)

1. A word vector training method fusing part-of-speech and position information, characterized by comprising the following steps:
S1, preprocessing the original text to obtain a target text;
S2, tagging the words in the target text with the parts of speech in a part-of-speech tag set according to the contextual information of each word;
S3, building a part-of-speech association weight matrix M by modeling the tagged part-of-speech information, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set, each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column, the dimensions of M'_i are the same as those of M, and each element of M'_i is the co-occurrence probability of the two corresponding parts of speech at relative position i;
S4, fusing the modeled matrices M and M'_i into the skip-gram word vector model to build a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
2. The method according to claim 1, characterized in that step S2 specifically comprises the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
S2.2, for each sentence in the target text, tagging each word with a part of speech from the part-of-speech tag set according to the word's contextual information in the sentence.
3. The method according to claim 1 or 2, characterized in that step S3 specifically comprises the following sub-steps:
S3.1, for each word in the target text, generating the word–part-of-speech pair formed by the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from these word–part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set and each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column;
S3.2, modeling the relative position i of the word pair corresponding to each part-of-speech pair, and building the position-specific part-of-speech association weight matrix M'_i for each position, wherein the dimensions of M'_i are the same as those of M and each element of M'_i is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column at relative position i.
4. The method according to claim 3, characterized in that step S4 specifically comprises the following sub-steps:
S4.1, building the initial objective function
$$\mathcal{L}_0 = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \log p(w_{t+i} \mid w_t),$$
wherein C denotes the vocabulary of the entire training corpus, Context(w_t) denotes the context word set consisting of the c words before and the c words after the target word w_t, and c denotes the window size;
S4.2, fusing the modeled matrices M and M'_i into the negative-sampling-based skip-gram word vector model to build the target model, and building the new objective function of the target model from the initial objective function:
$$\mathcal{L} = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \sum_{u \in \{w_t\} \cup \mathrm{NEG}(w_t)} \Big\{ L^{w_t}(u)\,\log \sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big) + \big(1-L^{w_t}(u)\big)\,\log\big(1-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big)\big) \Big\},$$
wherein NEG(w_t) is the set of negative samples drawn for the target word w_t; L^{w_t}(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w_{t+i})^⊤ is the transpose of the word vector v(w_{t+i}) of the context word w_{t+i}; and M'_i(T_u, T_{w_{t+i}}) is the co-occurrence probability of the two parts of speech T_u and T_{w_{t+i}} when their relative position is i;
S4.3, optimizing the new objective function so as to maximize its value, computing gradients and updating the parameters θ^u and v(w_{t+i}), and obtaining the target word vectors once the traversal of the entire training corpus is complete.
5. A word vector training system fusing part-of-speech and position information, characterized by comprising:
a preprocessing module, for preprocessing the original text to obtain a target text;
a part-of-speech tagging module, for tagging the words in the target text with the parts of speech in a part-of-speech tag set according to the contextual information of each word;
a position and part-of-speech fusion module, for building a part-of-speech association weight matrix M by modeling the tagged part-of-speech information, and modeling the relative position i of the word pair corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set, each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column, the dimensions of M'_i are the same as those of M, and each element of M'_i is the co-occurrence probability of the two corresponding parts of speech at relative position i;
a word vector learning module, for fusing the modeled matrices M and M'_i into the skip-gram word vector model to build a target model and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
6. The system according to claim 5, characterized in that the part-of-speech tagging module comprises:
a word segmentation module, for segmenting the target text into words, so as to distinguish all the words in the target text;
a part-of-speech tagging submodule, for tagging, for each sentence in the target text, each word with a part of speech from the part-of-speech tag set according to the word's contextual information in the sentence.
7. The system according to claim 5 or 6, characterized in that the position and part-of-speech fusion module comprises:
a part-of-speech information modeling module, for generating, for each word in the target text, the word–part-of-speech pair formed by the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from these word–part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech types in the tag set and each element of M is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column;
a position information modeling module, for modeling the relative position i of the word pair corresponding to each part-of-speech pair and building the position-specific part-of-speech association weight matrix M'_i for each position, wherein the dimensions of M'_i are the same as those of M and each element of M'_i is the co-occurrence probability of the part of speech indexed by its row and the part of speech indexed by its column at relative position i.
8. The system according to claim 7, characterized in that the word vector learning module comprises:
an initial objective function building module, for building the initial objective function
$$\mathcal{L}_0 = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \log p(w_{t+i} \mid w_t),$$
wherein C denotes the vocabulary of the entire training corpus, Context(w_t) denotes the context word set consisting of the c words before and the c words after the target word w_t, and c denotes the window size;
a new objective function building module, for fusing the modeled matrices M and M'_i into the negative-sampling-based skip-gram word vector model to build the target model and building the new objective function of the target model from the initial objective function:
$$\mathcal{L} = \sum_{w_t \in C} \sum_{w_{t+i} \in \mathrm{Context}(w_t)} \sum_{u \in \{w_t\} \cup \mathrm{NEG}(w_t)} \Big\{ L^{w_t}(u)\,\log \sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big) + \big(1-L^{w_t}(u)\big)\,\log\big(1-\sigma\big(M'_{i}(T_{u},T_{w_{t+i}})\, v(w_{t+i})^{\top}\theta^{u}\big)\big) \Big\},$$
wherein NEG(w_t) is the set of negative samples drawn for the target word w_t; L^{w_t}(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w_{t+i})^⊤ is the transpose of the word vector v(w_{t+i}) of the context word w_{t+i}; and M'_i(T_u, T_{w_{t+i}}) is the co-occurrence probability of the two parts of speech T_u and T_{w_{t+i}} when their relative position is i;
a word vector learning submodule, for optimizing the new objective function so as to maximize its value, computing gradients and updating the parameters θ^u and v(w_{t+i}), and obtaining the target word vectors once the traversal of the entire training corpus is complete.
CN201710384135.6A 2017-05-26 2017-05-26 Word vector training method and system fusing part-of-speech and position information Active CN107239444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) Word vector training method and system fusing part-of-speech and position information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) Word vector training method and system fusing part-of-speech and position information

Publications (2)

Publication Number Publication Date
CN107239444A CN107239444A (en) 2017-10-10
CN107239444B true CN107239444B (en) 2019-10-08

Family

ID=59985183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710384135.6A Active CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part-of-speech and position information

Country Status (1)

Country Link
CN (1) CN107239444B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612B (en) * 2017-11-21 2020-07-31 Tencent Technology (Shenzhen) Co., Ltd. Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108229818B (en) * 2017-12-29 2021-07-13 Wangzhi Tianyuan Technology Group Co., Ltd. Method and device for establishing a talent-value measurement coordinate system
CN110348001B (en) * 2018-04-04 2022-11-25 Tencent Technology (Shenzhen) Co., Ltd. Word vector training method and server
CN108628834B (en) * 2018-05-14 2022-04-15 National Computer Network and Information Security Management Center Word representation learning method based on syntactic dependency relations
CN108733653B (en) * 2018-05-18 2020-07-10 Huazhong University of Science and Technology Sentiment analysis method for a skip-gram model fusing part-of-speech and semantic information
CN108875810B (en) * 2018-06-01 2020-04-28 Alibaba Group Holding Limited Method and device for sampling negative examples from a word frequency table for a training corpus
CN109308353B (en) * 2018-09-17 2023-08-15 Dingfu Intelligent Technology Co., Ltd. Training method and device for word embedding model
CN109271636B (en) * 2018-09-17 2023-08-11 Dingfu Intelligent Technology Co., Ltd. Training method and device for word embedding model
CN109190126B (en) * 2018-09-17 2023-08-15 Beijing Ultrapower Software Co., Ltd. Training method and device for word embedding model
CN109344403B (en) * 2018-09-20 2020-11-06 Central South University Text representation method for enhancing semantic feature embedding
CN109271422B (en) * 2018-09-20 2021-10-08 Huazhong University of Science and Technology Method for searching subject-matter experts in social networks driven by false information
CN109325231B (en) * 2018-09-21 2023-07-04 Sun Yat-sen University Method for generating word vectors with a multi-task model
CN109639452A (en) * 2018-10-31 2019-04-16 Shenzhen University Social modeling training method, device, server and storage medium
CN109858024B (en) * 2019-01-04 2023-04-11 Sun Yat-sen University Word2vec-based housing-listing word vector training method and device
CN109858031B (en) * 2019-02-14 2023-05-23 Beijing Xiaomi Intelligent Technology Co., Ltd. Neural network model training and context prediction method and device
CN110276052B (en) * 2019-06-10 2021-02-12 University of Science and Technology Beijing Integrated method and device for automatic word segmentation and part-of-speech tagging of ancient Chinese
CN110287236B (en) * 2019-06-25 2024-03-19 Ping An Technology (Shenzhen) Co., Ltd. Data mining method, system and terminal equipment based on interview information
CN110534087B (en) * 2019-09-04 2022-02-15 Graduate School at Shenzhen, Tsinghua University Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN111506726B (en) * 2020-03-18 2023-09-22 Dazhen (Hangzhou) Technology Co., Ltd. Short text clustering method and device based on part-of-speech coding, and computer equipment
CN111859910B (en) * 2020-07-15 2022-03-18 Shanxi University Word feature representation method fusing position information for semantic role recognition
CN111832282B (en) * 2020-07-16 2023-04-14 Ping An Technology (Shenzhen) Co., Ltd. Method and device for fine-tuning a BERT model fused with external knowledge, and computer equipment
CN113010670B (en) * 2021-02-22 2023-09-19 Tencent Technology (Shenzhen) Co., Ltd. Account information clustering method, detection method, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 Part-of-speech tagging system, and device and method thereof for training a part-of-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Also Published As

Publication number Publication date
CN107239444A (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN107239444B (en) Word vector training method and system fusing part-of-speech and position information
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
Liao et al. CNN for situations understanding based on sentiment analysis of twitter data
CN107133211B (en) Composition scoring method based on attention mechanism
CN110287481B (en) Named entity corpus labeling training system
CN106776581B (en) Subjective text emotion analysis method based on deep learning
CN103207855B (en) Fine-grained sentiment analysis system and method for product review information
CN109325112B (en) Cross-language sentiment analysis method and apparatus based on emoji
CN109344391A (en) Neural-network-based Chinese news text summarization method with multi-feature fusion
CN108182295A (en) Enterprise knowledge graph attribute extraction method and system
CN108519890A (en) Robust code summary generation method based on a self-attention mechanism
CN107729309A (en) Method and device for deep-learning-based Chinese semantic analysis
CN112487143A (en) Public opinion big data analysis-based multi-label text classification method
CN110807328A (en) Named entity recognition method and system for legal documents based on multi-strategy fusion
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN107066445A (en) Deep learning method for attribute sentiment word vectors
Zheng et al. Automatic generation of news comments based on gated attention neural networks
CN106844345B (en) Multi-task word segmentation method based on linear parameter constraints
CN109359297A (en) Relation extraction method and system
CN111710428B (en) Biomedical text representation method for modeling global and local context interaction
CN111966820B (en) Method and system for constructing and extracting generative abstract model
Yang et al. Deep learning and its applications to natural language processing
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
Wang et al. Tdjee: A document-level joint model for financial event extraction
CN113901208A (en) Sentiment tendency analysis method for cross-language comments fused with topic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant