CN107239444B - Word vector training method and system fusing part of speech and position information - Google Patents

Word vector training method and system fusing part of speech and position information

Info

Publication number
CN107239444B
CN107239444B CN201710384135.6A
Authority
CN
China
Prior art keywords
speech
word
matrix
target
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710384135.6A
Other languages
Chinese (zh)
Other versions
CN107239444A (en)
Inventor
文坤梅
李瑞轩
刘其磊
李玉华
辜希武
昝杰
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710384135.6A priority Critical patent/CN107239444B/en
Publication of CN107239444A publication Critical patent/CN107239444A/en
Application granted granted Critical
Publication of CN107239444B publication Critical patent/CN107239444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word vector training method and system fusing part-of-speech and position information. The method comprises: preprocessing data to obtain a target text; performing word segmentation and part-of-speech tagging on the target text; modeling the part-of-speech information and the position information; and fusing part-of-speech and position information on the basis of a skip-gram model with a negative sampling strategy to learn word vectors and obtain target word vectors, which are used for evaluation on word analogy tasks and word similarity tasks. The invention considers the part-of-speech and position information of words and, on the basis of modeling both, makes full use of the part-of-speech information of words and the position information between parts of speech to assist the training of word vectors; the updating of parameters during training is also more reasonable.

Description

Word vector training method and system fusing part of speech and position information
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a word vector training method and system fusing part of speech and position information.
Background
In recent years, with the rapid development of mobile internet technology, both the scale and the complexity of data in the internet have grown rapidly. This makes the processing and analysis of such massive unstructured and unlabeled data a difficult problem.
Traditional machine learning adopts feature engineering to symbolize data so as to facilitate modeling and model solving, but the common bag-of-words representation techniques in feature engineering, such as One-hot vectors, suffer from rapidly growing feature dimensionality as data complexity increases, which causes the dimension disaster problem. Methods based on One-hot vector representation also exhibit the semantic gap phenomenon. After the distributional hypothesis ("if two words are similar in context, their semantics are also similar") was proposed, word distribution representation techniques based on it have been proposed continuously. The most important among them are matrix-based distribution representation, cluster-based distribution representation and word-vector-based distribution representation. However, distribution representation methods based on either matrices or clusters can express only simple context information when the feature dimension is small; when the feature dimension is high, the model cannot express the context, especially complex contexts. Word-vector-based representation techniques, by contrast, avoid the dimension disaster problem both for the representation of each word and for the representation of a word's context through linear combination. And because the distance between words can be measured by the cosine or Euclidean distance between the corresponding word vectors, the semantic gap problem of the traditional bag-of-words model is eliminated to a great extent.
However, most existing word vector research focuses on reducing model complexity by simplifying the neural network structure in the model, and some work integrates sentiment, topic and other information; research work that integrates part-of-speech information is very scarce, and what little exists uses a coarse part-of-speech granularity, does not make full use of the part-of-speech information, and does not update the part-of-speech information reasonably.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the present invention aims to provide a word vector training method and system fusing part of speech and position information, so as to solve the technical problems in the prior art that research work fusing part-of-speech information targets a coarse part-of-speech granularity, makes insufficient use of part-of-speech information, and updates part-of-speech information unreasonably.
To achieve the above object, according to an aspect of the present invention, there is provided a word vector training method fusing part of speech and position information, including the steps of:
s1, preprocessing the original text to obtain a target text;
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
S4, fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, step S2 specifically includes the following sub-steps:
s2.1, segmenting the target text to distinguish all words in the target text;
and S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
Preferably, step S3 specifically includes the following sub-steps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
Preferably, step S4 specifically includes the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
S4.2, fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
S4.3, optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
According to another aspect of the present invention, there is provided a word vector training system fusing part of speech and position information, including:
the preprocessing module is used for preprocessing the original text to obtain a target text;
the part-of-speech tagging module is used for tagging the part of speech of the word in the target text by adopting the part of speech in the part-of-speech tagging set according to the context information of the word;
the position part-of-speech fusion module is used for modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
a word vector learning module for fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
Preferably, the part of speech tagging module comprises:
the word segmentation module is used for segmenting the target text to distinguish all words in the target text;
and the part-of-speech tagging submodule is used for performing part-of-speech tagging on each sentence in the target text by adopting the part-of-speech in the part-of-speech tagging set according to the context information of the word in the sentence.
Preferably, the location part-of-speech fusion module includes:
the part-of-speech information modeling module is used for generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
the position information modeling module is used for modeling the relative position i of the corresponding word pairs according to their parts of speech and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
Preferably, the word vector learning module comprises:
an initial objective function construction module, configured to construct an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
a new objective function construction module for fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
a word vector learning submodule, used for optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
In general, compared with the prior art, the method of the invention can obtain the following beneficial effects:
(1) By constructing association matrices based on the part-of-speech association relation and the position association relation, the part-of-speech and position information among words can be well modeled.
(2) By fusing the well-modeled association matrices based on part-of-speech and position information into the negative-sampling-based skip-gram word vector learning model, a better word vector result can be obtained on the one hand, and on the other hand the association relation weights among the parts of speech in the corpus used for model training can also be obtained.
(3) Because the model adopts the optimization strategy of negative sampling, the training speed of the model is higher.
Drawings
Fig. 1 is a schematic flowchart of a word vector training method fusing part of speech and position information according to an embodiment of the present invention;
FIG. 2 is a modeling model diagram of part of speech and location information according to an embodiment of the present invention;
FIG. 3 is a simplified overall flow chart according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating another word vector training method for fusing parts of speech and location information according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a word vector learning method fusing part-of-speech and position information, addressing the fact that existing word vector learning methods ignore parts of speech and their importance in natural language. The method aims to take the part-of-speech association relation and the position relation between words into account on the basis of the original skip-gram model, so that the model can train word vector results fused with more information and the learned word vectors can better complete word analogy tasks and word similarity tasks.
Fig. 1 is a schematic flow chart of a word vector learning method with fusion of parts of speech and position information according to an embodiment of the present invention, where the method shown in fig. 1 includes the following steps:
s1, preprocessing the original text to obtain a target text;
Since a large amount of useless information such as XML tags, web page links and picture links exists in the obtained original text, and such useless information does not benefit the training of word vectors and can even become noise data that impairs word vector learning, it needs to be filtered out; the filtering can be done with a Perl script, for example.
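As an illustration, this filtering can be reproduced in a few lines. The following Python sketch is a stand-in for the Perl script mentioned above; its regular expressions are illustrative assumptions, not the patent's actual rules:

```python
import re

def clean_text(raw: str) -> str:
    """Remove useless information (XML tags, web page links, picture links)
    before word vector training; a minimal stand-in for the Perl script."""
    text = re.sub(r"<[^>]+>", " ", raw)        # XML/HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # web page and picture links
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

print(clean_text('<doc>I buy an apple. See https://example.com</doc>'))
# -> 'I buy an apple. See'
```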
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
because part-of-speech information of words is used in the method provided by the invention, part-of-speech tagging needs to be performed on the text by using some part-of-speech tagging tools. In order to solve the problem that a word may have multiple parts of speech due to different contexts, the text can be part of speech tagged in advance, and the part of speech tagging can be performed by means of the context information of the text. Step S2 specifically includes the following substeps:
s2.1, segmenting the target text to distinguish all words in the target text;
the tokenize segmentation tool in openNLP can be used to segment the text, for example, if the common word "applet" in "I buy an applet" is not segmented, the common word "applet" becomes the word "applet", which does not exist, and the learning of the word vector is influenced.
And S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
The part-of-speech tagging is performed on the whole sentence at one time, so that the multiple parts of speech of the same word can be distinguished according to the word's context. The parts of speech assigned to words here belong to the Penn Treebank POS tag set.
For example, the two sentences "I love you." and "She give her son too much love." after part-of-speech tagging:
i _ PRP (pronoun) love _ VBP (verb) you _ PRP. _;
she _ PRP (pronoun) give _ VB (verb) her _ PRP $ (pronoun) son _ NN _ to _ RB (adverb) much _ JJ (adjective) love _ NN.
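For illustration, the same style of annotation can be produced with NLTK's Penn Treebank tagger (a stand-in here, since the embodiment uses OpenNLP; that the tokenizer and tagger models are installed locally is an assumption):

```python
import nltk
# one-time setup, if the models are not yet present:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "She give her son too much love."
tokens = nltk.word_tokenize(sentence)  # word segmentation, cf. step S2.1
tagged = nltk.pos_tag(tokens)          # Penn Treebank POS tags, cf. step S2.2
print(tagged)
# e.g. [('She', 'PRP'), ('give', 'VBP'), ('her', 'PRP$'), ('son', 'NN'),
#       ('too', 'RB'), ('much', 'JJ'), ('love', 'NN'), ('.', '.')]
```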
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pair according to the part-of-speech, and constructing a position part-of-speech associated weight matrix M 'corresponding to the position'iWherein, the dimension of the matrix M is the category size of the part of speech in the part of speech tagging set, the elements in the matrix M are the co-occurrence probability of the part of speech of the word corresponding to the row of the element and the part of speech of the word corresponding to the column of the element, and the matrix M'iIs the same as matrix M in row-column dimension of matrix M'iThe element in (1) is the co-occurrence probability of the part of speech of the word corresponding to the row of the element and the part of speech of the word corresponding to the column of the element at a relative position i; FIG. 2 is a modeling model diagram of part of speech and location information according to an embodiment of the present invention, wherein T is in a row and column0~TNDenotes the part of speech, M'i(Tt,Tt-2) Representing part of speech TtWith part of speech Tt-2Probability of co-occurrence at relative position i.
After the parts of speech of words are obtained, to let the part-of-speech information participate in the word vector learning model and to solve the new model, the part-of-speech information must first be modeled. The aim of the modeling is to establish a part-of-speech association relation matrix whose row and column dimensions equal the number of part-of-speech categories in the tag set, with the elements of the matrix being the co-occurrence probabilities of pairs of parts of speech. In addition, the position relation is modeled, because the positional relation between two parts of speech when they co-occur is important. Step S3 specifically includes the following substeps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
For example, for the word son in "She give her son too much love.", whose part of speech is NN, and the word her, whose part of speech is PRP, the element specified by the row corresponding to PRP and the column corresponding to NN in the matrix is the co-occurrence probability (i.e., weight) of the two parts of speech.
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability (i.e., weight) of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
For example, if the window size is 2c, then i ∈ [-c, c]. When the window size is 6, six matrices are established: M'_{-3}, M'_{-2}, M'_{-1}, M'_{1}, M'_{2}, M'_{3}.
For example, for son and her in "She give her son too much love.", when son is the target word, the associated weight of the parts of speech and the position corresponding to the two words is M'_{-1}(PRP, NN).
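A minimal sketch of this construction follows, assuming a corpus already tagged as lists of (word, part-of-speech) pairs and reading the matrix elements as normalized co-occurrence counts (the normalization scheme and the (context POS, target POS) row/column orientation are assumptions inferred from the example M'_{-1}(PRP, NN) above):

```python
from collections import Counter

def normalize(cnt):
    total = sum(cnt.values())
    return {k: v / total for k, v in cnt.items()} if total else {}

def build_pos_matrices(tagged_corpus, c=3):
    """Build M and M'_i (i in [-c, c], i != 0) as dicts mapping a
    (context_pos, target_pos) pair to a co-occurrence probability."""
    M = Counter()
    M_i = {i: Counter() for i in range(-c, c + 1) if i != 0}
    for sent in tagged_corpus:
        for t, (_, pos_t) in enumerate(sent):
            for i in M_i:
                if 0 <= t + i < len(sent):
                    pair = (sent[t + i][1], pos_t)  # (context POS, target POS)
                    M[pair] += 1
                    M_i[i][pair] += 1
    return normalize(M), {i: normalize(cnt) for i, cnt in M_i.items()}

sent = [("she", "PRP"), ("give", "VB"), ("her", "PRP$"), ("son", "NN"),
        ("too", "RB"), ("much", "JJ"), ("love", "NN"), (".", ".")]
M, M_i = build_pos_matrices([sent], c=3)
print(M_i[-1][("PRP$", "NN")])  # her one position before target son, cf. M'_{-1}(PRP, NN)
```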
S4, matrix M and matrix M 'after modeling'iAnd fusing the target word vectors into a skip-gram word vector model to construct a target model, and performing word vector learning by using the target model to obtain a target word vector, wherein the target word vector is used for a word analogy task and a word similarity task.
The step S4 specifically includes the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
the Skip-gram model is the same in concept, namely, the target word w is passedtWord v (w) in the prediction contextt+i) Wherein i represents wt+iAnd wtThe positional relationship therebetween. With sample (Context (w)t),wt) For example, where | Context (w)t) 2c, where Context (w)t) Is composed of a word wtThe front and back words are composed of c words. The final optimization goal of the target model is still for the entire training corpus such that all the passing target words wtTo predict the probability maximization of the context word, i.e. to optimize the initial objective function.
For example, in the sample "She give her son too much love.", the word son is the target word w_t; if c is 3, then Context(w_t) = {she, give, her, too, much, love}.
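As a quick illustration of how such samples are read off a sentence (a sketch, not the patent's code):

```python
def context_pairs(tokens, c=3):
    """Yield (target word, Context(w_t)) pairs with c words on each side."""
    for t, w in enumerate(tokens):
        yield w, tokens[max(0, t - c):t] + tokens[t + 1:t + 1 + c]

tokens = ["she", "give", "her", "son", "too", "much", "love"]
print(dict(context_pairs(tokens))["son"])
# -> ['she', 'give', 'her', 'too', 'much', 'love']
```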
S4.2, modeling matrix M and matrix M'iFusing the target model into a skip-gram word vector model based on negative sampling to construct a target model, and constructing a new target function of the target model according to the initial target function:wherein,NEG (w) is a negative sample set, L, that samples the target word ww(u) is the score of sample u, with a positive sample score of 1, a negative sample score of 0, θuFor the auxiliary vectors used by the sample words during the model training process,as context wordsCorresponding word vectorThe transpose of (a) is performed,is TuAndthe co-occurrence probability of the two parts of speech when the relative position relation is i;
For example, for the sample "She give her son too much love.", the word son is a positive sample with label 1, while other words such as dog, flower, etc. are negative samples with label 0.
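A sketch of one training step under this objective follows. Note that the placement of the weight M'_i(T_u, T_{w̃}) inside the sigmoid is this text's reconstruction from the symbol definitions above, not a formula reproduced verbatim from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(v_ctx, theta, samples, m_weight, lr=0.025):
    """One fused negative-sampling update for a single context word w~.
    v_ctx    : word vector v(w~), updated in place
    theta    : dict u -> auxiliary vector theta^u, updated in place
    samples  : list of (u, label), label 1 for the target word w and
               0 for each word in NEG(w)
    m_weight : dict u -> M'_i(T_u, T_{w~}) at the current relative position i
    """
    e = np.zeros_like(v_ctx)
    for u, label in samples:
        m = m_weight[u]
        g = lr * (label - sigmoid(m * v_ctx.dot(theta[u])))  # prediction error
        e += g * m * theta[u]          # accumulate gradient for v(w~)
        theta[u] += g * m * v_ctx      # update the auxiliary vector theta^u
    v_ctx += e                         # update the context word vector
```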
Fig. 3 is a simplified overall flow diagram disclosed in the embodiment of the present invention, and the constructed target model has three layers, i.e., an input layer, a projection layer, and an output layer. Wherein:
the input of the input layer is the central word w(t), and its output is the word vector corresponding to the central word w(t);
the projection layer mainly projects the output result of the input layer; in this model, both the input and the output of the projection layer are the word vector of the central word w(t);
the output layer mainly uses the central word w(t) to predict the word vectors of the context words such as w(t-2), w(t-1), w(t+1), w(t+2), etc.
The main purpose of the invention is to consider the part-of-speech and position relations between the central word and the context words when predicting the context words with the central word w(t).
S4.3, optimizing the new objective function, maximizing the value of the new objective function, and performing optimization on the parameter thetauAndperforming gradient calculation and updating, and obtaining target word vectors when the whole training corpus is traversed。
For example, a random Gradient Ascent (SGA) method may be used to optimize the new objective function, i.e., maximize the value of the new objective function. And for the parameter thetauAndand gradient calculation and updating, and obtaining the target word vector when the whole training corpus is traversed.
Optionally, the target word vector may be obtained by performing updating and gradient calculation in the following manner:
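The formulas at this point are not preserved in the text; a reconstruction consistent with the fused objective above, with learning rate $\eta$ and shorthand $m = M'_i(T_u, T_{\tilde{w}})$ (an assumption rather than the patent's verbatim derivation), is:

$$g = \eta\left[L^w(u) - \sigma\!\left(m\, v(\tilde{w})^\top \theta^u\right)\right], \qquad \theta^u \leftarrow \theta^u + g\, m\, v(\tilde{w}), \qquad e \leftarrow e + g\, m\, \theta^u,$$

$$v(\tilde{w}) \leftarrow v(\tilde{w}) + e \quad \text{after all } u \in \{w\} \cup NEG(w) \text{ have been processed.}$$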
Fig. 4 is a schematic flow chart of another word vector training method fusing part-of-speech and position information disclosed in the embodiment of the present invention; the method shown in fig. 4 comprises five steps: data preprocessing, word segmentation and part-of-speech tagging, modeling of part-of-speech and position information, word vector training, and task evaluation. The first four are the method steps described in embodiment 1; in task evaluation, the learned target word vectors carrying part-of-speech and position information can be used in tasks such as word analogy and word similarity. The task evaluation mainly comprises the following two parts:
and performing word analogy task by using the learned target word vector. For example, for two word pairs < king, queen > and < man, wman >, the word vectors corresponding to these word pairs are calculated to find that there is a relationship of v (king) -v (queen) -v (man) -v (wman).
Performing the word similarity task with the learned target word vectors. For example, given a word such as "dog", the top N words closely related to "dog", such as "puppy" and "cat", can be obtained by computing the cosine or Euclidean distance between the other words and "dog".
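Both evaluations reduce to a nearest-neighbor search under a vector distance; a minimal sketch over a dict of learned vectors (the vocabulary and vector values are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(vectors, query, exclude=(), topn=5):
    """Rank vocabulary words by cosine similarity to the query vector."""
    scored = [(w, cosine(v, query)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:topn]

# word similarity: nearest neighbors of "dog"
# most_similar(vectors, vectors["dog"], exclude={"dog"})

# word analogy: solve v(king) - v(queen) ≈ v(man) - v(x) for x,
# i.e. search near v(man) - v(king) + v(queen)
# most_similar(vectors, vectors["man"] - vectors["king"] + vectors["queen"],
#              exclude={"king", "queen", "man"})
```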
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A word vector training method fusing part of speech and position information is characterized by comprising the following steps:
s1, preprocessing the original text to obtain a target text;
S2, according to the context information of the words, tagging the parts of speech of the words in the target text with the parts of speech in the part-of-speech tag set;
S3, modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
S4, fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
2. The method according to claim 1, wherein step S2 comprises the following sub-steps:
s2.1, segmenting the target text to distinguish all words in the target text;
and S2.2, for each sentence in the target text, according to the context information of the word in the sentence, performing part-of-speech tagging on the word by using the part-of-speech in the part-of-speech tagging set.
3. The method according to claim 1 or 2, characterized in that step S3 comprises the following sub-steps:
S3.1, generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
S3.2, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
4. The method according to claim 3, wherein step S4 comprises the following sub-steps:
S4.1, constructing an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
S4.2, fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
S4.3, optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
5. A word vector training system fusing part of speech and position information is characterized by comprising:
the preprocessing module is used for preprocessing the original text to obtain a target text;
the part-of-speech tagging module is used for tagging the part of speech of the word in the target text by adopting the part of speech in the part-of-speech tagging set according to the context information of the word;
the position part-of-speech fusion module is used for modeling according to the tagged part-of-speech information to construct a part-of-speech associated weight matrix M, modeling the relative position i of the corresponding word pairs according to their parts of speech, and constructing a position part-of-speech associated weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column, the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i;
the word vector learning module is used for fusing the modeled matrix M and matrices M'_i into a skip-gram word vector model to construct a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
6. The system of claim 5, wherein the part-of-speech tagging module comprises:
the word segmentation module is used for segmenting the target text to distinguish all words in the target text;
and the part-of-speech tagging submodule is used for performing part-of-speech tagging on each sentence in the target text by adopting the part-of-speech in the part-of-speech tagging set according to the context information of the word in the sentence.
7. The system according to claim 5 or 6, wherein the location part-of-speech fusion module comprises:
the part-of-speech information modeling module is used for generating, for each word in the target text, a word/part-of-speech pair formed by the word and its corresponding part of speech, and constructing the part-of-speech associated weight matrix M from these pairs, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the part-of-speech tag set, and each element of the matrix M is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column;
the position information modeling module is used for modeling the relative position i of the corresponding word pairs according to their parts of speech and constructing the position part-of-speech associated weight matrix M'_i corresponding to each position, wherein the matrix M'_i has the same row and column dimensions as the matrix M, and each element of M'_i is the co-occurrence probability of the part of speech of the word corresponding to the element's row and the part of speech of the word corresponding to the element's column at relative position i.
8. The system of claim 7, wherein the word vector learning module comprises:
an initial objective function construction module, configured to construct an initial objective function:

$$\mathcal{L}_0 = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \log p(\tilde{w} \mid w)$$

wherein C represents the vocabulary of the whole training corpus, Context(w) represents the context word set consisting of the c words before and after the target word w, and c represents the window size;
a new objective function construction module for fusing the modeled matrix M and matrices M'_i into a negative-sampling-based skip-gram word vector model to construct a target model, and constructing a new objective function of the target model from the initial objective function:

$$\mathcal{L} = \sum_{w \in C} \sum_{\tilde{w} \in Context(w)} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u)\,\log \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big) + \big(1 - L^w(u)\big)\,\log\big[1 - \sigma\big(M'_i(T_u, T_{\tilde{w}})\, v(\tilde{w})^\top \theta^u\big)\big] \Big\}$$

wherein NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (a positive sample has label 1 and a negative sample has label 0), θ^u is the auxiliary vector used for the sample word u during model training, v(w̃)^⊤ is the transpose of the word vector v(w̃) corresponding to the context word w̃, σ(·) is the sigmoid function, and M'_i(T_u, T_{w̃}) is the co-occurrence probability of the two parts of speech T_u and T_{w̃} when their relative position relation is i;
a word vector learning submodule, used for optimizing the new objective function so as to maximize its value, performing gradient calculation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors once the whole training corpus has been traversed.
CN201710384135.6A 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information Active CN107239444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Publications (2)

Publication Number Publication Date
CN107239444A CN107239444A (en) 2017-10-10
CN107239444B true CN107239444B (en) 2019-10-08

Family

ID=59985183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710384135.6A Active CN107239444B (en) 2017-05-26 2017-05-26 Word vector training method and system fusing part of speech and position information

Country Status (1)

Country Link
CN (1) CN107239444B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108229818B (en) * 2017-12-29 2021-07-13 网智天元科技集团股份有限公司 Method and device for establishing talent value measuring and calculating coordinate system
CN110348001B (en) * 2018-04-04 2022-11-25 腾讯科技(深圳)有限公司 Word vector training method and server
CN108628834B (en) * 2018-05-14 2022-04-15 国家计算机网络与信息安全管理中心 Word expression learning method based on syntactic dependency relationship
CN108733653B (en) * 2018-05-18 2020-07-10 华中科技大学 Sentiment analysis method of Skip-gram model based on fusion of part-of-speech and semantic information
CN108875810B (en) * 2018-06-01 2020-04-28 阿里巴巴集团控股有限公司 Method and device for sampling negative examples from word frequency table aiming at training corpus
CN109190126B (en) * 2018-09-17 2023-08-15 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN109308353B (en) * 2018-09-17 2023-08-15 鼎富智能科技有限公司 Training method and device for word embedding model
CN109271636B (en) * 2018-09-17 2023-08-11 鼎富智能科技有限公司 Training method and device for word embedding model
CN109344403B (en) * 2018-09-20 2020-11-06 中南大学 Text representation method for enhancing semantic feature embedding
CN109271422B (en) * 2018-09-20 2021-10-08 华中科技大学 Social network subject matter expert searching method driven by unreal information
CN109325231B (en) * 2018-09-21 2023-07-04 中山大学 Method for generating word vector by multitasking model
CN109639452A (en) * 2018-10-31 2019-04-16 深圳大学 Social modeling training method, device, server and storage medium
CN109858024B (en) * 2019-01-04 2023-04-11 中山大学 Word2 vec-based room source word vector training method and device
CN109858031B (en) * 2019-02-14 2023-05-23 北京小米智能科技有限公司 Neural network model training and context prediction method and device
CN110276052B (en) * 2019-06-10 2021-02-12 北京科技大学 Ancient Chinese automatic word segmentation and part-of-speech tagging integrated method and device
CN110287236B (en) * 2019-06-25 2024-03-19 平安科技(深圳)有限公司 Data mining method, system and terminal equipment based on interview information
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN111506726B (en) * 2020-03-18 2023-09-22 大箴(杭州)科技有限公司 Short text clustering method and device based on part-of-speech coding and computer equipment
CN111859910B (en) * 2020-07-15 2022-03-18 山西大学 Word feature representation method for semantic role recognition and fusing position information
CN111832282B (en) * 2020-07-16 2023-04-14 平安科技(深圳)有限公司 External knowledge fused BERT model fine adjustment method and device and computer equipment
CN113010670B (en) * 2021-02-22 2023-09-19 腾讯科技(深圳)有限公司 Account information clustering method, detection method, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network

Also Published As

Publication number Publication date
CN107239444A (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN107239444B (en) Word vector training method and system fusing part of speech and position information
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN111160037B (en) Fine-grained emotion analysis method supporting cross-language migration
CN109325231B (en) Method for generating word vector by multitasking model
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN112001185A (en) Emotion classification method combining Chinese syntax and graph convolution neural network
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
Shilpa et al. Sentiment analysis using deep learning
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN109325112A (en) A kind of across language sentiment analysis method and apparatus based on emoji
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112836051B (en) Online self-learning court electronic file text classification method
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN111400584A (en) Association word recommendation method and device, computer equipment and storage medium
CN114756681B (en) Evaluation and education text fine granularity suggestion mining method based on multi-attention fusion
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
CN115017884B (en) Text parallel sentence pair extraction method based on graphic multi-mode gating enhancement
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
Jacob et al. Data Intelligence and Cognitive Informatics: Proceedings of ICDICI 2023
CN117056451A (en) New energy automobile complaint text aspect-viewpoint pair extraction method based on context enhancement
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant