CN113627176A - Method for calculating Chinese word vector by using principal component analysis - Google Patents
- Publication number: CN113627176A (application number CN202110942291.6A)
- Authority
- CN
- China
- Prior art keywords
- vector
- chinese
- word
- words
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
The invention relates to a method for calculating Chinese word vectors by using principal component analysis, and belongs to the field of language processing. The method selects representative words in Chinese as the basis of principal component analysis; represents each Chinese character by a vector of numerical values; combines the character lattice vectors in a Chinese word into a composite vector of the word itself, converting the word into numerical vector form; calculates the average composite vector of all words of the reference vocabulary; subtracts the average composite vector from the composite vector of each word in the reference vocabulary and multiplies the centered vectors to obtain a covariance matrix of the differences between words; obtains the eigenvalues and eigenvectors of the covariance matrix; calculates from them a matrix for transforming word composite vectors; and, for the composite vector of any Chinese word, subtracts the average composite vector and multiplies the difference by the projection matrix to obtain the word vector of the word. The method is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, and has important application value in Chinese natural language processing.
Description
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating word vectors of Chinese words by using principal component analysis, in particular by using the Chinese character dot matrix together with principal component analysis.
Background
Natural language processing is the technique of processing human language with a computer. Since computers excel at numerical computation, natural language must first be converted into numerical form before it can be processed. This conversion is called vectorization of characters, words, and sentences: each character, word, or sentence is represented by a set of numbers.
Common word-vectorization technologies include the one-hot technique and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is fixed in advance, for example 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector); if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
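The one-hot scheme above can be sketched as follows; the 10000-word vocabulary size follows the example in the text, and the word index is illustrative:

```python
import numpy as np

def one_hot(index: int, vocab_size: int = 10000) -> np.ndarray:
    """One-hot vector for the word at 0-based position `index` in the vocabulary."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

vec = one_hot(41)   # the 42nd word of a 10000-word vocabulary
```

Every vector produced this way has exactly one nonzero component, which is why the representation is so redundant.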
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence serves as the center word, the n words before and after it serve as context words, and the average of the one-hot vectors of these context words is fed into a neural network for training, with the one-hot representation of the center word as the target output. After the neural network converges, the weights of the connections between the i-th output node and the hidden-layer nodes form the word vector of the i-th word.
Both the one-hot representation and the continuous bag-of-words representation require the vocabulary size to be fixed in advance; if the vocabulary changes, the word vector of every word must be recalculated. Furthermore, when the vocabulary is large, training the neural network consumes a great deal of computing power and time. In Chinese natural language processing, an alternative approach represents a word as a matrix synthesized from the dot matrices of the Chinese characters in the word, applies an orthogonal transform, and discards part of the transform coefficients to obtain the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
If a simple word-vector calculation method were available that fully exploits the characteristics of the natural language and overcomes these shortcomings, the range of applications of natural language processing could be expanded without being affected by the addition of new words. The present invention was developed in response to this practical need.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is how to provide a method for calculating Chinese word vectors by using principal component analysis, so as to solve the problem that the common word vectorization technology needs to consume a large amount of computing power and time.
(II) technical scheme
In order to solve the above technical problem, the present invention provides a method for calculating Chinese word vectors by using principal component analysis, characterized in that the method comprises the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
Further, the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of only 1 Chinese character as well as words consisting of multiple Chinese characters.
Further, the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in word W_k, the lattice size being d×d and the lattice elements taking the values 1 and 0; and arranging the elements of each character lattice, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d = 16 or d = 24.
Further, the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, the weight w_i of each Chinese character C_ki being calculated as follows:
Further, the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the resulting difference vectors forming a matrix A with M rows and D columns;
according to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
Further, the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix, and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, the step S7 specifically includes: selecting the smallest number L such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P with D rows and L columns.
Further, the step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to the step (2) and the step (3), and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D); according to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns and is the word vector of the Chinese word W_j.
(III) advantageous effects
The invention provides a method for calculating Chinese word vectors by using principal component analysis which fully utilizes the characteristics of Chinese characters, is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, makes the dimension of the word vectors easy to determine, and has important application value in Chinese natural language processing.
Drawings
FIG. 1 is a flowchart of a method for computing Chinese word vectors using principal component analysis according to the present invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention discloses a method for calculating Chinese word vectors by using principal component analysis, which comprises the following steps: (1) selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis; (2) acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer; (3) calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form; (4) calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words; (5) calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words; (6) calculating the eigenvalues and eigenvectors of the covariance matrix; (7) calculating the projection matrix of the Chinese word composite vectors: the matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors; (8) calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
A Chinese word is first expressed as the composite vector of the characters in the word; these composite vectors form a vector space whose basis vectors are calculated; the composite vector of a word is projected onto this basis, and the projection coordinates are taken as the word vector of the Chinese word. The calculation is simple, the common "unknown word" problem in Chinese word vectorization is avoided, and the method has important application value in Chinese natural language processing.
The purpose of the invention is: a method for calculating Chinese word vectors by using principal component analysis is provided, which meets the requirement of natural language processing for calculating word vectors.
In order to achieve the above object, the present invention provides a method for calculating a Chinese word vector by using principal component analysis, the method comprising:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis.
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer.
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form.
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words.
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
S6, calculating the eigenvalues and eigenvectors of the covariance matrix.
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix.
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
FIG. 1 is a flow chart of a method for computing Chinese word vectors using principal component analysis in accordance with the present invention. As shown in fig. 1, the method includes:
and S1, selecting a reference Chinese vocabulary. And selecting representative words in Chinese as the benchmark of pivot analysis.
In specific implementation, M (not less than 10000) Chinese words W are selectedkK-1, 2, …, M, including words with only 1 chinese character (the characters in GB 2312 may be selected), and words consisting of multiple chinese characters(common chinese words published by the relevant departments may be selected).
S2, acquiring the lattice vector of each Chinese character in the Chinese words. Each character is represented by a vector of numerical values, which facilitates further processing by a computer.
In specific implementation, the lattice MC_ki of each Chinese character C_ki in word W_k is obtained; the lattice size is d×d, with d = 16 or d = 24, and the elements of the lattice take the values 1 and 0. The elements of each character lattice are arranged, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
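The arrangement described in S2 can be sketched as below; the lattice here is random stand-in data, since reading an actual 16×16 dot-matrix font file is outside the scope of the text:

```python
import numpy as np

d = 16                                  # lattice size d x d; the patent uses d = 16 or d = 24
D = d * d                               # vector length D = d*d = 256

# Stand-in binary lattice for one character (1 = ink, 0 = blank);
# a real system would read it from a dot-matrix font.
lattice = np.random.default_rng(0).integers(0, 2, size=(d, d))

# S2: arrange the lattice elements row by row into a 1-row, D-column vector.
mc = lattice.reshape(1, D)
```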
S3, calculating the composite vector of each Chinese word. The character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form.
In specific implementation, for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
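The weighted sum of S3 can be sketched as follows. The text does not reproduce the formula for the weights w_i, so uniform weights w_i = 1/n are assumed here purely for illustration:

```python
import numpy as np

def composite_vector(char_vectors, weights=None):
    """Weighted sum MW_k = w_1*MC_k1 + ... + w_n*MC_kn of a word's character
    lattice vectors. Uniform weights 1/n are an assumption, not the patent's formula."""
    char_vectors = np.asarray(char_vectors, dtype=float)   # n x D
    n = char_vectors.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)
    return np.asarray(weights) @ char_vectors              # length-D composite vector

# Two characters with D = 256, stand-in binary lattice vectors.
mc = np.random.default_rng(1).integers(0, 2, size=(2, 256))
mw = composite_vector(mc)
```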
and S4, calculating the average synthetic vector of the reference words. An average composite vector of all words of the reference vocabulary is calculated.
In specific implementation, the calculation method of the average synthesis vector MW of the M reference words is as follows: MW ═ MW (MW)1+MW2+…+MWM)/M。
S5, calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
The average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, and the resulting difference vectors form a matrix A with M rows and D columns.
According to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
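Steps S4 and S5 amount to centering the composite vectors and forming X = A^T × A; a minimal sketch with random stand-in composite vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D = 100, 256                                          # M reference words, D = d*d
MW_all = rng.integers(0, 2, size=(M, D)).astype(float)   # stand-in composite vectors

MW = MW_all.mean(axis=0)        # S4: average composite vector
A = MW_all - MW                 # centered vectors, M rows x D columns
X = A.T @ A                     # S5: covariance matrix, D rows x D columns
```

X is symmetric by construction, which is what makes the eigendecomposition of the next step well behaved.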
S6, calculating the eigenvalues and eigenvectors of the covariance matrix.
In specific implementation, the eigenvalues and eigenvectors of the covariance matrix can be calculated by the Jacobi method or other methods, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
S7, calculating the projection matrix of the Chinese word composite vectors. The matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors of the covariance matrix.
In specific implementation, the smallest number L is selected such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
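Steps S6 and S7 can be sketched with NumPy's symmetric eigensolver standing in for the Jacobi method mentioned above; the 0.99 threshold follows the text, and the matrix is stand-in data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 64))
X = A.T @ A                                   # symmetric D x D matrix, D = 64

# S6: eigenvalues/eigenvectors; eigh returns ascending order, so flip to descending.
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# S7: smallest L whose leading eigenvalues cover at least 99% of the total.
ratio = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(ratio, 0.99) + 1)
P = eigvecs[:, :L]                            # projection matrix, D rows x L columns
```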
S8, calculating the word vector of a Chinese word. For the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
In specific implementation, for any Chinese word W_j, its composite vector MW_j = (a_j1, a_j2, …, a_jD) is calculated according to steps S2 and S3, and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D). According to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
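Putting steps S4 through S8 together, a word vector is the centered composite vector projected by P; a self-contained sketch with stand-in composite vectors in place of real character lattice data:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D = 300, 256                                       # M reference words, D = d*d
ref = rng.integers(0, 2, size=(M, D)).astype(float)   # stand-in composite vectors

MW = ref.mean(axis=0)                                 # S4: average composite vector
A = ref - MW                                          # S5: centered matrix, M x D
X = A.T @ A                                           # covariance matrix, D x D
vals, vecs = np.linalg.eigh(X)                        # S6: eigendecomposition
order = np.argsort(vals)[::-1]                        # descending eigenvalue order
vals, vecs = vals[order], vecs[:, order]
L = int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.99) + 1)   # S7
P = vecs[:, :L]                                       # projection matrix, D x L

MW_j = rng.integers(0, 2, size=D).astype(float)       # composite vector of any word W_j
Z = (MW_j - MW) @ P                                   # S8: word vector of W_j, length L
```

Because P depends only on the reference vocabulary, a word never seen before still receives a word vector, which is how the "unknown word" problem is avoided.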
The invention provides a method for calculating Chinese word vectors by using principal component analysis, which comprises the following steps:
(1) Selecting a reference Chinese vocabulary. Representative words in Chinese are selected as the basis of principal component analysis.
(2) Acquiring the dot-matrix vector of each Chinese character in the Chinese words. Each character is represented by a vector of numerical values, which facilitates further processing by a computer.
(3) Calculating the composite vector of each Chinese word. The character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form.
(4) Calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words.
(5) Calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
(6) Calculating the eigenvalues and eigenvectors of the covariance matrix.
(7) Calculating the projection matrix of the Chinese word composite vectors. The matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors.
(8) Calculating the word vector of a Chinese word. For the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
Further, in the step (1), M (not less than 10000) Chinese words W_k, k = 1, 2, …, M, are selected, including words of only 1 Chinese character in GB 2312 as well as words consisting of multiple Chinese characters drawn from the common Chinese words published by the relevant departments.
Further, in the step (2), the lattice MC_ki of each Chinese character C_ki in word W_k is obtained; the lattice size is d×d, with d = 16 or d = 24, and the elements of the lattice take the values 1 and 0. The elements of each character lattice are arranged, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, in the step (3), for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
Further, in the step (4), the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, in the step (5), the average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, and the resulting difference vectors form a matrix A with M rows and D columns.
According to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
Further, in the step (6), after the eigenvalues and eigenvectors of the covariance matrix are calculated, the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, in the step (7), the smallest number L is selected such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
Further, in the step (8), for any Chinese word W_j, its composite vector MW_j = (a_j1, a_j2, …, a_jD) is calculated according to the step (2) and the step (3), and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D). According to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A method for calculating Chinese word vectors by using principal component analysis is characterized by comprising the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
2. The method of claim 1, wherein the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of only 1 Chinese character as well as words consisting of multiple Chinese characters.
3. The method for calculating Chinese word vectors using principal component analysis as claimed in claim 1 or 2, wherein the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in word W_k, the lattice size being d×d and the lattice elements taking the values 1 and 0; and arranging the elements of each character lattice, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
4. The method of claim 3, wherein d = 16 or d = 24.
5. The method of claim 3, wherein the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, the weight w_i of each Chinese character C_ki being calculated as follows:
6. The method of claim 4, wherein the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
7. The method of claim 5, wherein the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the resulting difference vectors forming a matrix A with M rows and D columns;
and calculating X = A^T × A according to the rules of matrix arithmetic, where A^T denotes the transpose of A and is a matrix with D rows and M columns, the covariance matrix X being a matrix with D rows and D columns.
8. The method of claim 6, wherein the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix, and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
9. The method of claim 7, wherein the step S7 specifically includes: selecting the smallest number L such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P with D rows and L columns.
10. The method of claim 8, wherein the step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to the step (2) and the step (3), and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D); and calculating, according to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P, Z being a vector with 1 row and L columns and the word vector of the Chinese word W_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110942291.6A CN113627176B (en) | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627176A (en) | 2021-11-09
CN113627176B (en) | 2024-04-19
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786966A (en) * | 2004-12-09 | 2006-06-14 | 索尼英国有限公司 | Information treatment |
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
US20080255839A1 (en) * | 2004-09-14 | 2008-10-16 | Zentian Limited | Speech Recognition Circuit and Method |
US20110010319A1 (en) * | 2007-09-14 | 2011-01-13 | The University Of Tokyo | Correspondence learning apparatus and method and correspondence learning program, annotation apparatus and method and annotation program, and retrieval apparatus and method and retrieval program |
CN102135820A (en) * | 2011-01-18 | 2011-07-27 | 浙江大学 | Planarization pre-processing method |
JP2011164126A (en) * | 2010-02-04 | 2011-08-25 | Nippon Telegr & Teleph Corp <Ntt> | Noise suppression filter calculation method, and device and program therefor |
CN104598441A (en) * | 2014-12-25 | 2015-05-06 | 上海科阅信息技术有限公司 | Method for splitting Chinese sentences through computer |
CN107194408A (en) * | 2017-06-21 | 2017-09-22 | 安徽大学 | A kind of method for tracking target of the sparse coordination model of mixed block |
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN108154167A (en) * | 2017-12-04 | 2018-06-12 | 昆明理工大学 | A kind of Chinese character pattern similarity calculating method |
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | 昆明理工大学 | A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm |
CN109992716A (en) * | 2019-03-29 | 2019-07-09 | 电子科技大学 | A kind of similar news recommended method of Indonesian based on ITQ algorithm |
CN110059191A (en) * | 2019-05-07 | 2019-07-26 | 山东师范大学 | A kind of text sentiment classification method and device |
CN110196893A (en) * | 2019-05-05 | 2019-09-03 | 平安科技(深圳)有限公司 | Non-subjective question review method, device and storage medium based on text similarity |
CN112417153A (en) * | 2020-11-20 | 2021-02-26 | 虎博网络技术(上海)有限公司 | Text classification method and device, terminal equipment and readable storage medium |
Non-Patent Citations (5)
Title |
---|
YUANXIN LI et al.: "Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation", 2014 IEEE WORKSHOP ON STATISTICAL SIGNAL PROCESSING, page 384 * |
丁维: "Construction of a domain terminology network model based on expert knowledge and deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138-2560 * |
李照耀: "Research on language models for Tibetan continuous speech recognition", China Master's Theses Full-text Database, Information Science and Technology, no. 5, pages 136-180 * |
翟海超: "Research on Chinese text classification based on manifold learning methods", China Master's Theses Full-text Database, Information Science and Technology, no. 3, pages 138-2799 * |
赵彦斌; 李庆华: "A quantification method for Chinese character relatedness and its application in text similarity analysis", Journal of Computer Applications (计算机应用), vol. 26, no. 06, page 1398 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020174826A1 (en) | Answer generating device, answer learning device, answer generating method, and answer generating program | |
Bucak et al. | Incremental subspace learning via non-negative matrix factorization | |
CN111460807B (en) | Sequence labeling method, device, computer equipment and storage medium | |
CN107220220A (en) | Electronic equipment and method for text-processing | |
JP7139626B2 (en) | Phrase generation relationship estimation model learning device, phrase generation device, method, and program | |
KR101939209B1 (en) | Apparatus for classifying category of a text based on neural network, method thereof and computer recordable medium storing program to perform the method | |
Shah et al. | Image captioning using deep neural architectures | |
CN116095089B (en) | Remote sensing satellite data processing method and system | |
Ye et al. | MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction | |
Lin et al. | Intelligent decision support for new product development: a consumer-oriented approach | |
JP2017016384A (en) | Mixed coefficient parameter learning device, mixed occurrence probability calculation device, and programs thereof | |
WO2020170881A1 (en) | Question answering device, learning device, question answering method, and program | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
CN116561410A (en) | Course teaching resource recommendation method | |
WO2020040255A1 (en) | Word coding device, analysis device, language model learning device, method, and program | |
US20210089904A1 (en) | Learning method of neural network model for language generation and apparatus for performing the learning method | |
Poghosyan et al. | Short-term memory with read-only unit in neural image caption generator | |
CN113627176A (en) | Method for calculating Chinese word vector by using principal component analysis | |
CN113627176B (en) | Method for calculating Chinese word vector by principal component analysis | |
WO2019163752A1 (en) | Morpheme analysis learning device, morpheme analysis device, method, and program | |
CN111259106A (en) | Relation extraction method combining neural network and feature calculation | |
CN111177381A (en) | Slot filling and intention detection joint modeling method based on context vector feedback | |
CN113221551B (en) | Fine-grained sentiment analysis method based on sequence generation | |
CN113449517B (en) | Entity relationship extraction method based on BERT gated multi-window attention network model | |
CN112465929B (en) | Image generation method based on improved graph convolution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |