CN113627176A - Method for calculating Chinese word vector by using principal component analysis

Method for calculating Chinese word vector by using principal component analysis

Info

Publication number
CN113627176A
CN113627176A (application CN202110942291.6A; granted publication CN113627176B)
Authority
CN
China
Prior art keywords
vector
chinese
word
words
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110942291.6A
Other languages
Chinese (zh)
Other versions
CN113627176B (en)
Inventor
蒋遂平
袁晓光
刘轩
王璐静
臧小滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aerospace Aiwei Electronic Technology Ltd
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Aerospace Aiwei Electronic Technology Ltd
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aerospace Aiwei Electronic Technology Ltd and Beijing Institute of Computer Technology and Applications
Priority to CN202110942291.6A
Priority claimed from CN202110942291.6A
Publication of CN113627176A
Application granted
Publication of CN113627176B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Abstract

The invention relates to a method for calculating Chinese word vectors by using principal component analysis, belonging to the field of language processing. The method selects representative Chinese words as the basis for principal component analysis; represents each Chinese character by a vector of numerical values; combines the character lattice vectors in a Chinese word into a composite vector of the word itself, so that the word is converted into numerical vector form; calculates the average composite vector over all words of the reference vocabulary; subtracts the average composite vector from the composite vector of each word in the reference vocabulary and multiplies the transpose of the resulting difference matrix by the difference matrix to obtain a covariance matrix of the differences between words; obtains the eigenvalues and eigenvectors of the covariance matrix; calculates from them a projection matrix that transforms the composite vector of a word; and, for the composite vector of any Chinese word, subtracts the average composite vector and multiplies the difference by the projection matrix to obtain the word vector of the word. The method is computationally simple, avoids the common "unknown word" problem in the vectorization of Chinese words, and has important application value in Chinese natural language processing.

Description

Method for calculating Chinese word vector by using principal component analysis
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating word vectors of Chinese words by using the Chinese character dot matrix and principal component analysis.
Background
Natural language processing is a set of techniques for processing human language with a computer. Since computers are good at numerical computation, natural language must first be converted into numerical form before it can be processed. This conversion is called vectorization of characters, words and sentences: a character, a word or a sentence is each represented by a collection of numbers.
Common word vectorization techniques include one-hot encoding and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is fixed in advance, say 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector): if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
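As a concrete illustration, here is a minimal Python sketch of one-hot encoding over a five-word vocabulary; the vocabulary and its size are arbitrary examples, not taken from the patent:

```python
import numpy as np

vocab = ["中国", "计算", "向量", "语言", "分析"]  # illustrative 5-word vocabulary

def one_hot(word: str) -> np.ndarray:
    """Return the one-hot vector of a word: the i-th component is 1, all others 0."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("向量"))  # [0. 0. 1. 0. 0.]
```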
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence is taken as the center word, the n words before and after it are taken as associated words, the average of the one-hot vectors of the n associated words is fed into a neural network for training, and the desired output of the network is the one-hot representation of the center word. After training converges, the weights of the connections between the i-th output node of the network and the hidden-layer nodes form the word vector of the i-th word.
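A CBOW model of this kind can be trained with, for example, the gensim library; the patent itself describes a plain neural network and does not reference gensim, and the toy corpus below is purely illustrative:

```python
from gensim.models import Word2Vec

# Tiny segmented corpus; a real system would use a large word-segmented Chinese corpus.
corpus = [["自然", "语言", "处理"], ["计算", "中文", "词", "向量"]]
model = Word2Vec(sentences=corpus, vector_size=8, window=2, sg=0, min_count=1)  # sg=0 selects CBOW
vec = model.wv["向量"]  # the learned word vector of one word
```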
Both the one-hot representation and the continuous bag-of-words representation require the vocabulary size to be fixed in advance, and if the vocabulary changes, the word vector of every word must be recalculated. Furthermore, when the vocabulary is large, training the neural network consumes a great deal of computing power and time. This is especially true for the natural language processing of Chinese.
One existing approach represents a Chinese word as a matrix synthesized from the dot matrices of the Chinese characters in the word, applies an orthogonal transform, and keeps some of the transform coefficients as the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
A simple word vector calculation method that fully exploits the characteristics of the language and overcomes these defects would expand the range of natural language processing applications without being affected by the addition of new words. The present invention was developed in response to this practical need.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is how to provide a method for calculating Chinese word vectors by using principal component analysis, so as to avoid the large amounts of computing power and time consumed by common word vectorization techniques.
(II) technical scheme
In order to solve the above technical problem, the present invention provides a method for calculating a Chinese word vector by using principal component analysis, characterized in that the method comprises the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis for principal component analysis;
S2, acquiring the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a Chinese word are combined into the composite vector of the word itself, so that the word is also converted into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word in the reference vocabulary, and the transpose of the resulting difference matrix is multiplied by the difference matrix to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, thereby obtaining the characteristics of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms the composite vector of a word is computed from the covariance matrix characteristics;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the difference is multiplied by the projection matrix to obtain the word vector of the word.
Further, step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words with only one Chinese character as well as words consisting of multiple Chinese characters.
Further, step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki of word W_k, where the lattice size is d×d and the lattice elements take the values 1 and 0; and arranging the elements of each character lattice, in row or column order, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d = 16 or d = 24.
Further, step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of the Chinese characters in the word, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, where the weight w_i of each Chinese character C_ki is calculated as:
[weight formula shown only as an image in the original publication]
Further, step S4 specifically includes: the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the differences forming a matrix A with M rows and D columns;
[matrix A shown only as an image in the original publication]
and computing X = A^T × A according to the rules of matrix arithmetic, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
Further, step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, step S7 specifically includes: selecting the minimum number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
Further, step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to steps S2 and S3 and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1 − a_1, a_j2 − a_2, …, a_jD − a_D); then computing the product Z = Y × P of the vector Y and the projection matrix P according to the rules of matrix arithmetic; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
(III) advantageous effects
The invention provides a method for calculating Chinese word vectors by using principal component analysis. The method makes full use of the characteristics of Chinese characters, is computationally simple, avoids the common "unknown word" problem in the vectorization of Chinese words, makes the dimension of the word vectors easy to determine, and has important application value in Chinese natural language processing.
Drawings
FIG. 1 is a flowchart of a method for computing Chinese word vectors using principal component analysis according to the present invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention discloses a method for calculating Chinese word vectors by using principal component analysis, the method comprising: (1) selecting a reference Chinese vocabulary, i.e. representative Chinese words that serve as the basis for principal component analysis; (2) acquiring the lattice vector of each Chinese character in a word, so that each character is represented by a vector of numerical values convenient for further processing by a computer; (3) calculating the composite vector of each Chinese word by combining the character lattice vectors in the word, so that the word is also converted into numerical vector form; (4) calculating the average composite vector over all words of the reference vocabulary; (5) calculating the covariance matrix of the reference vocabulary by subtracting the average composite vector from the composite vector of each word and multiplying the transpose of the resulting difference matrix by the difference matrix; (6) calculating the eigenvalues and eigenvectors of the covariance matrix; (7) calculating from them a projection matrix for the composite vectors of Chinese words; and (8) calculating the word vector of any Chinese word by subtracting the average composite vector from its composite vector and multiplying the difference by the projection matrix. In other words, Chinese words are first expressed as composite vectors of the Chinese characters they contain, forming a vector space; the basis vectors of this space are calculated, the composite vector of a word is projected onto them, and the projection coordinates serve as the word vector of the Chinese word. The calculation is simple, the common "unknown word" problem in the vectorization of Chinese words is avoided, and the method has important application value in Chinese natural language processing.
The purpose of the invention is to provide a method for calculating Chinese word vectors by using principal component analysis that meets the need of natural language processing to compute word vectors.
FIG. 1 is a flowchart of the method for computing Chinese word vectors using principal component analysis according to the present invention. As shown in FIG. 1, the method includes:
and S1, selecting a reference Chinese vocabulary. And selecting representative words in Chinese as the benchmark of pivot analysis.
In specific implementation, M (not less than 10000) Chinese words W are selectedkK-1, 2, …, M, including words with only 1 chinese character (the characters in GB 2312 may be selected), and words consisting of multiple chinese characters(common chinese words published by the relevant departments may be selected).
S2, acquiring the lattice vector of each Chinese character in a word: each Chinese character is represented by a vector of numerical values, which facilitates further processing by a computer.
In a specific implementation, the lattice MC_ki of each Chinese character C_ki of word W_k is obtained; the lattice size is d×d, with d = 16 or d = 24, and the lattice elements take the values 1 and 0. The elements of each character lattice are arranged, in row or column order, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
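As a rough illustration of this step, the following Python sketch renders a character into a d×d binary lattice and flattens it row by row. It substitutes an ordinary TrueType CJK font for a true dot-matrix font such as HZK16, and the font path is an assumption, not something the patent specifies:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

D_SIDE = 16  # lattice side d; the patent uses d = 16 or d = 24

def char_lattice_vector(ch: str, font_path: str = "NotoSansCJK-Regular.ttc") -> np.ndarray:
    """Render one Chinese character as a d x d 0/1 lattice and flatten it
    in row order into a vector of length D = d * d."""
    img = Image.new("1", (D_SIDE, D_SIDE), 0)            # 1-bit image, all zeros
    font = ImageFont.truetype(font_path, D_SIDE)         # font_path is an assumption
    ImageDraw.Draw(img).text((0, 0), ch, fill=1, font=font)
    return np.asarray(img, dtype=np.uint8).reshape(-1)   # (a_1, ..., a_D), each 0 or 1
```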
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a Chinese word are combined into the composite vector of the word itself, so that the word is also converted into numerical vector form.
In a specific implementation, for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of the Chinese characters in the word: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as:
[weight formula shown only as an image in the original publication]
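Since the weight formula survives only as an image, the sketch below substitutes uniform weights w_i = 1/n; this is a placeholder assumption, not the patent's actual weighting:

```python
def word_composite_vector(word: str) -> np.ndarray:
    """Composite vector MW_k of a word as a weighted sum of character lattice
    vectors. Uniform weights 1/n stand in for the patent's image-only formula."""
    n = len(word)
    return sum((1.0 / n) * char_lattice_vector(ch) for ch in word)
```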
and S4, calculating the average synthetic vector of the reference words. An average composite vector of all words of the reference vocabulary is calculated.
In specific implementation, the calculation method of the average synthesis vector MW of the M reference words is as follows: MW ═ MW (MW)1+MW2+…+MWM)/M。
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word in the reference vocabulary, and the covariance matrix of the differences between words is obtained.
The average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, and the differences form a matrix A with M rows and D columns.
[matrix A shown only as an image in the original publication]
According to the rules of matrix arithmetic, X = A^T × A is computed. A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
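Steps S4 and S5 together in NumPy, as a sketch (the function name is my own, not the patent's):

```python
def mean_and_covariance(words: list[str]) -> tuple[np.ndarray, np.ndarray]:
    """Steps S4-S5: average composite vector MW and covariance matrix X = A^T x A."""
    MWs = np.stack([word_composite_vector(w) for w in words])  # M x D
    MW = MWs.mean(axis=0)                                      # average composite vector
    A = MWs - MW                                               # difference matrix, M x D
    X = A.T @ A                                                # covariance matrix, D x D
    return MW, X
```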
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, thereby obtaining the characteristics of the covariance matrix.
In a specific implementation, the eigenvalues and eigenvectors of the covariance matrix can be calculated by the Jacobi method or other methods, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
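The patent names the Jacobi method as one option; the sketch below uses NumPy's eigh instead, which is applicable because X is symmetric:

```python
def sorted_eigen(X: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Step S6: eigenvalues and eigenvectors of the symmetric covariance matrix,
    reordered so that lambda_1 >= lambda_2 >= ... >= lambda_D."""
    eigvals, eigvecs = np.linalg.eigh(X)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```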
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms the composite vector of a word is computed from the covariance matrix characteristics.
In a specific implementation, the minimum number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, and the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
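A sketch of this 99% criterion; the threshold parameter is exposed only for convenience, as the patent fixes it at 0.99:

```python
def projection_matrix(eigvals: np.ndarray, eigvecs: np.ndarray,
                      threshold: float = 0.99) -> np.ndarray:
    """Step S7: stack the eigenvectors of the L largest eigenvalues, where L is
    the smallest number whose eigenvalues cover at least `threshold` of the total."""
    ratio = np.cumsum(eigvals) / eigvals.sum()
    L = int(np.searchsorted(ratio, threshold)) + 1
    return eigvecs[:, :L]  # projection matrix P, D rows and L columns
```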
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the difference is multiplied by the projection matrix to obtain the word vector of the word.
In a specific implementation, for any Chinese word W_j, its composite vector MW_j = (a_j1, a_j2, …, a_jD) is calculated according to steps S2 and S3, and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1 − a_1, a_j2 − a_2, …, a_jD − a_D). According to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is computed; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
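Putting the sketches together on a toy reference vocabulary (four words instead of the patent's M ≥ 10000; the words themselves are arbitrary examples):

```python
reference_words = ["中国", "计算", "向量", "语言"]  # toy stand-in for the reference vocabulary
MW, X = mean_and_covariance(reference_words)
eigvals, eigvecs = sorted_eigen(X)
P = projection_matrix(eigvals, eigvecs)

def word_vector(word: str) -> np.ndarray:
    """Step S8: word vector Z = (MW_j - MW) x P, a 1 x L row vector."""
    Y = word_composite_vector(word) - MW
    return Y @ P

print(word_vector("分析"))  # also works for words outside the reference vocabulary
```

Because any Chinese word can be decomposed into characters whose lattices are always available, the pipeline never hits an unknown word.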
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for calculating Chinese word vectors by using principal component analysis is characterized by comprising the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis for principal component analysis;
S2, acquiring the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a Chinese word are combined into the composite vector of the word itself, so that the word is also converted into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word in the reference vocabulary, and the transpose of the resulting difference matrix is multiplied by the difference matrix to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, thereby obtaining the characteristics of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms the composite vector of a word is computed from the covariance matrix characteristics;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the difference is multiplied by the projection matrix to obtain the word vector of the word.
2. The method of claim 1, wherein step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words with only one Chinese character as well as words consisting of multiple Chinese characters.
3. The method for calculating Chinese word vectors using principal component analysis as claimed in claim 1 or 2, wherein step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki of word W_k, where the lattice size is d×d and the lattice elements take the values 1 and 0; and arranging the elements of each character lattice, in row or column order, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
4. The method of claim 3, wherein d = 16 or d = 24 is used for calculating the Chinese word vector.
5. The method of claim 3, wherein step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of the Chinese characters in the word, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, where the weight w_i of each Chinese character C_ki is calculated as:
[weight formula shown only as an image in the original publication]
6. The method of claim 4, wherein step S4 specifically includes: the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
7. The method of claim 5, wherein step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the differences forming a matrix A with M rows and D columns;
[matrix A shown only as an image in the original publication]
and computing X = A^T × A according to the rules of matrix arithmetic, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
8. The method of claim 6, wherein step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
9. The method of claim 7, wherein step S7 specifically includes: selecting the minimum number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
10. The method of claim 8, wherein step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to steps S2 and S3 and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1 − a_1, a_j2 − a_2, …, a_jD − a_D); and computing the product Z = Y × P of the vector Y and the projection matrix P according to the rules of matrix arithmetic, where Z is a vector with 1 row and L columns and Z is the word vector of the Chinese word W_j.
CN202110942291.6A 2021-08-17 Method for calculating Chinese word vector by principal component analysis Active CN113627176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110942291.6A CN113627176B (en) 2021-08-17 Method for calculating Chinese word vector by principal component analysis

Publications (2)

Publication Number Publication Date
CN113627176A (application publication) 2021-11-09
CN113627176B (granted publication) 2024-04-19

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080255839A1 (en) * 2004-09-14 2008-10-16 Zentian Limited Speech Recognition Circuit and Method
CN1786966A (en) * 2004-12-09 2006-06-14 索尼英国有限公司 Information treatment
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
US20110010319A1 (en) * 2007-09-14 2011-01-13 The University Of Tokyo Correspondence learning apparatus and method and correspondence learning program, annotation apparatus and method and annotation program, and retrieval apparatus and method and retrieval program
JP2011164126A (en) * 2010-02-04 2011-08-25 Nippon Telegr & Teleph Corp <Ntt> Noise suppression filter calculation method, and device and program therefor
CN102135820A (en) * 2011-01-18 2011-07-27 浙江大学 Planarization pre-processing method
CN104598441A (en) * 2014-12-25 2015-05-06 上海科阅信息技术有限公司 Method for splitting Chinese sentences through computer
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN107194408A (en) * 2017-06-21 2017-09-22 安徽大学 A kind of method for tracking target of the sparse coordination model of mixed block
CN108154167A (en) * 2017-12-04 2018-06-12 昆明理工大学 A kind of Chinese character pattern similarity calculating method
CN109582951A (en) * 2018-10-19 2019-04-05 昆明理工大学 A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm
CN109992716A (en) * 2019-03-29 2019-07-09 电子科技大学 A kind of similar news recommended method of Indonesian based on ITQ algorithm
CN110196893A (en) * 2019-05-05 2019-09-03 平安科技(深圳)有限公司 Non- subjective item method to go over files, device and storage medium based on text similarity
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Yuanxin Li et al., "Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation," 2014 IEEE Workshop on Statistical Signal Processing, p. 384.
丁维 (Ding Wei), "Construction of a domain term network model based on expert knowledge and deep learning," China Master's Theses Full-text Database, Information Science and Technology, no. 1, pp. 138-2560.
李照耀 (Li Zhaoyao), "Research on language models for Tibetan continuous speech recognition," China Master's Theses Full-text Database, Information Science and Technology, no. 5, pp. 136-180.
翟海超 (Zhai Haichao), "Research on Chinese text classification based on manifold learning methods," China Master's Theses Full-text Database, Information Science and Technology, no. 3, pp. 138-2799.
赵彦斌, 李庆华 (Zhao Yanbin, Li Qinghua), "A quantification method for Chinese character relatedness and its application in text similarity analysis," Journal of Computer Applications, vol. 26, no. 06, p. 1398.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant