CN113627176A - Method for calculating Chinese word vector by using principal component analysis - Google Patents
- Publication number: CN113627176A (application number CN202110942291.6A)
- Authority
- CN
- China
- Prior art keywords
- vector
- chinese
- word
- words
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
The invention relates to a method for calculating Chinese word vectors by using principal component analysis, and belongs to the field of language processing. The method selects representative words in Chinese as the basis of principal component analysis; represents each Chinese character by a vector of numerical values; combines the character lattice vectors in a Chinese word into a composite vector of the word itself, converting the word into numerical vector form; calculates the average composite vector of all words of the reference vocabulary; subtracts the average composite vector from the composite vector of each word in the reference vocabulary and multiplies the centered vectors to obtain a covariance matrix of the differences between words; obtains the eigenvalues and eigenvectors of the covariance matrix; calculates from them a matrix for transforming word composite vectors; and, for the composite vector of any Chinese word, subtracts the average composite vector and multiplies the difference by the projection matrix to obtain the word vector of the word. The method is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, and has important application value in Chinese natural language processing.
Description
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating word vectors of Chinese words by using principal component analysis, in particular by using the Chinese character dot matrix together with principal component analysis.
Background
Natural language processing is the technique of processing human language with a computer. Since computers excel at numerical computation, natural language must first be converted into numerical form before it can be processed. This conversion is called vectorization of characters, words, and sentences: each character, word, or sentence is represented by a set of numbers.
Common word-vectorization technologies include the one-hot technique and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is fixed in advance, for example 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector); if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
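The one-hot scheme above can be sketched as follows; the 10000-word vocabulary size follows the example in the text, and the word index is illustrative:

```python
import numpy as np

def one_hot(index: int, vocab_size: int = 10000) -> np.ndarray:
    """One-hot vector for the word at 0-based position `index` in the vocabulary."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

vec = one_hot(41)   # the 42nd word of a 10000-word vocabulary
```

Every vector produced this way has exactly one nonzero component, which is why the representation is so redundant.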
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence serves as the center word, the n words before and after it serve as context words, and the average of the one-hot vectors of these context words is fed into a neural network for training, with the one-hot representation of the center word as the target output. After the neural network converges, the weights of the connections between the i-th output node and the hidden-layer nodes form the word vector of the i-th word.
Both the one-hot representation and the continuous bag-of-words representation require the vocabulary size to be fixed in advance; if the vocabulary changes, the word vector of every word must be recalculated. Furthermore, when the vocabulary is large, training the neural network consumes a great deal of computing power and time. In Chinese natural language processing, an alternative approach represents a word as a matrix synthesized from the dot matrices of the Chinese characters in the word, applies an orthogonal transform, and discards part of the transform coefficients to obtain the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
If a simple word-vector calculation method were available that fully exploits the characteristics of the natural language and overcomes these shortcomings, the range of applications of natural language processing could be expanded without being affected by the addition of new words. The present invention was developed in response to this practical need.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is how to provide a method for calculating Chinese word vectors by using principal component analysis, so as to solve the problem that the common word vectorization technology needs to consume a large amount of computing power and time.
(II) technical scheme
In order to solve the above technical problem, the present invention provides a method for calculating Chinese word vectors by using principal component analysis, characterized in that the method comprises the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
Further, the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of only 1 Chinese character as well as words consisting of multiple Chinese characters.
Further, the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in word W_k, the lattice size being d×d and the lattice elements taking the values 1 and 0; and arranging the elements of each character lattice, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d = 16 or d = 24.
Further, the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, the weight w_i of each Chinese character C_ki being calculated as follows:
Further, the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the resulting difference vectors forming a matrix A with M rows and D columns;
according to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
Further, the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix, and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, the step S7 specifically includes: selecting the smallest number L such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P with D rows and L columns.
Further, the step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to the step (2) and the step (3), and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D); according to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns and is the word vector of the Chinese word W_j.
(III) advantageous effects
The invention provides a method for calculating Chinese word vectors by using principal component analysis which fully utilizes the characteristics of Chinese characters, is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, makes the dimension of the word vectors easy to determine, and has important application value in Chinese natural language processing.
Drawings
FIG. 1 is a flowchart of a method for computing Chinese word vectors using principal component analysis according to the present invention.
Detailed Description
In order to make the objects, contents and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention discloses a method for calculating Chinese word vectors by using principal component analysis, which comprises the following steps: (1) selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis; (2) acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer; (3) calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form; (4) calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words; (5) calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words; (6) calculating the eigenvalues and eigenvectors of the covariance matrix; (7) calculating the projection matrix of the Chinese word composite vectors: the matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors; (8) calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
A Chinese word is first expressed as the composite vector of the characters in the word; these composite vectors form a vector space whose basis vectors are calculated; the composite vector of a word is projected onto this basis, and the projection coordinates are taken as the word vector of the Chinese word. The calculation is simple, the common "unknown word" problem in Chinese word vectorization is avoided, and the method has important application value in Chinese natural language processing.
The purpose of the invention is: a method for calculating Chinese word vectors by using principal component analysis is provided, which meets the requirement of natural language processing for calculating word vectors.
In order to achieve the above object, the present invention provides a method for calculating a Chinese word vector by using principal component analysis, the method comprising:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis.
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer.
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form.
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words.
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
S6, calculating the eigenvalues and eigenvectors of the covariance matrix.
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix.
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
FIG. 1 is a flow chart of a method for computing Chinese word vectors using principal component analysis in accordance with the present invention. As shown in fig. 1, the method includes:
and S1, selecting a reference Chinese vocabulary. And selecting representative words in Chinese as the benchmark of pivot analysis.
In specific implementation, M (not less than 10000) Chinese words W are selectedkK-1, 2, …, M, including words with only 1 chinese character (the characters in GB 2312 may be selected), and words consisting of multiple chinese characters(common chinese words published by the relevant departments may be selected).
S2, acquiring the lattice vector of each Chinese character in the Chinese words. Each character is represented by a vector of numerical values, which facilitates further processing by a computer.
In specific implementation, the lattice MC_ki of each Chinese character C_ki in word W_k is obtained; the lattice size is d×d, with d = 16 or d = 24, and the elements of the lattice take the values 1 and 0. The elements of each character lattice are arranged, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
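The arrangement described in S2 can be sketched as below; the lattice here is random stand-in data, since reading an actual 16×16 dot-matrix font file is outside the scope of the text:

```python
import numpy as np

d = 16                                  # lattice size d x d; the patent uses d = 16 or d = 24
D = d * d                               # vector length D = d*d = 256

# Stand-in binary lattice for one character (1 = ink, 0 = blank);
# a real system would read it from a dot-matrix font.
lattice = np.random.default_rng(0).integers(0, 2, size=(d, d))

# S2: arrange the lattice elements row by row into a 1-row, D-column vector.
mc = lattice.reshape(1, D)
```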
S3, calculating the composite vector of each Chinese word. The character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form.
In specific implementation, for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
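The weighted sum of S3 can be sketched as follows. The text does not reproduce the formula for the weights w_i, so uniform weights w_i = 1/n are assumed here purely for illustration:

```python
import numpy as np

def composite_vector(char_vectors, weights=None):
    """Weighted sum MW_k = w_1*MC_k1 + ... + w_n*MC_kn of a word's character
    lattice vectors. Uniform weights 1/n are an assumption, not the patent's formula."""
    char_vectors = np.asarray(char_vectors, dtype=float)   # n x D
    n = char_vectors.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)
    return np.asarray(weights) @ char_vectors              # length-D composite vector

# Two characters with D = 256, stand-in binary lattice vectors.
mc = np.random.default_rng(1).integers(0, 2, size=(2, 256))
mw = composite_vector(mc)
```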
and S4, calculating the average synthetic vector of the reference words. An average composite vector of all words of the reference vocabulary is calculated.
In specific implementation, the calculation method of the average synthesis vector MW of the M reference words is as follows: MW ═ MW (MW)1+MW2+…+MWM)/M。
S5, calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
The average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, and the resulting difference vectors form a matrix A with M rows and D columns.
According to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
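Steps S4 and S5 amount to centering the composite vectors and forming X = A^T × A; a minimal sketch with random stand-in composite vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D = 100, 256                                          # M reference words, D = d*d
MW_all = rng.integers(0, 2, size=(M, D)).astype(float)   # stand-in composite vectors

MW = MW_all.mean(axis=0)        # S4: average composite vector
A = MW_all - MW                 # centered vectors, M rows x D columns
X = A.T @ A                     # S5: covariance matrix, D rows x D columns
```

X is symmetric by construction, which is what makes the eigendecomposition of the next step well behaved.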
S6, calculating the eigenvalues and eigenvectors of the covariance matrix.
In specific implementation, the eigenvalues and eigenvectors of the covariance matrix can be calculated by the Jacobi method or other methods, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
S7, calculating the projection matrix of the Chinese word composite vectors. The matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors of the covariance matrix.
In specific implementation, the smallest number L is selected such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
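Steps S6 and S7 can be sketched with NumPy's symmetric eigensolver standing in for the Jacobi method mentioned above; the 0.99 threshold follows the text, and the matrix is stand-in data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 64))
X = A.T @ A                                   # symmetric D x D matrix, D = 64

# S6: eigenvalues/eigenvectors; eigh returns ascending order, so flip to descending.
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# S7: smallest L whose leading eigenvalues cover at least 99% of the total.
ratio = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(ratio, 0.99) + 1)
P = eigvecs[:, :L]                            # projection matrix, D rows x L columns
```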
S8, calculating the word vector of a Chinese word. For the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
In specific implementation, for any Chinese word W_j, its composite vector MW_j = (a_j1, a_j2, …, a_jD) is calculated according to steps S2 and S3, and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D). According to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
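Putting steps S4 through S8 together, a word vector is the centered composite vector projected by P; a self-contained sketch with stand-in composite vectors in place of real character lattice data:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D = 300, 256                                       # M reference words, D = d*d
ref = rng.integers(0, 2, size=(M, D)).astype(float)   # stand-in composite vectors

MW = ref.mean(axis=0)                                 # S4: average composite vector
A = ref - MW                                          # S5: centered matrix, M x D
X = A.T @ A                                           # covariance matrix, D x D
vals, vecs = np.linalg.eigh(X)                        # S6: eigendecomposition
order = np.argsort(vals)[::-1]                        # descending eigenvalue order
vals, vecs = vals[order], vecs[:, order]
L = int(np.searchsorted(np.cumsum(vals) / vals.sum(), 0.99) + 1)   # S7
P = vecs[:, :L]                                       # projection matrix, D x L

MW_j = rng.integers(0, 2, size=D).astype(float)       # composite vector of any word W_j
Z = (MW_j - MW) @ P                                   # S8: word vector of W_j, length L
```

Because P depends only on the reference vocabulary, a word never seen before still receives a word vector, which is how the "unknown word" problem is avoided.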
The invention provides a method for calculating Chinese word vectors by using principal component analysis, which comprises the following steps:
(1) Selecting a reference Chinese vocabulary. Representative words in Chinese are selected as the basis of principal component analysis.
(2) Acquiring the dot-matrix vector of each Chinese character in the Chinese words. Each character is represented by a vector of numerical values, which facilitates further processing by a computer.
(3) Calculating the composite vector of each Chinese word. The character lattice vectors in a word are combined into the composite vector of the word itself, so the word is likewise converted into numerical vector form.
(4) Calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words.
(5) Calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words.
(6) Calculating the eigenvalues and eigenvectors of the covariance matrix.
(7) Calculating the projection matrix of the Chinese word composite vectors. The matrix that transforms word composite vectors is computed from the eigenvalues and eigenvectors.
(8) Calculating the word vector of a Chinese word. For the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
Further, in the step (1), M (not less than 10000) Chinese words W_k, k = 1, 2, …, M, are selected, including words of only 1 Chinese character in GB 2312 as well as words consisting of multiple Chinese characters drawn from the common Chinese words published by the relevant departments.
Further, in the step (2), the lattice MC_ki of each Chinese character C_ki in word W_k is obtained; the lattice size is d×d, with d = 16 or d = 24, and the elements of the lattice take the values 1 and 0. The elements of each character lattice are arranged, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, in the step (3), for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
Further, in the step (4), the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, in the step (5), the average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, and the resulting difference vectors form a matrix A with M rows and D columns.
According to the rules of matrix arithmetic, X = A^T × A is calculated, where A^T denotes the transpose of A and is a matrix with D rows and M columns; the covariance matrix X is a matrix with D rows and D columns.
Further, in the step (6), after the eigenvalues and eigenvectors of the covariance matrix are calculated, the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, in the step (7), the smallest number L is selected such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P with D rows and L columns.
Further, in the step (8), for any Chinese word W_j, its composite vector MW_j = (a_j1, a_j2, …, a_jD) is calculated according to the step (2) and the step (3), and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D). According to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P is calculated; Z is a vector with 1 row and L columns, and Z is the word vector of the Chinese word W_j.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A method for calculating Chinese word vectors by using principal component analysis is characterized by comprising the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, acquiring the dot-matrix vector of each Chinese character in the Chinese words: each character is represented by a vector of numerical values, which facilitates further processing by a computer;
S3, calculating the composite vector of each Chinese word: the character lattice vectors in a word are combined into the composite vector of the word itself, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all its words;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each word, and the centered vectors are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix of the Chinese word composite vectors from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: for the composite vector of any Chinese word, the average composite vector is subtracted and the difference is multiplied by the projection matrix to obtain the word vector of the word.
2. The method of claim 1, wherein the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of only 1 Chinese character as well as words consisting of multiple Chinese characters.
3. The method for calculating Chinese word vectors using principal component analysis as claimed in claim 1 or 2, wherein the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in word W_k, the lattice size being d×d and the lattice elements taking the values 1 and 0; and arranging the elements of each character lattice, row by row or column by column, into a vector (a_1, a_2, …, a_D) with 1 row and D columns, D = d×d, where a_i = 1 or a_i = 0, i = 1, 2, …, D.
4. The method of claim 3, wherein d = 16 or d = 24.
5. The method of claim 3, wherein the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, the weight w_i of each Chinese character C_ki being calculated as follows:
6. The method of claim 4, wherein the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
7. The method of claim 5, wherein the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words, the resulting difference vectors forming a matrix A with M rows and D columns;
and calculating X = A^T × A according to the rules of matrix arithmetic, where A^T denotes the transpose of A and is a matrix with D rows and M columns, the covariance matrix X being a matrix with D rows and D columns.
8. The method of claim 6, wherein the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix, and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
9. The method of claim 7, wherein the step S7 specifically includes: selecting the smallest number L such that (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P with D rows and L columns.
10. The method of claim 8, wherein the step S8 specifically includes: for any Chinese word W_j, calculating its composite vector MW_j = (a_j1, a_j2, …, a_jD) according to the step (2) and the step (3), and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1−a_1, a_j2−a_2, …, a_jD−a_D); and calculating, according to the rules of matrix arithmetic, the product Z = Y × P of the vector Y and the projection matrix P, Z being a vector with 1 row and L columns and the word vector of the Chinese word W_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110942291.6A CN113627176B (en) | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627176A (en) | 2021-11-09
CN113627176B (en) | 2024-04-19
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786966A (en) * | 2004-12-09 | 2006-06-14 | 索尼英国有限公司 | Information treatment |
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | 北京百问百答网络技术有限公司 | Text similarity, acceptation similarity calculating method and system and application system |
US20080255839A1 (en) * | 2004-09-14 | 2008-10-16 | Zentian Limited | Speech Recognition Circuit and Method |
US20110010319A1 (en) * | 2007-09-14 | 2011-01-13 | The University Of Tokyo | Correspondence learning apparatus and method and correspondence learning program, annotation apparatus and method and annotation program, and retrieval apparatus and method and retrieval program |
CN102135820A (en) * | 2011-01-18 | 2011-07-27 | 浙江大学 | Planarization pre-processing method |
JP2011164126A (en) * | 2010-02-04 | 2011-08-25 | Nippon Telegr & Teleph Corp <Ntt> | Noise suppression filter calculation method, and device and program therefor |
CN104598441A (en) * | 2014-12-25 | 2015-05-06 | 上海科阅信息技术有限公司 | Method for splitting Chinese sentences through computer |
CN107194408A (en) * | 2017-06-21 | 2017-09-22 | 安徽大学 | A kind of method for tracking target of the sparse coordination model of mixed block |
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
CN108154167A (en) * | 2017-12-04 | 2018-06-12 | 昆明理工大学 | A kind of Chinese character pattern similarity calculating method |
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | 昆明理工大学 | A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm |
CN109992716A (en) * | 2019-03-29 | 2019-07-09 | 电子科技大学 | A kind of similar news recommended method of Indonesian based on ITQ algorithm |
CN110059191A (en) * | 2019-05-07 | 2019-07-26 | 山东师范大学 | A kind of text sentiment classification method and device |
CN110196893A (en) * | 2019-05-05 | 2019-09-03 | 平安科技(深圳)有限公司 | Non-subjective question review method, device and storage medium based on text similarity |
CN112417153A (en) * | 2020-11-20 | 2021-02-26 | 虎博网络技术(上海)有限公司 | Text classification method and device, terminal equipment and readable storage medium |
Non-Patent Citations (5)
Title |
---|
YUANXIN LI et al.: "Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation", 2014 IEEE WORKSHOP ON STATISTICAL SIGNAL PROCESSING, page 384 * |
丁维: "Construction of a domain terminology network model based on expert knowledge and deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138-2560 * |
李照耀: "Research on language models for Tibetan continuous speech recognition", China Master's Theses Full-text Database, Information Science and Technology, no. 5, pages 136-180 * |
翟海超: "Research on Chinese text classification based on manifold learning methods", China Master's Theses Full-text Database, Information Science and Technology, no. 3, pages 138-2799 * |
赵彦斌; 李庆华: "A quantification method for Chinese character relatedness and its application in text similarity analysis", Journal of Computer Applications (计算机应用), vol. 26, no. 06, page 1398 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020174826A1 (en) | Answer generating device, answer learning device, answer generating method, and answer generating program | |
Bucak et al. | Incremental subspace learning via non-negative matrix factorization | |
CN111460807B (en) | Sequence labeling method, device, computer equipment and storage medium | |
CN107220220A (en) | Electronic equipment and method for text-processing | |
JP7139626B2 (en) | Phrase generation relationship estimation model learning device, phrase generation device, method, and program | |
KR101939209B1 (en) | Apparatus for classifying category of a text based on neural network, method thereof and computer recordable medium storing program to perform the method | |
Shah et al. | Image captioning using deep neural architectures | |
CN116095089B (en) | Remote sensing satellite data processing method and system | |
Ye et al. | MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction | |
Lin et al. | Intelligent decision support for new product development: a consumer-oriented approach | |
JP2017016384A (en) | Mixed coefficient parameter learning device, mixed occurrence probability calculation device, and programs thereof | |
WO2020170881A1 (en) | Question answering device, learning device, question answering method, and program | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
CN116561410A (en) | Course teaching resource recommendation method | |
WO2020040255A1 (en) | Word coding device, analysis device, language model learning device, method, and program | |
US20210089904A1 (en) | Learning method of neural network model for language generation and apparatus for performing the learning method | |
Poghosyan et al. | Short-term memory with read-only unit in neural image caption generator | |
CN113627176A (en) | Method for calculating Chinese word vector by using principal component analysis | |
CN113627176B (en) | Method for calculating Chinese word vector by principal component analysis | |
WO2019163752A1 (en) | Morpheme analysis learning device, morpheme analysis device, method, and program | |
CN111259106A (en) | Relation extraction method combining neural network and feature calculation | |
CN111177381A (en) | Slot filling and intention detection joint modeling method based on context vector feedback | |
CN113221551B (en) | Fine-grained sentiment analysis method based on sequence generation | |
CN113449517B (en) | Entity relationship extraction method based on BERT gated multi-window attention network model | |
CN112465929B (en) | Image generation method based on improved graph convolution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |