CN113627176B - Method for calculating Chinese word vector by principal component analysis - Google Patents
Method for calculating Chinese word vector by principal component analysis
- Publication number: CN113627176B
- Application number: CN202110942291.6A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates (G06F: electric digital data processing; G06F40/20: natural language analysis)
- G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (under G06F17/10: complex mathematical operations)
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a method for calculating Chinese word vectors by principal component analysis, belonging to the field of language processing. Representative words in Chinese are selected as the basis of principal component analysis; each Chinese character is represented by a vector of numerical values; the lattice vectors of the Chinese characters in a word are combined into the word's composite vector, converting the word into numeric vector form; the average composite vector over all words of the reference vocabulary is calculated; the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain a covariance matrix of the differences between words; the eigenvalues and eigenvectors of the covariance matrix are computed, and from them the matrix that transforms a word's composite vector is calculated; finally, for any Chinese word, the average composite vector is subtracted from its composite vector and the result is multiplied by the projection matrix to obtain the word's vector. The method is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, and has important application value in Chinese natural language processing.
Description
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating a Chinese word vector by utilizing principal component analysis, in particular to a method for calculating a word vector of a Chinese word by utilizing a Chinese character lattice and principal component analysis.
Background
Natural language processing is a technology for processing human language with a computer. Since computers excel at numerical computation, natural language must first be converted into numerical form before it can be processed. This conversion is called vectorization of characters, words, and sentences: a character, a word, or a sentence is each represented by a set of numbers.
Common word vectorization techniques are the one-hot technique and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is determined in advance, for example 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector); if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
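As a concrete illustration of the one-hot scheme described above, the following sketch builds one-hot vectors for a toy vocabulary; the vocabulary size of 5 and the chosen index are illustrative stand-ins for the 10000-word example in the text.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return the one-hot vector for the word at position `index` (0-based)."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Toy vocabulary of 5 words instead of 10000 (illustrative only):
# the third word maps to a vector with a single 1 at position 2.
vec = one_hot(2, 5)
```

Every word vector has exactly one nonzero component, which is why the representation is called highly redundant: almost all of the 10000 numbers carry no information.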
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence is taken as the center word, and the n words before and after it as context words; the average of the one-hot vectors of these context words is fed into a neural network for training, with the one-hot representation of the center word as the target output. When the neural network converges, the weights connecting the hidden-layer nodes to the i-th output node form the word vector of the i-th word.
Both the one-hot and continuous bag-of-words representations require the vocabulary size to be fixed in advance, and if the vocabulary changes, the word vector of every word must be recalculated. Moreover, when the vocabulary is large, training the neural network consumes a great deal of computing power and time. This is especially burdensome in Chinese natural language processing.
In one prior approach, a Chinese word is expressed as a matrix synthesized from the lattices of the Chinese characters in the word, an orthogonal transformation is applied, and part of the transform coefficients are kept as the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
A word vector calculation method that is simple to compute, fully exploits the characteristics of natural language, overcomes these defects, and is unaffected by the addition of new words would expand the application range of natural language processing. The present invention was made in view of this practical demand.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to provide a method for calculating Chinese word vectors by principal component analysis, so as to solve the problem that common word vectorization techniques consume a large amount of computing power and time.
(II) technical scheme
In order to solve the above technical problems, the present invention provides a method for calculating a chinese word vector by principal component analysis, which is characterized in that the method comprises the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values so that the computer can process it further;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the Chinese characters in the word are combined into the word's composite vector, converting the word into numeric vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all words in the reference vocabulary;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms a word's composite vector is computed from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word's vector.
Further, the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words consisting of a single Chinese character and words composed of multiple Chinese characters.
Further, the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and each lattice element is 0 or 1; and arranging the elements of each character lattice, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d=16 or d=24.
Further, the step S3 specifically includes: for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
Further, the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
and calculating, by the rules of matrix arithmetic, X = Aᵀ×A, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
Further, the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, the step S7 specifically includes: selecting the smallest number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
Further, the step S8 specifically includes: for any Chinese word W_j, calculating the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j according to steps S2 and S3 and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D); then calculating, by the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P, where Z is a vector of 1 row and L columns; Z is the word vector of the Chinese word W_j.
(III) beneficial effects
The invention provides a method for calculating Chinese word vectors by principal component analysis that fully utilizes the characteristics of Chinese characters, is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, makes the dimension of the word vectors easy to determine, and has important application value in Chinese natural language processing.
Drawings
FIG. 1 is a flow chart of a method for computing Chinese word vectors using principal component analysis in accordance with the present invention.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
The invention first expresses a Chinese word as the composite vector of the Chinese characters in the word, forming a vector space; it then computes the basis vectors of this space, and finally projects the composite vector of the word onto those basis vectors, taking the projection coordinates as the word vector of the Chinese word.
FIG. 1 is a flow chart of a method of calculating a Chinese word vector using principal component analysis in accordance with the present invention. As shown in fig. 1, the method includes:
S1, selecting a reference Chinese vocabulary. Representative words in Chinese are selected as the basis of principal component analysis.
In practice, M (M > 10000) Chinese words W_k, k = 1, 2, …, M are selected, including single-character words (the Chinese characters of GB 2312 may be used) and words composed of multiple Chinese characters (the common words published by the relevant departments may be used).
S2, obtaining the lattice vector of each Chinese character in a Chinese word. Each Chinese character is represented by a vector of numerical values so that the computer can process it further.
In a specific implementation, the lattice MC_ki of each Chinese character C_ki in the word W_k is obtained, where the lattice size is d×d, d = 16 or d = 24, and each lattice element is 0 or 1. The elements of each character lattice are arranged, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
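The lattice-to-vector arrangement just described can be sketched as follows. A real implementation would read the 16×16 bitmap of a character from a font file; a random binary matrix stands in for such a bitmap here, since no font data is available in this text.

```python
import numpy as np

d = 16       # lattice size d×d, as in the description (d = 16 or d = 24)
D = d * d    # vector dimension D = d*d = 256

# Stand-in for a real 16×16 font bitmap of a Chinese character:
# a binary matrix whose entries are 0 or 1.
rng = np.random.default_rng(0)
lattice = rng.integers(0, 2, size=(d, d))

# Arrange the lattice elements row by row into a 1×D vector (a_1, ..., a_D).
lattice_vector = lattice.reshape(-1)
```

Column-order arrangement, also permitted by the text, would be `lattice.T.reshape(-1)`; either choice works as long as it is applied consistently to every character.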
S3, calculating the composite vector of each Chinese word. The lattice vectors of the Chinese characters in a word are combined into the word's composite vector, so the word too is converted into numeric vector form.
In practice, for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
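A minimal sketch of the composite-vector computation follows. Note that the patent's weight formula for w_i is not reproduced in this text (the equation image is missing), so uniform weights w_i = 1/n are assumed here purely for illustration; the random binary vectors stand in for real character lattice vectors.

```python
import numpy as np

D = 256  # dimension of each character lattice vector (16×16 lattice)

# Stand-in lattice vectors for a 2-character word (random binary, illustrative).
rng = np.random.default_rng(1)
char_vectors = [rng.integers(0, 2, size=D).astype(float) for _ in range(2)]

# Composite word vector MW_k = w_1*MC_k1 + ... + w_n*MC_kn.
# The patent's weight formula is not shown in this text, so uniform
# weights w_i = 1/n are assumed for this sketch only.
n = len(char_vectors)
weights = [1.0 / n] * n
MW_k = sum(w * mc for w, mc in zip(weights, char_vectors))
```

Because a word of any length collapses to one D-dimensional vector, a newly coined word can be vectorized without retraining, which is the source of the method's "unknown word" robustness.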
S4, calculating an average synthetic vector of the reference vocabulary. An average composite vector of all words of the reference vocabulary is calculated.
In a specific implementation, the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
S5, calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words.
In a specific implementation, the average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns.
By the rules of matrix arithmetic, X = Aᵀ×A is calculated, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
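Steps S4 and S5 can be sketched together as follows, with random data standing in for the composite vectors of the reference vocabulary and toy sizes M = 100, D = 256 (the patent suggests M > 10000 in practice).

```python
import numpy as np

M, D = 100, 256  # M reference words, D-dimensional composite vectors (toy sizes)

rng = np.random.default_rng(2)
composites = rng.random((M, D))   # row k is the composite vector MW_k

MW = composites.mean(axis=0)      # S4: average composite vector
A = composites - MW               # M×D matrix of deviations MW_k - MW
X = A.T @ A                       # S5: covariance matrix, D×D
```

Forming X = Aᵀ×A makes the result symmetric and positive semidefinite by construction, which is what guarantees real, non-negative eigenvalues in the next step.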
S6, calculating the eigenvalues and eigenvectors of the covariance matrix to obtain its characteristics.
In practice, the eigenvalues and eigenvectors of the covariance matrix may be calculated by the Jacobi method or another method, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
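A sketch of this eigendecomposition step, using NumPy's symmetric eigensolver in place of the Jacobi method mentioned above (for a symmetric matrix both yield the same eigenpairs); the matrix is a small random stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((100, 16))
X = A.T @ A                  # symmetric covariance-style matrix, 16×16

# np.linalg.eigh returns the eigenvalues of a symmetric matrix in
# ascending order; reverse both arrays to get lambda_1 >= ... >= lambda_D.
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]  # column j is the eigenvector for eigvals[j]
```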
S7, calculating the projection matrix for the composite vectors of Chinese words. The matrix that transforms a word's composite vector is calculated from the eigenvalues and eigenvectors of the covariance matrix.
In a specific implementation, the smallest number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, and the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
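The selection of L and the construction of the projection matrix P can be sketched as follows (random stand-in data, D = 16):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((200, 16))
X = A.T @ A
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest L with (lambda_1+...+lambda_L)/(lambda_1+...+lambda_D) >= 0.99:
# the cumulative-sum ratio is non-decreasing, so searchsorted finds the
# first index where it reaches the 0.99 threshold.
ratio = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(ratio, 0.99) + 1)

P = eigvecs[:, :L]   # projection matrix, D rows × L columns
```

The 0.99 threshold is the variance-retention criterion stated in the text; it, rather than a hand-picked dimension, is what determines the word-vector dimension L.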
S8, calculating word vectors of Chinese words. And subtracting the average synthesized vector from the synthesized vector of any Chinese word, and multiplying the obtained product by a projection matrix to obtain the word vector of the word.
In a specific implementation, for any Chinese word W_j, the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j is calculated according to steps S2 and S3, and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D). By the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P is calculated, where Z is a vector of 1 row and L columns; Z is the word vector of the Chinese word W_j.
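The final projection step can be sketched as follows; the mean vector and the orthonormal projection matrix are random stand-ins with toy sizes D = 16, L = 4, where in the method they would come from steps S4 and S7.

```python
import numpy as np

D, L = 16, 4
rng = np.random.default_rng(5)

MW = rng.random(D)    # average composite vector (stand-in for step S4)
# Stand-in orthonormal D×L projection matrix (step S7 would build it
# from the top-L eigenvectors of the covariance matrix).
P = np.linalg.qr(rng.random((D, D)))[0][:, :L]

MW_j = rng.random(D)  # composite vector of an arbitrary word W_j
Y = MW_j - MW         # deviation from the mean
Z = Y @ P             # word vector Z = Y×P, a 1×L row vector
```

Because P is fixed once computed, vectorizing a new word costs only one subtraction and one D×L matrix product, which is the computational simplicity the disclosure claims.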
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (7)
1. A method for calculating a chinese word vector using principal component analysis, the method comprising the steps of:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values so that the computer can process it further;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the Chinese characters in the word are combined into the word's composite vector, converting the word into numeric vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all words in the reference vocabulary;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms a word's composite vector is computed from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word's vector;
wherein,
The step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words consisting of a single Chinese character and words composed of multiple Chinese characters;
the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and each lattice element is 0 or 1; and arranging the elements of each character lattice, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D;
the step S3 specifically includes: for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
2. The method of claim 1, wherein d=16 or d=24.
3. The method for calculating a Chinese word vector by principal component analysis as recited in claim 2, wherein the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
4. The method for calculating a Chinese word vector by principal component analysis as recited in claim 3, wherein the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
and calculating, by the rules of matrix arithmetic, X = Aᵀ×A, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
5. The method for calculating a Chinese word vector by principal component analysis as recited in claim 4, wherein the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
6. The method for calculating a Chinese word vector by principal component analysis as recited in claim 5, wherein the step S7 specifically includes: selecting the smallest number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P of D rows and L columns.
7. The method for calculating a Chinese word vector by principal component analysis as recited in claim 6, wherein the step S8 specifically includes: for any Chinese word W_j, calculating the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j according to steps S2 and S3, and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D); and calculating, by the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P, where Z is a vector of 1 row and L columns and Z is the word vector of the Chinese word W_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110942291.6A CN113627176B (en) | 2021-08-17 | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627176A CN113627176A (en) | 2021-11-09 |
CN113627176B (en) | 2024-04-19
Family
ID=78386099
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110942291.6A Active CN113627176B (en) | 2021-08-17 | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627176B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786966A (en) * | 2004-12-09 | 2006-06-14 | Sony United Kingdom Ltd | Information processing
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | Beijing Baiwen Baida Network Technology Co., Ltd. | Text similarity and word-sense similarity calculation method, system and application system
CN102135820A (en) * | 2011-01-18 | 2011-07-27 | Zhejiang University | Planarization pre-processing method
JP2011164126A (en) * | 2010-02-04 | 2011-08-25 | Nippon Telegr & Teleph Corp <Ntt> | Noise suppression filter calculation method, and device and program therefor
CN104598441A (en) * | 2014-12-25 | 2015-05-06 | Shanghai Keyue Information Technology Co., Ltd. | Method for splitting Chinese sentences by computer
CN107194408A (en) * | 2017-06-21 | 2017-09-22 | Anhui University | Target tracking method based on a mixed-block sparse collaboration model
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | Dalian University of Technology | Chinese word vector generation method based on joint character-word training
CN108154167A (en) * | 2017-12-04 | 2018-06-12 | Kunming University of Science and Technology | Chinese character glyph similarity calculation method
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | Kunming University of Science and Technology | Bilingual (card-Chinese) word vector model construction method based on a multiple CCA algorithm
CN109992716A (en) * | 2019-03-29 | 2019-07-09 | University of Electronic Science and Technology of China | Indonesian similar-news recommendation method based on the ITQ algorithm
CN110059191A (en) * | 2019-05-07 | 2019-07-26 | Shandong Normal University | Text sentiment classification method and device
CN110196893A (en) * | 2019-05-05 | 2019-09-03 | Ping An Technology (Shenzhen) Co., Ltd. | Objective-question grading method, device and storage medium based on text similarity
CN112417153A (en) * | 2020-11-20 | 2021-02-26 | Hubo Network Technology (Shanghai) Co., Ltd. | Text classification method and device, terminal equipment and readable storage medium
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0420464D0 (en) * | 2004-09-14 | 2004-10-20 | Zentian Ltd | A speech recognition circuit and method |
JP5234469B2 (en) * | 2007-09-14 | 2013-07-10 | 国立大学法人 東京大学 | Correspondence relationship learning device and method, correspondence relationship learning program, annotation device and method, annotation program, retrieval device and method, and retrieval program |
Non-Patent Citations (5)
Title |
---|
Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation; Yuanxin Li et al.; 2014 IEEE Workshop on Statistical Signal Processing; pp. 384-387 *
Construction of a domain terminology network model based on expert knowledge and deep learning; Ding Wei; China Master's Theses Full-text Database, Information Science and Technology (No. 1); pp. I138-2560 *
Research on Chinese text classification based on manifold learning; Zhai Haichao; China Master's Theses Full-text Database, Information Science and Technology (No. 3); pp. I138-2799 *
A quantification method for Chinese character relatedness and its application in text similarity analysis; Zhao Yanbin; Li Qinghua; Journal of Computer Applications; Vol. 26 (No. 06); pp. 1398-1400 *
Research on language models for continuous Tibetan speech recognition; Li Zhaoyao; China Master's Theses Full-text Database, Information Science and Technology (No. 5); pp. I136-180 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gao et al. | High-dimensional functional time series forecasting: An application to age-specific mortality rates | |
CN107220220A (en) | Electronic equipment and method for text-processing | |
CN110532355A (en) | Joint intent and slot recognition method based on multi-task learning | |
CN109992779A (en) | CNN-based sentiment analysis method, apparatus, device and storage medium | |
CN103810999A (en) | Linguistic model training method and system based on distributed neural networks | |
Shah et al. | Image captioning using deep neural architectures | |
CN107292382A (en) | Fixed-point quantization method for neural network acoustic model activation functions | |
CN109597988A (en) | Cross-lingual lexical sememe prediction method, device and electronic equipment | |
CN113157919B (en) | Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system | |
CN104850533A (en) | Constrained nonnegative matrix factorization method and solving method | |
CN116095089B (en) | Remote sensing satellite data processing method and system | |
Lin et al. | Intelligent decision support for new product development: a consumer-oriented approach | |
Ye et al. | MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction | |
CN110334196A (en) | Neural network Chinese question generation system based on strokes and self-attention | |
JP7127570B2 (en) | Question answering device, learning device, question answering method and program | |
CN110197252A (en) | Deep learning based on distance | |
WO2020040255A1 (en) | Word coding device, analysis device, language model learning device, method, and program | |
CN113627176B (en) | Method for calculating Chinese word vector by principal component analysis | |
Poghosyan et al. | Short-term memory with read-only unit in neural image caption generator | |
CN108876038A (en) | Big data, artificial intelligence, the Optimization of Material Property method of supercomputer collaboration | |
Cai et al. | Fast learning of deep neural networks via singular value decomposition | |
CN114757189B (en) | Event extraction method and device, intelligent terminal and storage medium | |
CN116561410A (en) | Course teaching resource recommendation method | |
CN111259106A (en) | Relation extraction method combining neural network and feature calculation | |
JP4499003B2 (en) | Information processing method, apparatus, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||