CN113627176B - Method for calculating Chinese word vector by principal component analysis - Google Patents
Method for calculating Chinese word vector by principal component analysis
- Publication number: CN113627176B
- Application number: CN202110942291.6A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates (G06F: electric digital data processing; G06F40/20: natural language analysis)
- G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (under G06F17/10: complex mathematical operations)
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a method for calculating Chinese word vectors by principal component analysis, belonging to the field of language processing. Representative words in Chinese are selected as the basis of principal component analysis; each Chinese character is represented by a vector of numerical values; the lattice vectors of the Chinese characters in a word are combined into the word's composite vector, converting the word into numeric vector form; the average composite vector over all words of the reference vocabulary is calculated; the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain a covariance matrix of the differences between words; the eigenvalues and eigenvectors of the covariance matrix are computed, and from them the matrix that transforms a word's composite vector is calculated; finally, for any Chinese word, the average composite vector is subtracted from its composite vector and the result is multiplied by the projection matrix to obtain the word's vector. The method is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, and has important application value in Chinese natural language processing.
Description
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating a Chinese word vector by utilizing principal component analysis, in particular to a method for calculating a word vector of a Chinese word by utilizing a Chinese character lattice and principal component analysis.
Background
Natural language processing is a technology for processing human language with a computer. Since computers excel at numerical computation, natural language must first be converted into numerical form before it can be processed. This conversion is called vectorization of characters, words, and sentences: a character, a word, or a sentence is each represented by a set of numbers.
Common word vectorization techniques are the one-hot technique and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is determined in advance, for example 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector); if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
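As a concrete illustration of the one-hot scheme described above, the following sketch builds one-hot vectors for a toy vocabulary; the vocabulary size of 5 and the chosen index are illustrative stand-ins for the 10000-word example in the text.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return the one-hot vector for the word at position `index` (0-based)."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Toy vocabulary of 5 words instead of 10000 (illustrative only):
# the third word maps to a vector with a single 1 at position 2.
vec = one_hot(2, 5)
```

Every word vector has exactly one nonzero component, which is why the representation is called highly redundant: almost all of the 10000 numbers carry no information.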
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence is taken as the center word, and the n words before and after it as context words; the average of the one-hot vectors of these context words is fed into a neural network for training, with the one-hot representation of the center word as the target output. When the neural network converges, the weights connecting the hidden-layer nodes to the i-th output node form the word vector of the i-th word.
Both the one-hot and continuous bag-of-words representations require the vocabulary size to be fixed in advance, and if the vocabulary changes, the word vector of every word must be recalculated. Moreover, when the vocabulary is large, training the neural network consumes a great deal of computing power and time. This is especially burdensome in Chinese natural language processing.
In one prior approach, a Chinese word is expressed as a matrix synthesized from the lattices of the Chinese characters in the word, an orthogonal transformation is applied, and part of the transform coefficients are kept as the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
A word vector calculation method that is simple to compute, fully exploits the characteristics of natural language, overcomes these defects, and is unaffected by the addition of new words would expand the application range of natural language processing. The present invention was made in view of this practical demand.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to provide a method for calculating Chinese word vectors by principal component analysis, so as to solve the problem that common word vectorization techniques consume a large amount of computing power and time.
(II) technical scheme
In order to solve the above technical problems, the present invention provides a method for calculating a chinese word vector by principal component analysis, which is characterized in that the method comprises the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values so that the computer can process it further;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the Chinese characters in the word are combined into the word's composite vector, converting the word into numeric vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all words in the reference vocabulary;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms a word's composite vector is computed from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word's vector.
Further, the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words consisting of a single Chinese character and words composed of multiple Chinese characters.
Further, the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and each lattice element is 0 or 1; and arranging the elements of each character lattice, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d=16 or d=24.
Further, the step S3 specifically includes: for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
Further, the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
and calculating, by the rules of matrix arithmetic, X = Aᵀ×A, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
Further, the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, the step S7 specifically includes: selecting the smallest number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
Further, the step S8 specifically includes: for any Chinese word W_j, calculating the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j according to steps S2 and S3 and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D); then calculating, by the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P, where Z is a vector of 1 row and L columns; Z is the word vector of the Chinese word W_j.
(III) beneficial effects
The invention provides a method for calculating Chinese word vectors by principal component analysis that fully utilizes the characteristics of Chinese characters, is computationally simple, avoids the common "unknown word" problem in Chinese word vectorization, makes the dimension of the word vectors easy to determine, and has important application value in Chinese natural language processing.
Drawings
FIG. 1 is a flow chart of a method for computing Chinese word vectors using principal component analysis in accordance with the present invention.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
The invention first expresses a Chinese word as the composite vector of the Chinese characters in the word, forming a vector space; it then computes the basis vectors of this space, and finally projects the composite vector of the word onto those basis vectors, taking the projection coordinates as the word vector of the Chinese word.
FIG. 1 is a flow chart of a method of calculating a Chinese word vector using principal component analysis in accordance with the present invention. As shown in fig. 1, the method includes:
S1, selecting a reference Chinese vocabulary. Representative words in Chinese are selected as the basis of principal component analysis.
In practice, M (M > 10000) Chinese words W_k, k = 1, 2, …, M are selected, including single-character words (the Chinese characters of GB 2312 may be used) and words composed of multiple Chinese characters (the common words published by the relevant departments may be used).
S2, obtaining the lattice vector of each Chinese character in a Chinese word. Each Chinese character is represented by a vector of numerical values so that the computer can process it further.
In a specific implementation, the lattice MC_ki of each Chinese character C_ki in the word W_k is obtained, where the lattice size is d×d, d = 16 or d = 24, and each lattice element is 0 or 1. The elements of each character lattice are arranged, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
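The lattice-to-vector arrangement just described can be sketched as follows. A real implementation would read the 16×16 bitmap of a character from a font file; a random binary matrix stands in for such a bitmap here, since no font data is available in this text.

```python
import numpy as np

d = 16       # lattice size d×d, as in the description (d = 16 or d = 24)
D = d * d    # vector dimension D = d*d = 256

# Stand-in for a real 16×16 font bitmap of a Chinese character:
# a binary matrix whose entries are 0 or 1.
rng = np.random.default_rng(0)
lattice = rng.integers(0, 2, size=(d, d))

# Arrange the lattice elements row by row into a 1×D vector (a_1, ..., a_D).
lattice_vector = lattice.reshape(-1)
```

Column-order arrangement, also permitted by the text, would be `lattice.T.reshape(-1)`; either choice works as long as it is applied consistently to every character.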
S3, calculating the composite vector of each Chinese word. The lattice vectors of the Chinese characters in a word are combined into the word's composite vector, so the word too is converted into numeric vector form.
In practice, for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
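A minimal sketch of the composite-vector computation follows. Note that the patent's weight formula for w_i is not reproduced in this text (the equation image is missing), so uniform weights w_i = 1/n are assumed here purely for illustration; the random binary vectors stand in for real character lattice vectors.

```python
import numpy as np

D = 256  # dimension of each character lattice vector (16×16 lattice)

# Stand-in lattice vectors for a 2-character word (random binary, illustrative).
rng = np.random.default_rng(1)
char_vectors = [rng.integers(0, 2, size=D).astype(float) for _ in range(2)]

# Composite word vector MW_k = w_1*MC_k1 + ... + w_n*MC_kn.
# The patent's weight formula is not shown in this text, so uniform
# weights w_i = 1/n are assumed for this sketch only.
n = len(char_vectors)
weights = [1.0 / n] * n
MW_k = sum(w * mc for w, mc in zip(weights, char_vectors))
```

Because a word of any length collapses to one D-dimensional vector, a newly coined word can be vectorized without retraining, which is the source of the method's "unknown word" robustness.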
S4, calculating an average synthetic vector of the reference vocabulary. An average composite vector of all words of the reference vocabulary is calculated.
In a specific implementation, the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
S5, calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words.
In a specific implementation, the average composite vector MW = (a_1, a_2, …, a_D) is subtracted from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns.
By the rules of matrix arithmetic, X = Aᵀ×A is calculated, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
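Steps S4 and S5 can be sketched together as follows, with random data standing in for the composite vectors of the reference vocabulary and toy sizes M = 100, D = 256 (the patent suggests M > 10000 in practice).

```python
import numpy as np

M, D = 100, 256  # M reference words, D-dimensional composite vectors (toy sizes)

rng = np.random.default_rng(2)
composites = rng.random((M, D))   # row k is the composite vector MW_k

MW = composites.mean(axis=0)      # S4: average composite vector
A = composites - MW               # M×D matrix of deviations MW_k - MW
X = A.T @ A                       # S5: covariance matrix, D×D
```

Forming X = Aᵀ×A makes the result symmetric and positive semidefinite by construction, which is what guarantees real, non-negative eigenvalues in the next step.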
S6, calculating the eigenvalues and eigenvectors of the covariance matrix to obtain its characteristics.
In practice, the eigenvalues and eigenvectors of the covariance matrix may be calculated by the Jacobi method or another method, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
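A sketch of this eigendecomposition step, using NumPy's symmetric eigensolver in place of the Jacobi method mentioned above (for a symmetric matrix both yield the same eigenpairs); the matrix is a small random stand-in.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((100, 16))
X = A.T @ A                  # symmetric covariance-style matrix, 16×16

# np.linalg.eigh returns the eigenvalues of a symmetric matrix in
# ascending order; reverse both arrays to get lambda_1 >= ... >= lambda_D.
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]  # column j is the eigenvector for eigvals[j]
```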
S7, calculating the projection matrix for the composite vectors of Chinese words. The matrix that transforms a word's composite vector is calculated from the eigenvalues and eigenvectors of the covariance matrix.
In a specific implementation, the smallest number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, and the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
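The selection of L and the construction of the projection matrix P can be sketched as follows (random stand-in data, D = 16):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((200, 16))
X = A.T @ A
eigvals, eigvecs = np.linalg.eigh(X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Smallest L with (lambda_1+...+lambda_L)/(lambda_1+...+lambda_D) >= 0.99:
# the cumulative-sum ratio is non-decreasing, so searchsorted finds the
# first index where it reaches the 0.99 threshold.
ratio = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(ratio, 0.99) + 1)

P = eigvecs[:, :L]   # projection matrix, D rows × L columns
```

The 0.99 threshold is the variance-retention criterion stated in the text; it, rather than a hand-picked dimension, is what determines the word-vector dimension L.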
S8, calculating word vectors of Chinese words. And subtracting the average synthesized vector from the synthesized vector of any Chinese word, and multiplying the obtained product by a projection matrix to obtain the word vector of the word.
In a specific implementation, for any Chinese word W_j, the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j is calculated according to steps S2 and S3, and the average composite vector MW = (a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D). By the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P is calculated, where Z is a vector of 1 row and L columns; Z is the word vector of the Chinese word W_j.
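The final projection step can be sketched as follows; the mean vector and the orthonormal projection matrix are random stand-ins with toy sizes D = 16, L = 4, where in the method they would come from steps S4 and S7.

```python
import numpy as np

D, L = 16, 4
rng = np.random.default_rng(5)

MW = rng.random(D)    # average composite vector (stand-in for step S4)
# Stand-in orthonormal D×L projection matrix (step S7 would build it
# from the top-L eigenvectors of the covariance matrix).
P = np.linalg.qr(rng.random((D, D)))[0][:, :L]

MW_j = rng.random(D)  # composite vector of an arbitrary word W_j
Y = MW_j - MW         # deviation from the mean
Z = Y @ P             # word vector Z = Y×P, a 1×L row vector
```

Because P is fixed once computed, vectorizing a new word costs only one subtraction and one D×L matrix product, which is the computational simplicity the disclosure claims.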
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (7)
1. A method for calculating a chinese word vector using principal component analysis, the method comprising the steps of:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the basis of principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each Chinese character is represented by a vector of numerical values so that the computer can process it further;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the Chinese characters in the word are combined into the word's composite vector, converting the word into numeric vector form;
S4, calculating the average composite vector of the reference vocabulary, i.e. the average of the composite vectors of all words in the reference vocabulary;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the resulting deviation matrix is multiplied by its transpose to obtain the covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: the matrix that transforms a word's composite vector is computed from the eigenvalues and eigenvectors of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word's vector;
wherein,
The step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words consisting of a single Chinese character and words composed of multiple Chinese characters;
the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and each lattice element is 0 or 1; and arranging the elements of each character lattice, in row or column order, into a 1×D row vector (a_1, a_2, …, a_D), where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D;
the step S3 specifically includes: for a Chinese word W_k composed of n characters, the word's composite vector MW_k is the weighted sum of the lattice vectors MC_ki of its characters, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
2. The method of claim 1, wherein d=16 or d=24.
3. The method for calculating a Chinese word vector by principal component analysis as recited in claim 2, wherein the step S4 specifically includes: calculating the average composite vector MW of the M reference words as MW = (MW_1 + MW_2 + … + MW_M)/M.
4. The method for calculating a Chinese word vector by principal component analysis as recited in claim 3, wherein the step S5 specifically includes: subtracting the average composite vector MW = (a_1, a_2, …, a_D) from the composite vector MW_k = (a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
and calculating, by the rules of matrix arithmetic, X = Aᵀ×A, where Aᵀ denotes the transpose of A and is a matrix of D rows and M columns; the resulting covariance matrix X is a matrix of D rows and D columns.
5. The method for calculating a Chinese word vector by principal component analysis as recited in claim 4, wherein the step S6 specifically includes: calculating the eigenvalues and eigenvectors of the covariance matrix and arranging the eigenvalues λ_j in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
6. The method for calculating a Chinese word vector by principal component analysis as recited in claim 5, wherein the step S7 specifically includes: selecting the smallest number L satisfying (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, the eigenvectors V_j corresponding to the L largest eigenvalues λ_j forming a projection matrix P of D rows and L columns.
7. The method for calculating a Chinese word vector by principal component analysis as recited in claim 6, wherein the step S8 specifically includes: for any Chinese word W_j, calculating the composite vector MW_j = (a_j1, a_j2, …, a_jD) of W_j according to steps S2 and S3, and subtracting the average composite vector MW = (a_1, a_2, …, a_D) to obtain the vector Y = (a_j1-a_1, a_j2-a_2, …, a_jD-a_D); and calculating, by the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P, where Z is a vector of 1 row and L columns and Z is the word vector of the Chinese word W_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110942291.6A CN113627176B (en) | 2021-08-17 | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627176A CN113627176A (en) | 2021-11-09 |
CN113627176B (en) | 2024-04-19
Family
ID=78386099
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110942291.6A Active CN113627176B (en) | 2021-08-17 | 2021-08-17 | Method for calculating Chinese word vector by principal component analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627176B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786966A (en) * | 2004-12-09 | 2006-06-14 | Sony United Kingdom Ltd | Information processing
CN101079026A (en) * | 2007-07-02 | 2007-11-28 | Beijing Baiwen Baida Network Technology Co., Ltd. | Text similarity and word-sense similarity calculation method, system and application system
CN102135820A (en) * | 2011-01-18 | 2011-07-27 | Zhejiang University | Planarization pre-processing method
JP2011164126A (en) * | 2010-02-04 | 2011-08-25 | Nippon Telegr & Teleph Corp <Ntt> | Noise suppression filter calculation method, and device and program therefor
CN104598441A (en) * | 2014-12-25 | 2015-05-06 | Shanghai Keyue Information Technology Co., Ltd. | Method for splitting Chinese sentences by computer
CN107194408A (en) * | 2017-06-21 | 2017-09-22 | Anhui University | Target tracking method based on a mixed-block sparse collaboration model
CN107273355A (en) * | 2017-06-12 | 2017-10-20 | Dalian University of Technology | Chinese word vector generation method based on joint character-word training
CN108154167A (en) * | 2017-12-04 | 2018-06-12 | Kunming University of Science and Technology | Chinese character glyph similarity calculation method
CN109582951A (en) * | 2018-10-19 | 2019-04-05 | Kunming University of Science and Technology | Bilingual (card-Chinese) word vector model construction method based on a multiple CCA algorithm
CN109992716A (en) * | 2019-03-29 | 2019-07-09 | University of Electronic Science and Technology of China | Indonesian similar-news recommendation method based on the ITQ algorithm
CN110059191A (en) * | 2019-05-07 | 2019-07-26 | Shandong Normal University | Text sentiment classification method and device
CN110196893A (en) * | 2019-05-05 | 2019-09-03 | Ping An Technology (Shenzhen) Co., Ltd. | Objective-question grading method, device and storage medium based on text similarity
CN112417153A (en) * | 2020-11-20 | 2021-02-26 | Hubo Network Technology (Shanghai) Co., Ltd. | Text classification method and device, terminal equipment and readable storage medium
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0420464D0 (en) * | 2004-09-14 | 2004-10-20 | Zentian Ltd | A speech recognition circuit and method |
JP5234469B2 (en) * | 2007-09-14 | 2013-07-10 | 国立大学法人 東京大学 | Correspondence relationship learning device and method, correspondence relationship learning program, annotation device and method, annotation program, retrieval device and method, and retrieval program |
Non-Patent Citations (5)
Title |
---|
Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation; Yuanxin Li et al.; 2014 IEEE Workshop on Statistical Signal Processing; pp. 384-387 *
Construction of a domain terminology network model based on expert knowledge and deep learning; Ding Wei; China Master's Theses Full-text Database, Information Science and Technology (No. 1); pp. I138-2560 *
Research on Chinese text classification based on manifold learning; Zhai Haichao; China Master's Theses Full-text Database, Information Science and Technology (No. 3); pp. I138-2799 *
A quantification method for Chinese character relatedness and its application in text similarity analysis; Zhao Yanbin; Li Qinghua; Journal of Computer Applications; Vol. 26 (No. 06); pp. 1398-1400 *
Research on language models for continuous Tibetan speech recognition; Li Zhaoyao; China Master's Theses Full-text Database, Information Science and Technology (No. 5); pp. I136-180 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gao et al. | High-dimensional functional time series forecasting: An application to age-specific mortality rates | |
CN107220220A (en) | Electronic equipment and method for text-processing | |
CN110532355A (en) | Joint intent and slot recognition method based on multi-task learning | |
CN109992779A (en) | CNN-based sentiment analysis method, apparatus, device and storage medium | |
CN103810999A (en) | Linguistic model training method and system based on distributed neural networks | |
Shah et al. | Image captioning using deep neural architectures | |
CN107292382A (en) | Fixed-point quantization method for neural network acoustic model activation functions | |
CN109597988A (en) | Cross-lingual lexical sememe prediction method, device and electronic equipment | |
CN113157919B (en) | Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system | |
CN104850533A (en) | Constrained nonnegative matrix factorization method and solving method | |
CN116095089B (en) | Remote sensing satellite data processing method and system | |
Lin et al. | Intelligent decision support for new product development: a consumer-oriented approach | |
Ye et al. | MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction | |
CN110334196A (en) | Neural network Chinese question generation system based on strokes and self-attention | |
JP7127570B2 (en) | Question answering device, learning device, question answering method and program | |
CN110197252A (en) | Deep learning based on distance | |
WO2020040255A1 (en) | Word coding device, analysis device, language model learning device, method, and program | |
CN113627176B (en) | Method for calculating Chinese word vector by principal component analysis | |
Poghosyan et al. | Short-term memory with read-only unit in neural image caption generator | |
CN108876038A (en) | Big data, artificial intelligence, the Optimization of Material Property method of supercomputer collaboration | |
Cai et al. | Fast learning of deep neural networks via singular value decomposition | |
CN114757189B (en) | Event extraction method and device, intelligent terminal and storage medium | |
CN116561410A (en) | Course teaching resource recommendation method | |
CN111259106A (en) | Relation extraction method combining neural network and feature calculation | |
JP4499003B2 (en) | Information processing method, apparatus, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||