CN113627176B - Method for calculating Chinese word vector by principal component analysis

Info

Publication number
CN113627176B
Authority
CN
China
Prior art keywords
vector
word
chinese
words
calculating
Prior art date
Legal status
Active
Application number
CN202110942291.6A
Other languages
Chinese (zh)
Other versions
CN113627176A (en)
Inventor
蒋遂平
袁晓光
刘轩
王璐静
臧小滨
Current Assignee
Beijing Aerospace Aiwei Electronic Technology Ltd
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Aerospace Aiwei Electronic Technology Ltd
Beijing Institute of Computer Technology and Applications
Priority date
Filing date
Publication date
Application filed by Beijing Aerospace Aiwei Electronic Technology Ltd and Beijing Institute of Computer Technology and Applications
Priority to CN202110942291.6A
Publication of CN113627176A
Application granted
Publication of CN113627176B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a method for calculating Chinese word vectors by principal component analysis, belonging to the field of language processing. The method selects representative Chinese words as the benchmark for principal component analysis; represents each Chinese character by a vector of numerical values; combines the character lattice vectors within a word into a composite vector, converting the word into numerical form; calculates the average composite vector over all words of the reference vocabulary; subtracts the average composite vector from the composite vector of each reference word and multiplies the resulting difference matrix by its transpose to obtain a covariance matrix of the differences between words; computes the eigenvalues and eigenvectors of the covariance matrix; from them calculates a projection matrix for transforming word composite vectors; and finally, for any Chinese word, subtracts the average composite vector from its composite vector and multiplies the result by the projection matrix to obtain the word vector of that word. The method is computationally simple, avoids the common 'unknown word' (out-of-vocabulary) problem in Chinese word vectorization, and has significant application value in Chinese natural language processing.

Description

Method for calculating Chinese word vector by principal component analysis
Technical Field
The invention belongs to the field of language processing, and particularly relates to a method for calculating Chinese word vectors using principal component analysis, more specifically to a method for calculating the word vector of a Chinese word from Chinese character lattices (dot-matrix glyphs) and principal component analysis.
Background
Natural language processing is a technique for processing human language with a computer. Since computers excel at numerical computation, processing natural language first requires converting it into numerical form. This conversion is called vectorization of characters, words and sentences, i.e. a character, a word or a sentence is represented by a set of numbers.
Common word vectorization techniques are the one-hot technique and the continuous bag-of-words (CBOW) technique. In the one-hot technique, a vocabulary is fixed in advance, for example 10000 words, and each word is represented by 10000 ordered numbers (a 10000-dimensional vector); if a word occupies the i-th position in the vocabulary, the i-th component of its vector is 1 and the remaining components are 0.
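For concreteness, a minimal sketch of the one-hot scheme (the toy vocabulary and all names below are illustrative, not taken from the patent):

    import numpy as np

    def one_hot(word, vocab):
        # Component i is 1 if `word` occupies position i of the
        # pre-determined vocabulary; all other components are 0.
        v = np.zeros(len(vocab))
        v[vocab.index(word)] = 1.0
        return v

    vocab = ["北京", "计算机", "技术", "应用", "研究"]  # toy 5-word vocabulary
    print(one_hot("技术", vocab))  # [0. 0. 1. 0. 0.]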
Because the one-hot representation is highly redundant, the continuous bag-of-words representation was developed: a word in a sentence is taken as the center word and the n words before and after it as context words; the average of the one-hot vectors of these context words is fed into a neural network for training, with the one-hot representation of the center word as the target output. When the neural network has converged, the weights connecting the i-th output node to the hidden-layer nodes form the word vector of the i-th word.
Both the one-hot representation and the continuous bag-of-words representation require the vocabulary size to be fixed in advance, and if the vocabulary changes, the word vector of every word must be recalculated. Furthermore, when the vocabulary contains many words, training the neural network consumes a large amount of computing power and time. This is a particular burden in Chinese natural language processing.
An existing alternative represents a Chinese word as a matrix synthesized from the dot matrices of the Chinese characters in the word, applies an orthogonal transform, and retains part of the transform coefficients as the word vector. This allows new words to be added, but the number of retained coefficients (the dimension of the word vector) is difficult to determine.
A word vector calculation method that is computationally simple, fully exploits the characteristics of the language, overcomes the above shortcomings, and is unaffected by the addition of new words would expand the application range of natural language processing. The present invention has been made in view of this practical demand.
Disclosure of Invention
First, the technical problem to be solved
The invention aims to provide a method for calculating Chinese word vectors by principal component analysis, so as to solve the problem that common word vectorization techniques consume a large amount of computing power and time.
(II) technical scheme
In order to solve the above technical problems, the present invention provides a method for calculating a Chinese word vector by principal component analysis, the method comprising the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the benchmark for principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each character is represented by a vector of numerical values, which is convenient for further processing by computer;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the characters in the word are combined into the composite vector of the word, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the differences are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, obtaining the characteristics of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: a matrix for transforming the composite vector of a word is calculated from the eigen-decomposition of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word vector of the word.
Further, the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of a single Chinese character and words composed of multiple Chinese characters.
Further, the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and the elements of the lattice take the values 1 and 0; the elements of each character lattice are arranged, in row or column order, into a vector (a_1, a_2, …, a_D) of 1 row and D columns, where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
Further, d=16 or d=24.
Further, the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is a weighted sum of the lattice vectors MC_ki of the characters in the word, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
Further, the step S4 specifically includes: the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
Further, the step S5 specifically includes: subtracting the average composite vector MW(a_1, a_2, …, a_D) from the composite vector MW_k(a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
according to the rules of matrix arithmetic, X = A^T×A is calculated, where A^T denotes the transpose of A; A^T is a matrix of D rows and M columns, and the resulting covariance matrix X is a matrix of D rows and D columns.
Further, the step S6 specifically includes: the eigenvalues and eigenvectors of the covariance matrix are calculated, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
Further, the step S7 specifically includes: the smallest number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, and the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
Further, the step S8 specifically includes: for any Chinese word W_j, the composite vector MW_j(a_j1, a_j2, …, a_jD) of W_j is calculated according to steps S2 and S3, and the average composite vector MW(a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1 - a_1, a_j2 - a_2, …, a_jD - a_D); according to the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P is calculated, where Z is a vector of 1 row and L columns, and Z is the word vector of the Chinese word W_j.
(III) beneficial effects
The invention provides a method for calculating Chinese word vectors by principal component analysis, which fully utilizes the characteristics of Chinese characters, is computationally simple, avoids the common 'unknown word' problem in Chinese word vectorization, makes the dimension of the word vector easy to determine, and has significant application value in Chinese natural language processing.
Drawings
FIG. 1 is a flow chart of a method for computing Chinese word vectors using principal component analysis in accordance with the present invention.
Detailed Description
To make the objects, contents and advantages of the present invention more apparent, the following detailed description of the present invention will be given with reference to the accompanying drawings and examples.
The invention discloses a method for calculating Chinese word vectors by principal component analysis, which comprises the following steps. (1) Selecting a reference Chinese vocabulary: representative words in Chinese are selected as the benchmark for principal component analysis. (2) Obtaining the lattice vector of each Chinese character in a Chinese word: each character is represented by a vector of numerical values, which is convenient for further processing by computer. (3) Calculating the composite vector of each Chinese word: the lattice vectors of the characters in the word are combined into the composite vector of the word, so that the word is likewise converted into numerical vector form. (4) Calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated. (5) Calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the differences are multiplied together to obtain a covariance matrix of the differences between words. (6) Calculating the eigenvalues and eigenvectors of the covariance matrix. (7) Calculating the projection matrix for the composite vectors of Chinese words: a matrix for transforming the composite vector of a word is calculated from the eigen-decomposition of the covariance matrix. (8) Calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word vector of the word. The invention first represents a Chinese word as the composite vector of the characters in the word, forming a vector space; it then calculates the basis vectors of this vector space, projects the composite vector of a word onto this basis, and takes the projection coordinates as the word vector of the Chinese word.
The purpose of the invention is to provide a method for calculating Chinese word vectors by principal component analysis that meets the word-vector computation needs of natural language processing.
To achieve the above object, the present invention provides a method for calculating a Chinese word vector using principal component analysis, the method comprising:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the benchmark for principal component analysis.
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each character is represented by a vector of numerical values, which is convenient for further processing by computer.
S3, calculating the composite vector of each Chinese word: the lattice vectors of the characters in the word are combined into the composite vector of the word, converting the word into numerical vector form.
S4, calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated.
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the differences are multiplied together to obtain a covariance matrix of the differences between words.
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, obtaining the characteristics of the covariance matrix.
S7, calculating the projection matrix for the composite vectors of Chinese words: a matrix for transforming the composite vector of a word is calculated from the eigen-decomposition of the covariance matrix.
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word vector of the word.
FIG. 1 is a flow chart of the method of calculating a Chinese word vector using principal component analysis in accordance with the present invention. As shown in FIG. 1, the method includes:
S1, selecting a reference Chinese vocabulary. Representative words in Chinese are selected as the benchmark for principal component analysis.
In practice, M (>10000) Chinese words W_k, k = 1, 2, …, M are selected, including words of a single Chinese character (the characters in GB 2312 may be selected) and words composed of multiple Chinese characters (commonly used words published by the relevant departments may be selected).
S2, obtaining the lattice vector of each Chinese character in a Chinese word. Each character is represented by a vector of numerical values, which is convenient for further processing by computer.
In specific implementation, the lattice MC_ki of each Chinese character C_ki in the word W_k is obtained, where the lattice size is d×d with d=16 or d=24, and the elements of the lattice take the values 1 and 0. The elements of each character lattice are arranged, in row or column order, into a vector (a_1, a_2, …, a_D) of 1 row and D columns, where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D.
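As an illustration of S2, the following sketch rasterizes a character into a d×d binary lattice and flattens it row by row; the use of the Pillow library and the font file name are assumptions for illustration, since the patent only requires a d×d dot-matrix glyph:

    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    def char_lattice_vector(ch, d=16, font_path="simsun.ttc"):
        # Render the character into a d x d 1-bit image, then flatten it
        # in row order into a vector of D = d*d elements taking values 0/1.
        font = ImageFont.truetype(font_path, d)
        img = Image.new("1", (d, d), 0)
        ImageDraw.Draw(img).text((0, 0), ch, fill=1, font=font)
        return np.asarray(img, dtype=np.float64).reshape(-1)

A dedicated bitmap lattice font would reproduce the classic 16×16 or 24×24 Chinese character lattices more faithfully than scaling an outline font.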
S3, calculating the composite vector of each Chinese word. The lattice vectors of the characters in the word are combined into the composite vector of the word, so that the word is likewise converted into numerical vector form.
In practice, for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is a weighted sum of the lattice vectors MC_ki of the characters in the word: MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn. The weight w_i of each Chinese character C_ki is calculated as follows:
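The weight formula for w_i is not reproduced in this text, so the sketch below assumes uniform weights w_i = 1/n; that uniform choice is an assumption for illustration, not the patent's definition:

    def word_composite_vector(word, d=16):
        # MW_k = w_1*MC_k1 + ... + w_n*MC_kn, with assumed weights w_i = 1/n.
        vecs = [char_lattice_vector(c, d) for c in word]
        return sum(vecs) / len(vecs)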
S4, calculating the average composite vector of the reference vocabulary. The average of the composite vectors of all words in the reference vocabulary is calculated.
In specific implementation, the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
S5, calculating the covariance matrix of the reference vocabulary. The average composite vector is subtracted from the composite vector of each reference word, and the differences are multiplied together to obtain a covariance matrix of the differences between words.
The average composite vector MW(a_1, a_2, …, a_D) is subtracted from the composite vector MW_k(a_k1, a_k2, …, a_kD) of each of the M reference words, forming a matrix A of M rows and D columns.
According to the rules of matrix arithmetic, X = A^T×A is calculated, where A^T denotes the transpose of A; A^T is a matrix of D rows and M columns, and the resulting covariance matrix X is a matrix of D rows and D columns.
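Continuing the sketch above, steps S4 and S5 amount to the following in NumPy (ref_words is a toy stand-in for the M reference words; note that, as in the patent, X = A^T×A is formed without dividing by M):

    ref_words = ["北京", "计算机", "航天"]  # toy stand-in for the M reference words

    S = np.stack([word_composite_vector(w) for w in ref_words])  # M x D
    MW = S.mean(axis=0)               # average composite vector (step S4)
    A = S - MW                        # M x D matrix of differences
    X = A.T @ A                       # D x D covariance matrix (step S5)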
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, obtaining the characteristics of the covariance matrix.
In practice, the eigenvalues and eigenvectors of the covariance matrix may be calculated using the Jacobi method or other methods, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
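Because X is symmetric, a symmetric eigensolver can stand in for the Jacobi iteration named above; a sketch:

    eigvals, eigvecs = np.linalg.eigh(X)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # re-sort into descending order
    eigvals = eigvals[order]               # lambda_1 >= lambda_2 >= ... >= lambda_D
    eigvecs = eigvecs[:, order]            # column j holds the eigenvector V_j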
S7, calculating the projection matrix for the composite vectors of Chinese words. A matrix for transforming the composite vector of a word is calculated from the eigen-decomposition of the covariance matrix.
In specific implementation, the smallest number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99; the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
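A sketch of choosing the smallest L that retains at least 99% of the total eigenvalue sum and assembling the projection matrix:

    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    L = int(np.searchsorted(ratio, 0.99)) + 1  # smallest L with ratio[L-1] >= 0.99
    P = eigvecs[:, :L]                         # projection matrix, D rows x L columns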
S8, calculating the word vector of a Chinese word. The average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word vector of the word.
In specific implementation, for any Chinese word W_j, the composite vector MW_j(a_j1, a_j2, …, a_jD) of W_j is calculated according to steps S2 and S3, and the average composite vector MW(a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1 - a_1, a_j2 - a_2, …, a_jD - a_D). According to the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P is calculated, where Z is a vector of 1 row and L columns; Z is the word vector of the Chinese word W_j.
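Under the same assumptions, computing a word vector then reduces to one subtraction and one projection; note that this works for any word, including one absent from the M reference words, which is how the method sidesteps the 'unknown word' problem:

    def word_vector(word):
        # Z = (MW_j - MW) x P: a vector of L components (the word vector).
        Y = word_composite_vector(word) - MW
        return Y @ P

    Z = word_vector("航天器")  # a word that need not be in the reference vocabulary
    print(Z.shape)             # (L,)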
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (7)

1. A method for calculating a Chinese word vector using principal component analysis, the method comprising the following steps:
S1, selecting a reference Chinese vocabulary: representative words in Chinese are selected as the benchmark for principal component analysis;
S2, obtaining the lattice vector of each Chinese character in a Chinese word: each character is represented by a vector of numerical values, which is convenient for further processing by computer;
S3, calculating the composite vector of each Chinese word: the lattice vectors of the characters in the word are combined into the composite vector of the word, converting the word into numerical vector form;
S4, calculating the average composite vector of the reference vocabulary: the average of the composite vectors of all words in the reference vocabulary is calculated;
S5, calculating the covariance matrix of the reference vocabulary: the average composite vector is subtracted from the composite vector of each reference word, and the differences are multiplied together to obtain a covariance matrix of the differences between words;
S6, calculating the eigenvalues and eigenvectors of the covariance matrix, obtaining the characteristics of the covariance matrix;
S7, calculating the projection matrix for the composite vectors of Chinese words: a matrix for transforming the composite vector of a word is calculated from the eigen-decomposition of the covariance matrix;
S8, calculating the word vector of a Chinese word: the average composite vector is subtracted from the composite vector of any Chinese word, and the result is multiplied by the projection matrix to obtain the word vector of the word;
wherein,
the step S1 specifically includes: selecting M Chinese words W_k, k = 1, 2, …, M, including words of a single Chinese character and words composed of multiple Chinese characters;
the step S2 specifically includes: obtaining the lattice vector MC_ki of each Chinese character C_ki in the word W_k, where the lattice size is d×d and the elements of the lattice take the values 1 and 0; the elements of each character lattice are arranged, in row or column order, into a vector (a_1, a_2, …, a_D) of 1 row and D columns, where D = d×d and a_i = 1 or a_i = 0, i = 1, 2, …, D;
the step S3 specifically includes: for a Chinese word W_k composed of n characters, the composite vector MW_k of the word is a weighted sum of the lattice vectors MC_ki of the characters in the word, MW_k = w_1×MC_k1 + w_2×MC_k2 + … + w_n×MC_kn, and the weight w_i of each Chinese character C_ki is calculated by:
2. The method of claim 1, wherein d=16 or d=24.
3. The method for calculating a Chinese word vector using principal component analysis as recited in claim 2, wherein said step S4 comprises: the average composite vector MW of the M reference words is calculated as MW = (MW_1 + MW_2 + … + MW_M)/M.
4. The method for calculating a Chinese word vector using principal component analysis as recited in claim 3, wherein said step S5 comprises: subtracting the average composite vector MW(a_1, a_2, …, a_D) from the composite vector MW_k(a_k1, a_k2, …, a_kD) of each of the M reference words to form a matrix A of M rows and D columns;
according to the rules of matrix arithmetic, X = A^T×A is calculated, where A^T denotes the transpose of A; A^T is a matrix of D rows and M columns, and the resulting covariance matrix X is a matrix of D rows and D columns.
5. The method for calculating a Chinese word vector using principal component analysis as recited in claim 4, wherein said step S6 comprises: the eigenvalues and eigenvectors of the covariance matrix are calculated, and the eigenvalues λ_j are arranged in descending order, λ_1 ≥ λ_2 ≥ … ≥ λ_D.
6. The method for calculating a Chinese word vector using principal component analysis as recited in claim 5, wherein said step S7 comprises: the smallest number L is selected that satisfies (λ_1 + λ_2 + … + λ_L)/(λ_1 + λ_2 + … + λ_D) ≥ 0.99, and the eigenvectors V_j corresponding to the L largest eigenvalues λ_j form a projection matrix P of D rows and L columns.
7. The method for calculating a Chinese word vector using principal component analysis as recited in claim 6, wherein said step S8 comprises: for any Chinese word W_j, the composite vector MW_j(a_j1, a_j2, …, a_jD) of W_j is calculated according to steps S2 and S3, and the average composite vector MW(a_1, a_2, …, a_D) is subtracted to obtain the vector Y = (a_j1 - a_1, a_j2 - a_2, …, a_jD - a_D); according to the rules of matrix arithmetic, the product Z = Y×P of the vector Y and the projection matrix P is calculated, where Z is a vector of 1 row and L columns, and Z is the word vector of the Chinese word W_j.

Priority Applications (1)

Application Number: CN202110942291.6A; Priority Date: 2021-08-17; Filing Date: 2021-08-17; Title: Method for calculating Chinese word vector by principal component analysis


Publications (2)

CN113627176A (en) - 2021-11-09
CN113627176B (en) - 2024-04-19

Family

ID: 78386099

Family Applications (1)

CN202110942291.6A - Active - CN113627176B (en) - Method for calculating Chinese word vector by principal component analysis

Country Status (1)

Country: CN - CN113627176B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0420464D0 (en) * 2004-09-14 2004-10-20 Zentian Ltd A speech recognition circuit and method
JP5234469B2 (en) * 2007-09-14 2013-07-10 国立大学法人 東京大学 Correspondence relationship learning device and method, correspondence relationship learning program, annotation device and method, annotation program, retrieval device and method, and retrieval program

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786966A (en) * 2004-12-09 2006-06-14 索尼英国有限公司 Information treatment
CN101079026A (en) * 2007-07-02 2007-11-28 北京百问百答网络技术有限公司 Text similarity, acceptation similarity calculating method and system and application system
JP2011164126A (en) * 2010-02-04 2011-08-25 Nippon Telegr & Teleph Corp <Ntt> Noise suppression filter calculation method, and device and program therefor
CN102135820A (en) * 2011-01-18 2011-07-27 浙江大学 Planarization pre-processing method
CN104598441A (en) * 2014-12-25 2015-05-06 上海科阅信息技术有限公司 Method for splitting Chinese sentences through computer
CN107273355A (en) * 2017-06-12 2017-10-20 大连理工大学 A kind of Chinese word vector generation method based on words joint training
CN107194408A (en) * 2017-06-21 2017-09-22 安徽大学 A kind of method for tracking target of the sparse coordination model of mixed block
CN108154167A (en) * 2017-12-04 2018-06-12 昆明理工大学 A kind of Chinese character pattern similarity calculating method
CN109582951A (en) * 2018-10-19 2019-04-05 昆明理工大学 A kind of bilingual term vector model building method of card Chinese based on multiple CCA algorithm
CN109992716A (en) * 2019-03-29 2019-07-09 电子科技大学 A kind of similar news recommended method of Indonesian based on ITQ algorithm
CN110196893A (en) * 2019-05-05 2019-09-03 平安科技(深圳)有限公司 Non- subjective item method to go over files, device and storage medium based on text similarity
CN110059191A (en) * 2019-05-07 2019-07-26 山东师范大学 A kind of text sentiment classification method and device
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Compressive parameter estimation with multiple measurement vectors via structured low-rank covariance estimation; Yuanxin Li et al.; 2014 IEEE Workshop on Statistical Signal Processing; pp. 384-387 *
Construction of a domain terminology network model based on expert knowledge and deep learning; 丁维; China Master's Theses Full-text Database, Information Science & Technology (No. 1); pp. I138-2560 *
Research on Chinese text classification based on manifold learning methods; 翟海超; China Master's Theses Full-text Database, Information Science & Technology (No. 3); pp. I138-2799 *
A quantification method for Chinese character relatedness and its application in text similarity analysis; 赵彦斌, 李庆华; Computer Applications (计算机应用); Vol. 26, No. 06; pp. 1398-1400 *
Research on language models for Tibetan continuous speech recognition; 李照耀; China Master's Theses Full-text Database, Information Science & Technology (No. 5); pp. I136-180 *

Also Published As

CN113627176A (en) - 2021-11-09

Similar Documents

Publication Publication Date Title
Gao et al. High-dimensional functional time series forecasting: An application to age-specific mortality rates
CN107220220A (en) Electronic equipment and method for text-processing
CN110532355A (en) A kind of intention based on multi-task learning combines recognition methods with slot position
CN109992779A (en) A kind of sentiment analysis method, apparatus, equipment and storage medium based on CNN
CN103810999A (en) Linguistic model training method and system based on distributed neural networks
Shah et al. Image captioning using deep neural architectures
CN107292382A (en) A kind of neutral net acoustic model activation primitive pinpoints quantization method
CN109597988A (en) The former prediction technique of vocabulary justice, device and electronic equipment across language
CN113157919B (en) Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN104850533A (en) Constrained nonnegative matrix decomposing method and solving method
CN116095089B (en) Remote sensing satellite data processing method and system
Lin et al. Intelligent decision support for new product development: a consumer-oriented approach
Ye et al. MultiTL-KELM: A multi-task learning algorithm for multi-step-ahead time series prediction
CN110334196A (en) Neural network Chinese charater problem based on stroke and from attention mechanism generates system
JP7127570B2 (en) Question answering device, learning device, question answering method and program
CN110197252A (en) Deep learning based on distance
WO2020040255A1 (en) Word coding device, analysis device, language model learning device, method, and program
CN113627176B (en) Method for calculating Chinese word vector by principal component analysis
Poghosyan et al. Short-term memory with read-only unit in neural image caption generator
CN108876038A (en) Big data, artificial intelligence, the Optimization of Material Property method of supercomputer collaboration
Cai et al. Fast learning of deep neural networks via singular value decomposition
CN114757189B (en) Event extraction method and device, intelligent terminal and storage medium
CN116561410A (en) Course teaching resource recommendation method
CN111259106A (en) Relation extraction method combining neural network and feature calculation
JP4499003B2 (en) Information processing method, apparatus, and program

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant