CN106502394B - Term vector calculation method and device based on EEG signals - Google Patents
- Publication number
- CN106502394B (application CN201610907518.2A)
- Authority
- CN
- China
- Prior art keywords
- eeg signals
- corpus
- phrase
- term vector
- labeler
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/015—Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/011—Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/02—Preprocessing
Abstract
The present invention provides a term vector calculation method and device based on EEG signals. The method includes: step S1, collecting a text corpus and processing the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; step S2, presenting the corpus entries in continuous phrase format to a labeler to read, and acquiring the EEG signals produced as the labeler reads each phrase; step S3, taking the EEG signals corresponding to the acquired phrases as the prediction target and training term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model. With this scheme, the invention improves the accuracy of term vector calculation.
Description
Technical field
The invention belongs to the technical field of natural language processing, and in particular relates to a term vector calculation method and device based on EEG signals.
Background art
In natural language processing tasks, term vectors are commonly used to represent the words of the original text, so that numerical machine learning algorithms can be applied to text data. The basic idea of a term vector model is to map, through extensive training, each word of a language to a vector of fixed length. This length is generally much smaller than the size of the language's dictionary, usually tens to a few hundred dimensions. Together these vectors form the term vector space, and each vector can be regarded as a point in that space. Once a measure of "distance" is introduced on this space, the syntactic and semantic similarity between words can be judged from the distance between their term vectors. Traditional term vector calculation methods optimize the representation of the current text by using its vector to predict the vectors of its context as accurately as possible.
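As an illustration of judging similarity by distance in the term vector space, the following is a minimal sketch that uses cosine similarity. The choice of cosine similarity and the 4-dimensional vectors are assumptions made for this example; the patent does not specify a distance measure or give any vector values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical term vectors, invented purely for illustration.
vectors = {
    "Beijing": [0.9, 0.1, 0.3, 0.0],
    "capital": [0.8, 0.2, 0.4, 0.1],
    "love": [0.0, 0.9, 0.1, 0.7],
}

sim_related = cosine_similarity(vectors["Beijing"], vectors["capital"])
sim_unrelated = cosine_similarity(vectors["Beijing"], vectors["love"])
```

With these made-up values, `sim_related` is larger than `sim_unrelated`, which is the kind of comparison the distance measure is meant to support.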
In the traditional term vector calculation process, predicting the context from the current text is the primary training objective. This approach has three main drawbacks:
1. It considers only the syntactic-level attributes of words and not their semantic-level attributes, so the trained term vectors can usually express only fairly shallow relationships between words;
2. It does not model the human language cognition process and ignores important features from cognitive neuroscience and psychology;
3. Because the human language cognition mechanism is complex, term vectors obtained simply by predicting the context cannot reflect the characteristics of different natural language processing tasks, and their generality is poor.
Summary of the invention
The purpose of the present invention is to provide a term vector calculation method and device based on EEG signals, with the aim of improving the accuracy of term vector calculation.
The invention is realized as follows: a term vector calculation method based on EEG signals, the method comprising the following steps:
Step S1: collect a text corpus and process the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit;
Step S2: present the corpus entries in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase;
Step S3: take the EEG signals corresponding to the acquired phrases as the prediction target and train term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model.
In a further technical solution of the invention, step S1 comprises the following sub-steps:
Step S11: collect a text corpus whose entries are at the sentence or document level;
Step S12: remove from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, obtaining a preprocessed corpus;
Step S13: perform word segmentation on the preprocessed corpus to obtain words;
Step S14: use chunk parsing to convert the words into phrases, obtaining corpus entries in continuous phrase format.
In a further technical solution of the invention, step S3 comprises the following sub-steps:
Step S31: denoise the acquired EEG signals to obtain denoised EEG signals;
Step S32: apply spatial projection and dimensionality reduction to the denoised EEG signals;
Step S33: initialize every phrase in the preprocessed corpus as a term vector representation;
Step S34: traverse all phrases in the preprocessed corpus; using the term vector representation of the current phrase as the feature, predict the EEG signals of its context with a neural network regression model; compare the predicted context EEG signals with the actual EEG signals to obtain a prediction error; and adjust the term vector representation of the current phrase according to that error, where the actual EEG signals are the EEG signals recorded while the labeler read the context. This step is repeated until the prediction error is less than a preset threshold.
In a further technical solution of the invention, step S31 comprises:
processing the acquired EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
and step S32 comprises:
applying the common spatial pattern (CSP) algorithm to the EEG signals whose signal-to-noise ratio is higher than the third preset value, for spatial projection and dimensionality reduction, to obtain EEG signals whose dimension is lower than a fourth preset value.
In a further technical solution of the invention, the acquired EEG signals are denoised using the FastICA algorithm.
The present invention also provides a term vector calculation device based on EEG signals, the device comprising:
a collection module, for collecting a text corpus and processing the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit;
an acquisition module, for presenting the corpus entries in continuous phrase format to a labeler to read, and acquiring the EEG signals produced as the labeler reads each phrase;
a construction module, for taking the EEG signals corresponding to the acquired phrases as the prediction target and training term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model.
In a further technical solution of the invention, the collection module comprises:
a collection unit, for collecting a text corpus whose entries are at the sentence or document level;
a preprocessing unit, for removing from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, obtaining a preprocessed corpus;
a word segmentation unit, for performing word segmentation on the preprocessed corpus to obtain words;
a conversion unit, for using chunk parsing to convert the words into phrases, obtaining corpus entries in continuous phrase format.
In a further technical solution of the invention, the construction module comprises:
a noise reduction unit, for denoising the acquired EEG signals to obtain denoised EEG signals;
a dimensionality reduction unit, for applying spatial projection and dimensionality reduction to the denoised EEG signals;
an initialization unit, for initializing every phrase in the preprocessed corpus as a term vector representation;
a construction unit, for traversing all phrases in the preprocessed corpus; using the term vector representation of the current phrase as the feature, predicting the EEG signals of its context with a neural network regression model; comparing the predicted context EEG signals with the actual EEG signals to obtain a prediction error; and adjusting the term vector representation of the current phrase according to that error, where the actual EEG signals are the EEG signals recorded while the labeler read the context. This step is repeated until the prediction error is less than a preset threshold.
In a further technical solution of the invention, the noise reduction unit is also used to process the acquired EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
the dimensionality reduction unit is also used to apply the common spatial pattern (CSP) algorithm to the EEG signals whose signal-to-noise ratio is higher than the third preset value, for spatial projection and dimensionality reduction, to obtain EEG signals whose dimension is lower than a fourth preset value.
In a further technical solution of the invention, the noise reduction unit is also used to denoise the acquired EEG signals using the FastICA algorithm.
The beneficial effects of the present invention are as follows. The term vector calculation method and device based on EEG signals provided by the invention collect a text corpus and process the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; present the corpus entries in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase; and take the EEG signals corresponding to the acquired phrases as the prediction target and train term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model. This improves the accuracy of term vector calculation.
Brief description of the drawings
Fig. 1 is a flow diagram of the first embodiment of the EEG-based term vector calculation method of the present invention;
Fig. 2 is a detailed flow diagram of step S1 in the second embodiment of the EEG-based term vector calculation method of the present invention;
Fig. 3 is a detailed flow diagram of step S3 in the third embodiment of the EEG-based term vector calculation method of the present invention;
Fig. 4 is a functional block diagram of the first embodiment of the EEG-based term vector calculation device of the present invention;
Fig. 5 is a detailed functional block diagram of the collection module in the second embodiment of the EEG-based term vector calculation device of the present invention;
Fig. 6 is a detailed functional block diagram of the construction module in the third embodiment of the EEG-based term vector calculation device of the present invention.
Reference numerals:
Collection module 10: collection unit 101; preprocessing unit 102; word segmentation unit 103; conversion unit 104;
Acquisition module 20;
Construction module 30: noise reduction unit 301; dimensionality reduction unit 302; initialization unit 303; construction unit 304.
Detailed description of the embodiments
The solution of the embodiments of the invention is mainly as follows: collect a text corpus and process the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; present the corpus entries in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase; take the EEG signals corresponding to the acquired phrases as the prediction target and train term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model.
Referring to Fig. 1, which is a flow diagram of the first embodiment of the EEG-based term vector calculation method of the present invention, the first embodiment of the method comprises the following steps:
Step S1: collect a text corpus and process the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit;
Specifically, corpus material is language material that has actually occurred in real language use. It is usually stored in a corpus, which is a database that holds corpus material on an electronic computer. Raw corpus material generally has to be analyzed and processed before it becomes a useful resource.
At present, Chinese corpora mainly include the General Modern Chinese Corpus, the People's Daily annotated corpus, Modern Chinese corpora for language teaching and research, and Modern Chinese corpora for speech signal analysis. When corpus material is needed, it can be obtained directly from these existing corpora. The invention can of course also obtain corpus material from other sources, for example from web pages on the Internet.
Term vectors are trained with phrases as the training data, whereas the entries in a corpus are usually sentences or articles. The entries therefore have to be processed into a continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing, Beijing is the capital of China" is processed into the continuous phrase sequence "I / love / Beijing / Beijing / is / China / 's / capital".
Step S2: present the corpus entries in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase. Here, the labeler is the user who reads the corpus presented in continuous phrase format.
Specifically, the invention represents term vectors by means of EEG signals. While reading the corpus presented in continuous phrase format, the labeler wears an EEG acquisition device so that the EEG signals produced while reading each phrase can be obtained. Once these EEG signals have been obtained, each acquired EEG signal is stored together with its corresponding phrase as a pair.
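The paired storage of phrases and EEG signals could be sketched as follows. This is only an illustrative data structure: the names `PhraseRecording` and `pair_phrases_with_eeg`, and the channel-by-sample layout of the recording, are assumptions and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhraseRecording:
    """One phrase together with the EEG recorded while it was read."""
    phrase: str
    eeg: List[List[float]]  # assumed layout: eeg[channel][sample]

def pair_phrases_with_eeg(phrases, recordings):
    """Store each phrase with its EEG recording, in reading order."""
    if len(phrases) != len(recordings):
        raise ValueError("need exactly one EEG recording per phrase")
    return [PhraseRecording(p, r) for p, r in zip(phrases, recordings)]

# Made-up single-channel recordings with two samples each.
pairs = pair_phrases_with_eeg(
    ["I", "love"],
    [[[0.1, 0.2]], [[0.3, 0.4]]],
)
```

Keeping the pairs in reading order preserves the context of each phrase, which step S3 relies on.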
Step S3: take the EEG signals corresponding to the acquired phrases as the prediction target and train term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model.
Specifically, every phrase in the preprocessed corpus can first be initialized as a term vector representation. Then all phrases in the preprocessed corpus are traversed: using the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of its context; the predicted context EEG signals are compared with the actual EEG signals to obtain a prediction error; and the term vector representation of the current phrase is adjusted according to that error, where the actual EEG signals are the EEG signals recorded while the labeler read the context. This step is repeated until the prediction error is less than a preset threshold.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted context EEG signals are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase.
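The context window of three can be made concrete with a small helper that lists which phrase positions count as context for a given position. The function name `context_indices` is hypothetical, and truncating the window at the start and end of the sequence is an assumption, since the patent does not say how boundaries are handled.

```python
def context_indices(position, length, window=3):
    """Positions of up to `window` phrases before and after `position`."""
    if not 0 <= position < length:
        raise IndexError("position is outside the phrase sequence")
    start = max(0, position - window)
    end = min(length, position + window + 1)
    # Exclude the current phrase itself; only its neighbours are context.
    return [i for i in range(start, end) if i != position]

# Eight phrases, as in "I / love / Beijing / Beijing / is / China / 's / capital".
middle = context_indices(4, 8)
first = context_indices(0, 8)
```

For the fifth phrase (`position=4`) all six neighbours are available; for the first phrase only the three following phrases are.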
With the above scheme, this embodiment collects a text corpus and processes the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; presents the corpus entries in continuous phrase format to a labeler to read, and acquires the EEG signals produced as the labeler reads each phrase; and takes the EEG signals corresponding to the acquired phrases as the prediction target and trains term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model. This improves the accuracy of term vector calculation.
As the second embodiment of the present invention, refer to Fig. 2, which is a detailed flow diagram of step S1 in the EEG-based term vector calculation method described with reference to Fig. 1. Step S1, collecting a text corpus and processing the corpus entries in it to obtain corpus entries in a continuous phrase format with the phrase as the unit, may comprise:
Step S11: collect a text corpus whose entries are at the sentence or document level;
Step S12: remove from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, obtaining a preprocessed corpus;
Step S13: perform word segmentation on the preprocessed corpus to obtain words;
Step S14: use chunk parsing to convert the words into phrases, obtaining corpus entries in continuous phrase format.
Specifically, the entries in the collected text corpus are usually sentences or articles. Because a sentence may be too long or too short, a range of acceptable sentence lengths can be preset empirically: entries whose length exceeds the first preset value or is less than the second preset value are removed from the corpus, giving the preprocessed corpus. The first and second preset values can be set empirically.
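The length filter of step S12 can be sketched as below. Measuring length in characters and the concrete preset values (30 and 5) are assumptions for illustration; the patent leaves both presets to be set empirically.

```python
def filter_by_length(entries, first_preset, second_preset):
    """Drop entries longer than first_preset or shorter than second_preset.

    Length is measured in characters here, which is an assumption.
    """
    if second_preset > first_preset:
        raise ValueError("second_preset (minimum) must not exceed first_preset (maximum)")
    return [e for e in entries if second_preset <= len(e) <= first_preset]

raw_entries = ["Hi", "I love Beijing", "x" * 50]
preprocessed = filter_by_length(raw_entries, first_preset=30, second_preset=5)
```

Only the middle entry survives: the first is shorter than the second preset value and the last is longer than the first preset value.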
Term vectors are trained with phrases as the training data, whereas the entries in a corpus are usually sentences or articles. The entries therefore have to be processed into a continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing, Beijing is the capital of China" is processed into the continuous phrase sequence "I / love / Beijing / Beijing / is / China / 's / capital".
In this embodiment, word segmentation can first be performed on the preprocessed corpus to obtain words; chunk parsing is then used to convert the words into phrases, giving corpus entries in continuous phrase format.
Word segmentation relies mainly on a segmentation dictionary, and the quality of that dictionary directly determines the quality of the segmentation. The segmentation dictionaries in common use today are built on the Xinhua Dictionary or similar published works; this embodiment may also use other segmentation dictionaries.
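To show how a segmentation dictionary drives word segmentation, here is a minimal sketch of forward maximum matching, a common dictionary-based technique. The patent does not name the segmentation algorithm it uses, so this choice is an assumption, as are the function name and the toy dictionary. A Latin string without spaces stands in for unsegmented Chinese text.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches,
        # so the loop below is guaranteed to advance.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Toy dictionary and input, invented for illustration.
toy_dictionary = {"love", "beijing"}
segmented = forward_max_match("ilovebeijing", toy_dictionary, max_word_len=7)
```

A word missing from the dictionary falls back to single characters, which is why the quality of the dictionary bounds the quality of the segmentation.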
Chunk parsing is a common technique in shallow syntactic analysis. It decomposes a sentence into components according to a predefined model. These components are mainly short and longer phrases, so the computer's understanding of a sentence can rise from the level of single characters and words to phrases, which carry more information and are closer to natural language.
As the third embodiment of the present invention, refer to Fig. 3, which is a detailed flow diagram of step S3 in the EEG-based term vector calculation method described with reference to Fig. 1. Step S3, taking the EEG signals corresponding to the acquired phrases as the prediction target and training term vectors, using the current phrase as the feature to predict the EEG signals of its context and constructing an EEG-based term vector representation model, may comprise:
Step S31: denoise the acquired EEG signals to obtain denoised EEG signals;
While the EEG signals are being acquired as the labeler reads the corpus presented in continuous phrase format, the recording is easily affected by factors such as equipment noise, electromyographic (EMG) signals and electro-oculographic (EOG) signals. The EEG signals therefore need to be denoised to obtain EEG signals with a high signal-to-noise ratio.
The signal-to-noise ratio (SNR or S/N) is the ratio of signal to noise in an electronic device or system. Here, signal means the electronic signal from outside the device that the device is required to process. Noise means random additional signal (or information) that is not present in the original signal, is generated as the signal passes through the device, and does not change as the original signal changes. SNR is measured in dB and is calculated as 10 lg(PS/PN), where lg is the base-10 logarithm and PS and PN are the effective power of the signal and of the noise respectively. The higher the SNR, the smaller the noise.
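The formula 10 lg(PS/PN) can be worked through in a few lines. This sketch estimates power as the mean of the squared samples, which is an assumption about how PS and PN are measured; the sample values are invented.

```python
import math

def mean_power(samples):
    """Average power of a sampled signal (mean of squared amplitudes)."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_S / P_N)."""
    p_s = mean_power(signal)
    p_n = mean_power(noise)
    if p_s <= 0.0 or p_n <= 0.0:
        raise ValueError("signal and noise power must both be positive")
    return 10.0 * math.log10(p_s / p_n)

# Amplitude ratio 10:1 gives a power ratio of 100:1, i.e. 20 dB.
example_snr = snr_db([10.0, -10.0, 10.0, -10.0], [1.0, -1.0, 1.0, -1.0])
```

In practice the noise samples would come from the components identified as noise during denoising, not from a separately known noise source.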
In this embodiment, the FastICA algorithm is used to project the EEG signals acquired while the labeler reads the corpus presented in continuous phrase format into multiple independent components. Noise components are then identified using features such as spectral features or higher-order cross features, and removed from the acquired EEG signals, giving denoised EEG signals with a high signal-to-noise ratio. In this embodiment, the denoised EEG signals preferably have a signal-to-noise ratio above 15 dB.
Independent component analysis (ICA) is a very effective data analysis tool, mainly used to extract the original independent signals from mixed data. As an effective method of signal separation it has attracted wide attention. Among the many ICA algorithms, the fixed-point algorithm (FastICA) is widely used in signal processing because of its fast convergence and good separation performance. From the observed signals, the algorithm can estimate well the mutually statistically independent original signals that were mixed by unknown factors.
Step S32: apply spatial projection and dimensionality reduction to the denoised EEG signals;
Specifically, in this embodiment the common spatial pattern (CSP) algorithm is used to project and reduce the dimension of the denoised high-SNR EEG signals from the different channels according to their spatial positions, giving dimension-reduced EEG signals. In this embodiment, the dimension-reduced EEG signals preferably have fewer than 300 dimensions.
Step S33: initialize every phrase in the preprocessed corpus as a term vector representation;
Step S34: traverse all phrases in the preprocessed corpus; using the term vector representation of the current phrase as the feature, predict the EEG signals of its context with a neural network regression model; compare the predicted context EEG signals with the actual EEG signals to obtain a prediction error; and adjust the term vector representation of the current phrase according to that error, where the actual EEG signals are the EEG signals recorded while the labeler read the context. This step is repeated until the overall prediction error is less than a preset threshold.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted context EEG signals are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase, until the overall prediction error is less than the preset threshold. The preset error threshold can be set empirically, for example to 10^-5.
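Steps S33 and S34 can be sketched as a small training loop. This is a simplified illustration, not the patent's model: a single shared linear map `W` stands in for the neural network regression model, the EEG feature values are invented, and `dim`, `lr` and `max_epochs` are hypothetical parameters. Only the window of three and the 10^-5 threshold come from the text.

```python
import random

def train_term_vectors(phrases, eeg, dim=4, window=3, lr=0.05,
                       threshold=1e-5, max_epochs=200, seed=0):
    """Toy trainer: predict context EEG features from the current phrase vector.

    phrases: phrase strings in reading order.
    eeg:     one feature list per phrase position (assumed to be the
             denoised, dimension-reduced EEG for that phrase).
    """
    if len(phrases) < 2 or len(phrases) != len(eeg):
        raise ValueError("need at least two phrases and one EEG vector per phrase")
    rng = random.Random(seed)
    k = len(eeg[0])
    # Step S33: one small random vector per distinct phrase.
    emb = {p: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
           for p in sorted(set(phrases))}
    # Assumed stand-in for the regression network: weights W of shape k x dim.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(k)]
    history = []  # mean squared prediction error per pass
    for _ in range(max_epochs):
        total, count = 0.0, 0
        # Step S34: traverse all phrases and predict each context position.
        for t, phrase in enumerate(phrases):
            e = emb[phrase]
            lo, hi = max(0, t - window), min(len(phrases), t + window + 1)
            for c in range(lo, hi):
                if c == t:
                    continue
                pred = [sum(W[j][i] * e[i] for i in range(dim)) for j in range(k)]
                err = [pred[j] - eeg[c][j] for j in range(k)]
                total += sum(x * x for x in err) / k
                count += 1
                # Gradient of 0.5 * ||err||^2 w.r.t. e, computed before W changes.
                grad_e = [sum(W[j][i] * err[j] for j in range(k)) for i in range(dim)]
                for j in range(k):
                    for i in range(dim):
                        W[j][i] -= lr * err[j] * e[i]
                for i in range(dim):
                    e[i] -= lr * grad_e[i]  # adjust the current phrase vector
        history.append(total / count)
        if history[-1] < threshold:
            break
    return emb, W, history

phrases = ["I", "love", "Beijing", "Beijing", "is", "China", "'s", "capital"]
# Invented 2-dimensional EEG features, one per phrase position.
eeg = [[1.0, -0.5], [0.9, -0.4], [1.1, -0.6], [1.0, -0.5],
       [0.8, -0.5], [1.2, -0.4], [1.0, -0.6], [0.9, -0.5]]
emb, W, history = train_term_vectors(phrases, eeg)
```

On noisy data such as this the threshold is unlikely to be reached, so the loop stops at `max_epochs`; the point of the sketch is only that the error recorded in `history` falls as the phrase vectors are adjusted.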
In summary, with the above scheme the invention collects a text corpus and processes the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; presents the corpus entries in continuous phrase format to a labeler to read, and acquires the EEG signals produced as the labeler reads each phrase; and takes the EEG signals corresponding to the acquired phrases as the prediction target and trains term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model. This improves the accuracy of term vector calculation.
Corresponding to the above EEG-based term vector calculation method, the present invention also provides an EEG-based term vector calculation device.
Referring to Fig. 4, which is a functional block diagram of the first embodiment of the EEG-based term vector calculation device of the present invention, the first embodiment of the device comprises: a collection module 10, an acquisition module 20 and a construction module 30.
The collection module 10 is used to collect a text corpus and process the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit;
Specifically, corpus material is language material that has actually occurred in real language use. It is usually stored in a corpus, which is a database that holds corpus material on an electronic computer. Raw corpus material generally has to be analyzed and processed before it becomes a useful resource.
At present, Chinese corpora mainly include the General Modern Chinese Corpus, the People's Daily annotated corpus, Modern Chinese corpora for language teaching and research, and Modern Chinese corpora for speech signal analysis. When corpus material is needed, it can be obtained directly from these existing corpora. The invention can of course also obtain corpus material from other sources, for example from web pages on the Internet.
Term vectors are trained with phrases as the training data, whereas the entries in a corpus are usually sentences or articles. The entries therefore have to be processed into a continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing, Beijing is the capital of China" is processed into the continuous phrase sequence "I / love / Beijing / Beijing / is / China / 's / capital".
The acquisition module 20 is used to present the corpus entries in continuous phrase format to a labeler to read, and to acquire the EEG signals produced as the labeler reads each phrase.
Specifically, the invention represents term vectors by means of EEG signals. While reading the corpus presented in continuous phrase format, the labeler wears an EEG acquisition device so that the EEG signals produced while reading each phrase can be obtained. Once these EEG signals have been obtained, each acquired EEG signal is stored together with its corresponding phrase as a pair.
The construction module 30 is used to take the EEG signals corresponding to the acquired phrases as the prediction target and train term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted context EEG signals are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase, until the overall prediction error is less than the preset threshold. The preset error threshold can be set empirically, for example to 10^-5.
With the above scheme, in this embodiment the collection module 10 collects a text corpus and processes the corpus entries in it to obtain corpus entries in a continuous phrase format, with the phrase as the unit; the acquisition module 20 presents the corpus entries in continuous phrase format to a labeler to read, and acquires the EEG signals produced as the labeler reads each phrase; and the EEG signals corresponding to the acquired phrases are taken as the prediction target to train term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing an EEG-based term vector representation model. This improves the accuracy of term vector calculation.
As the second embodiment of the present invention, refer to Fig. 5, which is a detailed functional block diagram of the collection module 10 in the EEG-based term vector calculation device described with reference to Fig. 4. In this embodiment, the collection module 10 may comprise: a collection unit 101, a preprocessing unit 102, a word segmentation unit 103 and a conversion unit 104.
Wherein, for collector unit 101 for collecting text corpus, the corpus in the text corpus is sentence or a piece
Chapter rank;
Pretreatment unit 102 is more than that the first preset value or length are default less than second for removing length in text corpus
The corpus of value obtains pretreatment corpus, wherein the first preset value and the second preset value can be set by experience.
Participle unit 103 obtains word for that will pre-process corpus progress word segmentation processing;
Conversion unit 104 is used to utilize chunk parsing technology, converts phrase for word, obtains with the language of continuous phrase format
Material.
Specifically, the items in the text corpus collected by the collection module 10 are usually sentences or articles. Because a sentence may be too long or too short, a range of acceptable sentence lengths can be preset empirically: items whose length exceeds the first preset value or is less than the second preset value are removed, yielding the preprocessed corpus.
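The length filter described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `filter_by_length`, the use of character count as the length measure, and the example bounds are all hypothetical.

```python
def filter_by_length(corpus, min_len, max_len):
    """Drop items longer than max_len (the "first preset value")
    or shorter than min_len (the "second preset value").

    Length is measured in characters here, which is an assumption;
    the source does not say how length is measured.
    """
    return [item for item in corpus if min_len <= len(item) <= max_len]


# Hypothetical bounds chosen only for illustration.
sentences = ["Hi.", "I love Beijing.", "x" * 500]
preprocessed = filter_by_length(sentences, min_len=5, max_len=200)
```

Keeping both bounds as parameters mirrors the text, which leaves the two preset values to be chosen empirically.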
Since term vectors are trained with phrases as the training data, while the items in the corpus are usually sentences or articles, the corpus must be processed into continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing, Beijing is the capital of China" is processed into the continuous phrase sequence "I/love/Beijing/Beijing/is/China/'s/capital".
In this embodiment, the word segmentation unit 103 first performs word segmentation on the preprocessed corpus to obtain words, and the conversion unit 104 then uses chunk parsing technology to convert the words into phrases, obtaining the corpus in continuous phrase format.
Word segmentation relies mainly on a segmentation dictionary, whose quality directly determines the quality of the segmentation. The segmentation dictionaries in common use today are built on the Xinhua Dictionary or similar published reference works; in this embodiment, other segmentation dictionaries may also be used.
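To make the role of the segmentation dictionary concrete, here is a minimal sketch of forward maximum matching, one well-known dictionary-based segmentation strategy. The patent does not name a matching algorithm, so the choice of forward maximum matching, the function name `forward_max_match`, and the toy dictionary are assumptions. An unspaced ASCII string stands in for unsegmented Chinese text.

```python
def forward_max_match(text, dictionary):
    """Segment `text` by repeatedly taking the longest dictionary
    entry that starts at the current position.

    Falls back to a single character when nothing matches, so the
    loop always advances.
    """
    max_word_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (hypothetical); "be" and "jing" compete with "beijing".
toy_dictionary = {"i", "love", "beijing", "be", "jing"}
tokens = forward_max_match("ilovebeijing", toy_dictionary)
```

Because the longest entry wins, "beijing" is kept whole instead of being split into "be" and "jing", which shows why the coverage of the dictionary determines segmentation quality.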
Chunk parsing is a common technique in shallow syntactic analysis. It decomposes a sentence into components according to a predetermined model; these components are mainly word groups and longer phrases, so that the computer's understanding of a sentence rises from the level of single characters and words to phrases, which carry more information and are closer to natural language.
As a third embodiment of the present invention, refer to Fig. 6, which is a detailed functional block diagram of the building module 30 in the term vector calculation device based on EEG signals described with reference to Fig. 4. In this embodiment, the building module 30 may include: a noise reduction unit 301, a dimensionality reduction unit 302, an initialization unit 303 and a construction unit 304.
The noise reduction unit 301 is used to perform noise reduction on the acquired EEG signals, obtaining noise-reduced EEG signals;
While the EEG signals are being acquired as the labeler reads the corpus presented in continuous phrase format, they are easily affected by equipment noise and by factors such as electromyographic (EMG) and electrooculographic (EOG) signals. These EEG signals therefore need to be denoised to obtain noise-reduced EEG signals with a high signal-to-noise ratio.
The signal-to-noise ratio, abbreviated SNR or S/N, is the ratio of signal to noise in an electronic device or system. Here the signal is the electronic signal from outside the device that the device is required to process; the noise is any irregular additional signal (or information), absent from the original signal, that is produced after passing through the device and that does not vary with the original signal. SNR is measured in decibels (dB) and calculated as 10 lg(PS/PN), where lg is the base-10 logarithm and PS and PN are the effective power of the signal and of the noise respectively; the higher the SNR, the smaller the noise.
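The formula 10 lg(PS/PN) can be worked through in a few lines. This sketch assumes the signal and noise are available as separate sample sequences and estimates each power as the mean of the squared samples; the helper names `mean_power` and `snr_db` are hypothetical.

```python
import math


def mean_power(samples):
    """Estimate power as the mean of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(PS / PN)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


# A signal with 100 times the power of the noise gives about 20 dB.
example_snr = snr_db([2.0, -2.0, 2.0, -2.0], [0.2, -0.2, 0.2, -0.2])
```

In real recordings the clean signal and the noise are not observed separately, so both powers would themselves have to be estimated.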
In this embodiment, the noise reduction unit 301 uses the FastICA algorithm to project the EEG signals, acquired while the labeler reads the corpus presented in continuous phrase format, onto multiple independent components. It then identifies the noise components using spectral features, higher-order crossing features or the like, and removes them from the acquired EEG signals to obtain noise-reduced EEG signals with a high SNR. In this embodiment, the noise-reduced EEG signals preferably have an SNR above 15 dB.
Independent component analysis (ICA) is an effective data analysis tool, used mainly to extract the original independent signals from mixed data, and it has attracted wide attention as a signal separation method. Among the many ICA algorithms, the fixed-point algorithm (FastICA) is widely used in signal processing for its fast convergence and good separation performance. From the observed signals, it can estimate the mutually statistically independent source signals that were mixed by unknown factors.
The dimensionality reduction unit 302 is used to perform spatial projection and dimensionality reduction on the noise-reduced EEG signals;
Specifically, in this embodiment, the dimensionality reduction unit 302 uses the common spatial pattern (CSP) algorithm to project and reduce the dimension of the noise-reduced, high-SNR EEG signals of the different channels according to their spatial positions, obtaining dimension-reduced EEG signals. In this embodiment, the dimension-reduced EEG signals preferably have fewer than 300 dimensions.
The initialization unit 303 is used to initialize all phrases in the preprocessed corpus as term vector representations;
The construction unit 304 is used to traverse all phrases in the preprocessed corpus: with the term vector representation of the current phrase as the feature, it uses a neural network regression model to predict the EEG signals of the phrase's context, compares the predicted context EEG signals with the actual EEG signals to obtain a prediction error, and adjusts the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are those recorded when the labeler read the context. This step is repeated until the prediction error is less than a preset threshold.
In this embodiment the context window may be three: with the term vector representation of the current phrase as the feature, the neural network regression model predicts the EEG signals of the three phrases before it and the three phrases after it. The predicted context EEG signals are compared with the actual EEG signals to obtain a prediction error, and each error is back-propagated to adjust the vector representation of the current phrase until the overall error falls below a preset threshold, which can be set empirically to 10^-5.
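The training loop described above can be sketched in plain Python. This is an illustration under several assumptions, not the patented implementation: the patent does not specify the architecture of the neural network regression model, so a single linear map per context offset stands in for it. The function names, vector dimension, learning rate, epoch cap and the toy EEG features are all hypothetical.

```python
import random


def context_positions(i, n, window=3):
    """Positions of up to `window` phrases before and after position i."""
    return [j for j in range(i - window, i + window + 1)
            if j != i and 0 <= j < n]


def train_term_vectors(phrases, eeg, dim=4, window=3, lr=0.02,
                       threshold=1e-5, max_epochs=500, seed=0):
    """Fit term vectors so that each phrase's vector predicts the EEG
    features recorded for its neighbouring phrases.

    phrases: list of phrase strings in reading order (at least two).
    eeg:     eeg[j] is the denoised, dimension-reduced EEG feature
             vector recorded while phrase j was read.
    Returns (vectors, history); history holds each epoch's mean
    squared prediction error.
    """
    rng = random.Random(seed)
    eeg_dim = len(eeg[0])
    # Initialise one term vector per distinct phrase.
    vectors = {p: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
               for p in sorted(set(phrases))}
    # Stand-in regression model: one linear map per context offset.
    weights = {off: [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(eeg_dim)]
               for off in range(-window, window + 1) if off != 0}
    history = []
    for _ in range(max_epochs):
        total, count = 0.0, 0
        for i, phrase in enumerate(phrases):
            v = vectors[phrase]
            for j in context_positions(i, len(phrases), window):
                w = weights[j - i]
                pred = [sum(wk * x for wk, x in zip(row, v)) for row in w]
                err = [p - t for p, t in zip(pred, eeg[j])]
                total += sum(e * e for e in err)
                count += len(err)
                # Back-propagate the error to the term vector and weights.
                grad_v = [sum(err[k] * w[k][d] for k in range(eeg_dim))
                          for d in range(dim)]
                for k in range(eeg_dim):
                    for d in range(dim):
                        w[k][d] -= lr * err[k] * v[d]
                for d in range(dim):
                    v[d] -= lr * grad_v[d]
        history.append(total / count)
        if history[-1] < threshold:  # stop once the overall error is small
            break
    return vectors, history


# Made-up data: four phrases with 2-dimensional EEG features.
phrases = ["I", "love", "Beijing", "capital"]
eeg = [[0.5, -0.2], [0.1, 0.8], [-0.6, 0.3], [0.4, 0.4]]
vectors, history = train_term_vectors(phrases, eeg)
```

The `threshold` default mirrors the 10^-5 value mentioned in the text; whether a given run reaches it depends on the data and the hypothetical hyperparameters, which is why the sketch also caps the number of epochs.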
In summary, through the above scheme of the present invention, the collection module 10 collects a text corpus and processes it to obtain a corpus in continuous phrase format, with the phrase as the unit; the acquisition module 20 presents the corpus in continuous phrase format to a labeler to read and acquires the labeler's EEG signals as each phrase is read; and the building module 30 takes the EEG signals corresponding to the collected phrases as the prediction target and trains term vectors, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing a term vector representation model based on EEG signals and improving the accuracy of term vector calculation.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A term vector calculation method based on EEG signals, characterized in that the method comprises the following steps:
Step S1: collecting a text corpus and processing the corpus in it to obtain a corpus in continuous phrase format, with the phrase as the unit;
Step S2: presenting the corpus in continuous phrase format to a labeler to read, and acquiring the labeler's EEG signals as each phrase is read;
Step S3: taking the EEG signals corresponding to the collected phrases as the prediction target, training term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, and constructing a term vector representation model based on EEG signals.
2. The term vector calculation method based on EEG signals according to claim 1, characterized in that step S1 comprises the following sub-steps:
Step S11: collecting a text corpus, in which the corpus items are at sentence or article level;
Step S12: removing from the text corpus any item whose length exceeds a first preset value or is less than a second preset value, obtaining a preprocessed corpus;
Step S13: performing word segmentation on the preprocessed corpus to obtain words;
Step S14: converting the words into phrases using chunk parsing technology, obtaining a corpus in continuous phrase format.
3. The term vector calculation method based on EEG signals according to claim 2, characterized in that step S3 comprises the following sub-steps:
Step S31: performing noise reduction on the acquired EEG signals, obtaining noise-reduced EEG signals;
Step S32: performing spatial projection and dimensionality reduction on the noise-reduced EEG signals;
Step S33: initializing all phrases in the preprocessed corpus as term vector representations;
Step S34: traversing all phrases in the preprocessed corpus; with the term vector representation of the current phrase as the feature, using a neural network regression model to predict the EEG signals of its context; comparing the predicted context EEG signals with the actual EEG signals to obtain a prediction error; and adjusting the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are those recorded when the labeler read the context; this step being repeated until the prediction error is less than a preset threshold.
4. The term vector calculation method based on EEG signals according to claim 3, characterized in that step S31 comprises:
processing the acquired EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
and step S32 comprises:
performing spatial projection and dimensionality reduction, using the common spatial pattern algorithm, on the EEG signals whose signal-to-noise ratio is higher than the third preset value, to obtain EEG signals whose dimension is lower than a fourth preset value.
5. The term vector calculation method based on EEG signals according to claim 3, characterized in that the noise reduction of the acquired EEG signals uses the FastICA algorithm.
6. A term vector calculation device based on EEG signals, characterized in that the device comprises:
a collection module, used to collect a text corpus and process the corpus in it to obtain a corpus in continuous phrase format, with the phrase as the unit;
an acquisition module, used to present the corpus in continuous phrase format to a labeler to read, and to acquire the labeler's EEG signals as each phrase is read;
a building module, used to take the EEG signals corresponding to the collected phrases as the prediction target, train term vectors, use the term vector representation of the current phrase as the feature to predict the EEG signals of its context, and construct a term vector representation model based on EEG signals.
7. The term vector calculation device based on EEG signals according to claim 6, characterized in that the collection module comprises:
a collection unit, used to collect a text corpus, in which the corpus items are at sentence or article level;
a preprocessing unit, used to remove from the text corpus any item whose length exceeds a first preset value or is less than a second preset value, obtaining a preprocessed corpus;
a word segmentation unit, used to perform word segmentation on the preprocessed corpus to obtain words;
a conversion unit, used to convert the words into phrases using chunk parsing technology, obtaining a corpus in continuous phrase format.
8. The term vector calculation device based on EEG signals according to claim 7, characterized in that the building module comprises:
a noise reduction unit, used to perform noise reduction on the acquired EEG signals, obtaining noise-reduced EEG signals;
a dimensionality reduction unit, used to perform spatial projection and dimensionality reduction on the noise-reduced EEG signals;
an initialization unit, used to initialize all phrases in the preprocessed corpus as term vector representations;
a construction unit, used to traverse all phrases in the preprocessed corpus; with the term vector representation of the current phrase as the feature, to use a neural network regression model to predict the EEG signals of its context; to compare the predicted context EEG signals with the actual EEG signals to obtain a prediction error; and to adjust the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are those recorded when the labeler read the context; this step being repeated until the prediction error is less than a preset threshold.
9. The term vector calculation device based on EEG signals according to claim 8, characterized in that:
the noise reduction unit is further used to process the acquired EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
the dimensionality reduction unit is further used to perform spatial projection and dimensionality reduction, using the common spatial pattern algorithm, on the EEG signals whose signal-to-noise ratio is higher than the third preset value, to obtain EEG signals whose dimension is lower than a fourth preset value.
10. The term vector calculation device based on EEG signals according to claim 8, characterized in that the noise reduction unit is further used to perform noise reduction on the acquired EEG signals using the FastICA algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610907518.2A CN106502394B (en) | 2016-10-18 | 2016-10-18 | Term vector calculation method and device based on EEG signals |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610907518.2A CN106502394B (en) | 2016-10-18 | 2016-10-18 | Term vector calculation method and device based on EEG signals |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106502394A CN106502394A (en) | 2017-03-15 |
CN106502394B true CN106502394B (en) | 2019-06-25 |
Family
ID=58295164
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610907518.2A Active CN106502394B (en) | 2016-10-18 | 2016-10-18 | Term vector calculation method and device based on EEG signals |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106502394B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881665B (en) * | 2020-09-27 | 2021-01-05 | 华南师范大学 | Word embedding representation method, device and equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104881401A (en) * | 2015-05-27 | 2015-09-02 | 大连理工大学 | Patent literature clustering method |
CN104965822A (en) * | 2015-07-29 | 2015-10-07 | 中南大学 | Emotion analysis method for Chinese texts based on computer information processing technology |
CN105701084A (en) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information |
Also Published As
Publication number | Publication date |
---|---|
CN106502394A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107229610B (en) | A kind of analysis method and device of affection data | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
CN108733653A (en) | A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information | |
Nayel et al. | Machine learning-based model for sentiment and sarcasm detection | |
CN110704621A (en) | Text processing method and device, storage medium and electronic equipment | |
CN107480136B (en) | Method applied to emotional curve analysis in movie script | |
CN106886580A (en) | A kind of picture feeling polarities analysis method based on deep learning | |
CN111125360B (en) | Emotion analysis method and device in game field and model training method and device thereof | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
CN108319581A (en) | A kind of natural language sentence evaluation method and device | |
CN107943786A (en) | A kind of Chinese name entity recognition method and system | |
CN108090099B (en) | Text processing method and device | |
CN109934251A (en) | A kind of method, identifying system and storage medium for rare foreign languages text identification | |
CN103744838B (en) | A kind of Chinese emotion digest system and method for measuring main flow emotion information | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
Hasan et al. | Sentiment classification in bangla textual content: A comparative study | |
Choe et al. | Bridging the gap for tokenizer-free language models | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
CN112966508A (en) | General automatic term extraction method | |
Dong | Intelligent English teaching prediction system based on SVM and heterogeneous multimodal target recognition | |
CN111832302A (en) | Named entity identification method and device | |
CN106502394B (en) | Term vector calculation method and device based on EEG signals | |
CN116364072B (en) | Education information supervision method based on artificial intelligence | |
CN110929022A (en) | Text abstract generation method and system | |
CN110096696A (en) | A kind of Chinese long text sentiment analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||