CN106502394B - Term vector calculation method and device based on EEG signals - Google Patents

Publication number
CN106502394B
Authority
CN
China
Prior art keywords
eeg signals
corpus
phrase
term vector
labeler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610907518.2A
Other languages
Chinese (zh)
Other versions
CN106502394A (en)
Inventor
徐睿峰
杜嘉晨
桂林
黄锦辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201610907518.2A
Publication of CN106502394A
Application granted
Publication of CN106502394B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/015 Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01 Indexing scheme relating to G06F3/01
    • G06F2203/011 Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02 Preprocessing

Abstract

The present invention provides a term vector calculation method and device based on EEG signals. The method comprises: step S1, collecting a text corpus and processing the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit; step S2, presenting the corpus material in continuous phrase format to a labeler to read, and acquiring the EEG signals produced as the labeler reads each phrase; and step S3, training term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, and constructing a term vector representation model based on EEG signals. With this scheme, the invention improves the accuracy of term vector calculation.

Description

Term vector calculation method and device based on EEG signals
Technical field
The invention belongs to the technical field of natural language processing, and in particular relates to a term vector calculation method and device based on EEG signals.
Background art
In natural language processing tasks, term vectors are commonly used to represent the words of the original text so that numerical machine learning algorithms can be applied to text data. The basic idea of a term vector model is that, through extensive training, each word of a language is mapped to a vector of fixed length. This length is generally much smaller than the size of the language's dictionary, typically tens to a few hundred dimensions. Together these vectors form a term vector space in which each vector is a point. Once a measure of "distance" is introduced on this space, the syntactic and semantic similarity between words can be judged from the distance between their term vectors. Traditional term vector calculation methods optimize the representation of the current text by trying to predict the vectors of its context from the current text's vector as accurately as possible.
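The idea of judging similarity from the distance between term vectors can be illustrated with a minimal sketch. The patent does not name a particular distance measure; cosine similarity is assumed here only because it is a common choice, and the toy vectors are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional term vectors (illustrative values only).
toy_vectors = {
    "Beijing": [0.9, 0.1, 0.3],
    "capital": [0.8, 0.2, 0.4],
    "love": [-0.2, 0.9, 0.1],
}

sim_close = cosine_similarity(toy_vectors["Beijing"], toy_vectors["capital"])
sim_far = cosine_similarity(toy_vectors["Beijing"], toy_vectors["love"])
print(sim_close > sim_far)
```

A higher value means the two vectors point in more similar directions, which is read as the two words being more similar.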
In the traditional term vector calculation process, predicting the context from the current text is the primary training objective. This approach has three main shortcomings:
1. Only the syntax-level attributes of words are considered, not their semantic-level attributes, so the trained term vectors can usually express only relatively shallow relationships between words;
2. The human language cognitive process is not modeled, so important cognitive neuroscience and psychological features are ignored;
3. Because of the complexity of the human language cognitive mechanism, term vectors obtained by simply predicting the context cannot reflect the characteristics of different natural language processing tasks, and their generality is poor.
Summary of the invention
The purpose of the present invention is to provide a term vector calculation method and device based on EEG signals, with the aim of improving the accuracy of term vector calculation.
The invention is realized as follows: a term vector calculation method based on EEG signals, the method comprising the following steps:
Step S1: collect a text corpus and process the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit;
Step S2: present the corpus material in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase;
Step S3: with the EEG signals corresponding to the collected phrases as the prediction target, train term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, and construct a term vector representation model based on EEG signals.
In a further technical solution of the invention, step S1 comprises the following sub-steps:
Step S11: collect a text corpus, the corpus material in the text corpus being at sentence level or document level;
Step S12: remove from the text corpus any corpus material whose length exceeds a first preset value or is less than a second preset value, obtaining preprocessed corpus material;
Step S13: perform word segmentation on the preprocessed corpus material to obtain words;
Step S14: use chunk parsing to convert the words into phrases, obtaining corpus material in continuous phrase format.
In a further technical solution of the invention, step S3 comprises the following sub-steps:
Step S31: perform noise reduction on the collected EEG signals to obtain denoised EEG signals;
Step S32: perform spatial projection and dimensionality reduction on the denoised EEG signals;
Step S33: initialize all phrases in the preprocessed corpus material as term vector representations;
Step S34: traverse all phrases in the preprocessed corpus material; using the term vector representation of the current phrase as the feature, predict the EEG signals of its context with a neural network regression model; compare the predicted EEG signals of the context with the actual EEG signals to obtain a prediction error; and adjust the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are the EEG signals recorded while the labeler reads the context. This step is repeated until the prediction error is less than a preset threshold.
In a further technical solution of the invention, step S31 comprises:
processing the collected EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
and step S32 comprises:
using the common spatial pattern (CSP) algorithm to perform spatial projection and dimensionality reduction on the EEG signals whose signal-to-noise ratio is higher than the third preset value, obtaining EEG signals whose dimension is lower than a fourth preset value.
In a further technical solution of the invention, the noise reduction of the collected EEG signals uses the FastICA algorithm.
The present invention also provides a term vector computing device based on EEG signals, the device comprising:
a collection module, for collecting a text corpus and processing the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit;
an acquisition module, for presenting the corpus material in continuous phrase format to a labeler to read, and acquiring the EEG signals produced as the labeler reads each phrase;
a construction module, for training term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, and constructing a term vector representation model based on EEG signals.
In a further technical solution of the invention, the collection module comprises:
a collection unit, for collecting a text corpus, the corpus material in the text corpus being at sentence level or document level;
a preprocessing unit, for removing from the text corpus any corpus material whose length exceeds a first preset value or is less than a second preset value, obtaining preprocessed corpus material;
a word segmentation unit, for performing word segmentation on the preprocessed corpus material to obtain words;
a conversion unit, for using chunk parsing to convert the words into phrases, obtaining corpus material in continuous phrase format.
In a further technical solution of the invention, the construction module comprises:
a noise reduction unit, for performing noise reduction on the collected EEG signals to obtain denoised EEG signals;
a dimensionality reduction unit, for performing spatial projection and dimensionality reduction on the denoised EEG signals;
an initialization unit, for initializing all phrases in the preprocessed corpus material as term vector representations;
a construction unit, for traversing all phrases in the preprocessed corpus material; using the term vector representation of the current phrase as the feature, predicting the EEG signals of its context with a neural network regression model; comparing the predicted EEG signals of the context with the actual EEG signals to obtain a prediction error; and adjusting the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are the EEG signals recorded while the labeler reads the context. This step is repeated until the prediction error is less than a preset threshold.
In a further technical solution of the invention, the noise reduction unit is further configured to process the collected EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
the dimensionality reduction unit is further configured to use the common spatial pattern (CSP) algorithm to perform spatial projection and dimensionality reduction on the EEG signals whose signal-to-noise ratio is higher than the third preset value, obtaining EEG signals whose dimension is lower than a fourth preset value.
In a further technical solution of the invention, the noise reduction unit is further configured to perform noise reduction on the collected EEG signals using the FastICA algorithm.
The beneficial effects of the present invention are as follows. The term vector calculation method and device based on EEG signals provided by the invention collect a text corpus and process the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit; present the corpus material in continuous phrase format to a labeler to read and acquire the EEG signals produced as the labeler reads each phrase; and train term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing a term vector representation model based on EEG signals. This improves the accuracy of term vector calculation.
Brief description of the drawings
Fig. 1 is a flow diagram of the first embodiment of the term vector calculation method based on EEG signals according to the present invention;
Fig. 2 is a detailed flow diagram of step S1 in the second embodiment of the term vector calculation method based on EEG signals according to the present invention;
Fig. 3 is a detailed flow diagram of step S3 in the third embodiment of the term vector calculation method based on EEG signals according to the present invention;
Fig. 4 is a functional block diagram of the first embodiment of the term vector computing device based on EEG signals according to the present invention;
Fig. 5 is a detailed functional block diagram of the collection module in the second embodiment of the term vector computing device based on EEG signals according to the present invention;
Fig. 6 is a detailed functional block diagram of the construction module in the third embodiment of the term vector computing device based on EEG signals according to the present invention.
Reference numerals:
Collection module 10: collection unit 101; preprocessing unit 102; word segmentation unit 103; conversion unit 104;
Acquisition module 20;
Construction module 30: noise reduction unit 301; dimensionality reduction unit 302; initialization unit 303; construction unit 304.
Detailed description of the embodiments
The solution of the embodiments of the present invention is mainly as follows: collect a text corpus and process the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit; present the corpus material in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase; and train term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing a term vector representation model based on EEG signals.
Referring to Fig. 1, Fig. 1 is a flow diagram of the first embodiment of the term vector calculation method based on EEG signals according to the present invention. As shown in Fig. 1, the first embodiment of the method comprises the following steps:
Step S1: collect a text corpus and process the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit;
Specifically, corpus material is language material that has actually occurred in real language use. It is usually stored in a corpus, which is a database that holds corpus material on an electronic computer. Raw corpus material generally has to be analyzed and processed before it becomes a useful resource.
At present, Chinese corpora mainly include the general Modern Chinese corpus, the People's Daily annotated corpus, Modern Chinese corpora for language teaching and research, and Modern Chinese corpora for speech signal analysis. When corpus material is needed, it can be obtained directly from these established corpora. Of course, the invention can also be implemented with corpus material from other sources, for example from Internet web pages.
Because term vectors are trained with phrases as training data, whereas the corpus material in a corpus usually consists of sentences or articles, the corpus material must be processed to obtain corpus material in continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing; Beijing is the capital of China" is processed into the continuous phrase sequence "I/love/Beijing/Beijing/is/China/'s/capital".
Step S2: present the corpus material in continuous phrase format to a labeler to read, and acquire the EEG signals produced as the labeler reads each phrase. Here the labeler is the user who reads the corpus material presented in continuous phrase format.
Specifically, the present invention represents term vectors by means of EEG signals. While reading the corpus material presented in continuous phrase format, the labeler wears an EEG signal acquisition device so that the EEG signals produced while reading each phrase can be obtained. After these EEG signals are obtained, each collected EEG signal is stored together with its corresponding phrase as a pair.
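The pairing of phrases with EEG signals described above might be sketched as follows. The patent does not say how the recording is aligned with the phrases, so this sketch assumes that each phrase has a known onset time and that a fixed-length window of samples after the onset is kept. The names `PhraseEEGPair`, `pair_phrases_with_eeg`, `onsets_s`, `sampling_rate_hz` and `epoch_s` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhraseEEGPair:
    phrase: str
    eeg: List[List[float]]  # one list of samples per channel


def pair_phrases_with_eeg(
    phrases: Sequence[str],
    onsets_s: Sequence[float],          # assumed: onset time of each phrase, in seconds
    recording: Sequence[Sequence[float]],  # continuous EEG, channels x samples
    sampling_rate_hz: float,            # assumed: known sampling rate
    epoch_s: float,                     # assumed: fixed window length per phrase
) -> List[PhraseEEGPair]:
    """Cut one fixed-length EEG segment per phrase and store it with the phrase."""
    n_samples = int(round(epoch_s * sampling_rate_hz))
    pairs = []
    for phrase, onset in zip(phrases, onsets_s):
        start = int(round(onset * sampling_rate_hz))
        segment = [list(channel[start:start + n_samples]) for channel in recording]
        pairs.append(PhraseEEGPair(phrase=phrase, eeg=segment))
    return pairs


# Toy recording: 2 channels, 10 samples each, sampled at 10 Hz.
toy_recording = [list(range(0, 10)), list(range(10, 20))]
toy_pairs = pair_phrases_with_eeg(
    ["I", "love"], [0.0, 0.5], toy_recording, sampling_rate_hz=10.0, epoch_s=0.3
)
for pair in toy_pairs:
    print(pair.phrase, pair.eeg)
```

Storing the phrase and its segment in one record keeps the later training step simple, since each prediction target can be looked up by phrase position.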
Step S3: with the EEG signals corresponding to the collected phrases as the prediction target, train term vectors, using the term vector representation of the current phrase as the feature to predict the EEG signals of its context, and construct a term vector representation model based on EEG signals.
Specifically, all phrases in the preprocessed corpus material may first be initialized as term vector representations. All phrases in the preprocessed corpus material are then traversed: using the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of its context; the predicted EEG signals of the context are compared with the actual EEG signals to obtain a prediction error; and the term vector representation of the current phrase is adjusted according to the prediction error, where the actual EEG signals are the EEG signals recorded while the labeler reads the context. This step is repeated until the prediction error is less than a preset threshold.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, the neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted EEG signals of the context are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase.
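The context window of three can be made concrete with a small sketch that lists, for a given phrase position, the positions of the phrases whose EEG signals would be predicted. The function name `context_indices` is hypothetical, and clipping the window at the ends of the sequence is an assumption, since the text does not say how sentence boundaries are handled.

```python
def context_indices(num_phrases, center, window=3):
    """Return positions of up to `window` phrases before and after `center`."""
    start = max(0, center - window)
    end = min(num_phrases, center + window + 1)
    return [i for i in range(start, end) if i != center]


phrases = ["I", "love", "Beijing", "Beijing", "is", "China", "'s", "capital"]
for position, phrase in enumerate(phrases):
    context = [phrases[i] for i in context_indices(len(phrases), position)]
    print(phrase, "->", context)
```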
With the above scheme, this embodiment collects a text corpus and processes the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit; presents the corpus material in continuous phrase format to a labeler to read and acquires the EEG signals produced as the labeler reads each phrase; and trains term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing a term vector representation model based on EEG signals. This improves the accuracy of term vector calculation.
As the second embodiment of the present invention, referring to Fig. 2, Fig. 2 is a detailed flow diagram of step S1 in the term vector calculation method based on EEG signals described with reference to Fig. 1. Step S1, in which a text corpus is collected and the corpus material in it is processed to obtain corpus material in continuous phrase format with the phrase as the unit, may comprise:
Step S11: collect a text corpus, the corpus material in the text corpus being at sentence level or document level;
Step S12: remove from the text corpus any corpus material whose length exceeds a first preset value or is less than a second preset value, obtaining preprocessed corpus material;
Step S13: perform word segmentation on the preprocessed corpus material to obtain words;
Step S14: use chunk parsing to convert the words into phrases, obtaining corpus material in continuous phrase format.
Specifically, the corpus material collected from the text corpus usually consists of sentences or articles. Because a sentence may be too long or too short, a range of sentence lengths can be preset empirically, and corpus material whose length exceeds the first preset value or is less than the second preset value is removed to obtain the preprocessed corpus material. The first preset value and the second preset value can be set empirically.
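The length filter of step S12 can be sketched in a few lines. Measuring length in characters and the particular limits used below are assumptions; the patent only says that the two preset values are set empirically.

```python
def filter_by_length(sentences, first_preset, second_preset):
    """Keep sentences whose length is within [second_preset, first_preset].

    first_preset is the upper limit and second_preset the lower limit,
    following the order in which the text introduces them.
    """
    return [s for s in sentences if second_preset <= len(s) <= first_preset]


raw = ["Hi", "I love Beijing", "x" * 500]
# Illustrative limits only: at most 100 characters, at least 5.
preprocessed = filter_by_length(raw, first_preset=100, second_preset=5)
print(preprocessed)
```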
Because term vectors are trained with phrases as training data, whereas the corpus material in a corpus usually consists of sentences or articles, the corpus material must be processed to obtain corpus material in continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing; Beijing is the capital of China" is processed into the continuous phrase sequence "I/love/Beijing/Beijing/is/China/'s/capital".
In this embodiment, the preprocessed corpus material may first be segmented to obtain words, and chunk parsing is then used to convert the words into phrases, obtaining corpus material in continuous phrase format.
Word segmentation relies mainly on a segmentation dictionary, whose quality directly determines the quality of the segmentation. Segmentation dictionaries in common use today are built on the Xinhua Dictionary or similar published reference works; this embodiment may also rely on other segmentation dictionaries to perform word segmentation.
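To show how a segmentation dictionary drives word segmentation, here is a minimal sketch using forward maximum matching. The patent does not specify a segmentation algorithm, so this technique is an assumption chosen for its simplicity. The tiny dictionary and the Chinese sentence (an assumed original of the example sentence above, with punctuation removed) are illustrative only.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy dictionary segmentation: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"北京", "中国", "首都"}  # hypothetical segmentation dictionary
segmented = forward_max_match("我爱北京北京是中国的首都", toy_dictionary)
print("/".join(segmented))
```

With a richer dictionary the same loop produces longer matches, which is why the quality of the dictionary bounds the quality of the segmentation.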
Chunk parsing is a common technique in shallow syntactic analysis. It decomposes a sentence into components according to a predetermined model. These components are mainly phrases and longer word groups, so that the computer's understanding of a sentence can rise from the level of single characters and words to word groups and phrases, which carry more information and are closer to natural language.
As the third embodiment of the present invention, referring to Fig. 3, Fig. 3 is a detailed flow diagram of step S3 in the term vector calculation method based on EEG signals described with reference to Fig. 1. Step S3, in which term vectors are trained with the EEG signals corresponding to the collected phrases as the prediction target, the current phrase is used as the feature to predict the EEG signals of its context, and a term vector representation model based on EEG signals is constructed, may comprise:
Step S31: perform noise reduction on the collected EEG signals to obtain denoised EEG signals;
While the EEG signals are being acquired as the labeler reads the corpus material presented in continuous phrase format, they are easily affected by factors such as equipment noise, electromyographic (EMG) signals and electro-oculographic (EOG) signals. These EEG signals therefore need to be denoised to obtain EEG signals with a high signal-to-noise ratio.
Signal-to-noise ratio (SNR or S/N) is the ratio of signal to noise in an electronic device or electronic system. Here the signal is the electronic signal from outside the device that the device is to process, and the noise is the random extra signal (or information) that is not present in the original signal and is produced as the signal passes through the device; this noise does not vary with the original signal. The signal-to-noise ratio is measured in dB and calculated as 10 lg(PS/PN), where PS and PN are the effective powers of the signal and the noise respectively. The higher the signal-to-noise ratio, the smaller the noise.
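The formula 10 lg(PS/PN) can be written as a short function. Estimating power as the mean of squared samples is an assumption for this sketch; the text only states that PS and PN are the effective powers of signal and noise.

```python
import math


def mean_power(samples):
    """Estimate power as the mean of the squared sample values (assumed)."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(PS / PN)."""
    return 10.0 * math.log10(signal_power / noise_power)


toy_signal = [2.0, -2.0, 2.0, -2.0]   # illustrative values
toy_noise = [0.2, -0.2, 0.2, -0.2]
print(snr_db(mean_power(toy_signal), mean_power(toy_noise)))
```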
In this embodiment, the FastICA algorithm is used to project the EEG signals collected while the labeler reads the corpus material presented in continuous phrase format into multiple independent components. Noise is then identified using spectral features, higher-order cross features or the like, and the noise components are removed from the collected EEG signals to obtain denoised EEG signals with a high signal-to-noise ratio. In this embodiment the denoised EEG signals preferably have a signal-to-noise ratio higher than 15 dB.
Independent component analysis (ICA) is a very effective data analysis tool, used mainly to extract the original independent signals from mixed data. It has attracted wide attention as an effective method of signal separation. Among the many ICA algorithms, the fixed-point algorithm (FastICA) is widely used in signal processing because of its fast convergence and good separation performance. It can estimate, from the observed signals, original signals that are mutually statistically independent and have been mixed by unknown factors.
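The decompose, remove and reconstruct pattern described above might look like the following sketch. It assumes NumPy and scikit-learn's `FastICA` class, which the patent does not mention. The indices of the noisy components are passed in by the caller, because the selection by spectral or higher-order cross features is not detailed in the text; the function name and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FastICA


def remove_ica_components(eeg, noisy_components, random_state=0):
    """Denoise EEG by zeroing selected independent components.

    eeg: array of shape (n_samples, n_channels).
    noisy_components: indices of components judged to be noise
    (how they are chosen is outside this sketch).
    """
    ica = FastICA(random_state=random_state)
    sources = ica.fit_transform(eeg)           # (n_samples, n_components)
    idx = np.asarray(list(noisy_components), dtype=int)
    sources[:, idx] = 0.0                      # drop the noise components
    return ica.inverse_transform(sources)      # back to channel space


# Synthetic 3-channel mixture of three independent sources (illustrative).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 500)
true_sources = np.c_[
    np.sin(2 * np.pi * 5 * t),
    np.sign(np.sin(2 * np.pi * 3 * t)),
    rng.laplace(size=t.size),
]
mixing = np.array([[1.0, 0.5, 0.3], [0.4, 1.0, 0.2], [0.3, 0.6, 1.0]])
toy_eeg = true_sources @ mixing.T
cleaned = remove_ica_components(toy_eeg, noisy_components=[0])
print(cleaned.shape)
```

Note that ICA returns components in arbitrary order, so in practice each component has to be inspected or scored before deciding which index to remove.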
Step S32: perform spatial projection and dimensionality reduction on the denoised EEG signals;
Specifically, in this embodiment the common spatial pattern (CSP) algorithm is used to project and reduce the dimensionality of the denoised, high signal-to-noise-ratio EEG signals of the different channels according to their spatial positions, yielding dimension-reduced EEG signals. In this embodiment the dimension-reduced EEG signals preferably have fewer than 300 dimensions.
Step S33: initialize all phrases in the preprocessed corpus material as term vector representations;
Step S34: traverse all phrases in the preprocessed corpus material; using the term vector representation of the current phrase as the feature, predict the EEG signals of its context with a neural network regression model; compare the predicted EEG signals of the context with the actual EEG signals to obtain a prediction error; and adjust the term vector representation of the current phrase according to the prediction error, where the actual EEG signals are the EEG signals recorded while the labeler reads the context. This step is repeated until the overall prediction error is less than a preset threshold.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, the neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted EEG signals of the context are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase, until the overall prediction error is less than the preset threshold. The preset error threshold can be set empirically to 10^-5.
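Steps S33 and S34 can be sketched as a small training loop. The patent specifies a neural network regression model but not its architecture, loss or optimizer; this sketch assumes a single linear layer, squared error and plain stochastic gradient descent, and it treats each phrase's EEG signal as a short feature vector. All names, sizes and hyperparameters are hypothetical.

```python
import random


def train_eeg_term_vectors(phrases, eeg, dim=4, window=3, lr=0.05,
                           threshold=1e-5, max_epochs=200, seed=0):
    """phrases: list of phrase strings; eeg[i]: EEG feature vector recorded
    while the labeler read phrases[i]. Returns (term vectors, error history)."""
    rng = random.Random(seed)
    out_dim = len(eeg[0])
    # Step S33: initialize one term vector per distinct phrase.
    vectors = {p: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
               for p in sorted(set(phrases))}
    # Assumed regression model: one linear layer, pred = W v + b.
    W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    history = []
    for _ in range(max_epochs):
        total, count = 0.0, 0
        # Step S34: traverse every phrase and every context position.
        for t, phrase in enumerate(phrases):
            v = vectors[phrase]
            lo, hi = max(0, t - window), min(len(phrases), t + window + 1)
            for c in range(lo, hi):
                if c == t:
                    continue
                pred = [sum(W[o][j] * v[j] for j in range(dim)) + b[o]
                        for o in range(out_dim)]
                err = [pred[o] - eeg[c][o] for o in range(out_dim)]
                total += sum(e * e for e in err) / out_dim
                count += 1
                # Gradient for the term vector, computed before W changes.
                grad_v = [sum(err[o] * W[o][j] for o in range(out_dim))
                          for j in range(dim)]
                for o in range(out_dim):
                    for j in range(dim):
                        W[o][j] -= lr * err[o] * v[j]
                    b[o] -= lr * err[o]
                for j in range(dim):
                    v[j] -= lr * grad_v[j]  # adjust the current phrase's vector
        history.append(total / count)
        if history[-1] < threshold:  # stop once the overall error is small enough
            break
    return vectors, history


toy_phrases = ["I", "love", "Beijing", "Beijing", "is", "China", "'s", "capital"]
toy_eeg = [[1.0 + 0.01 * i, -1.0 + 0.01 * i] for i in range(len(toy_phrases))]
toy_vectors, toy_history = train_eeg_term_vectors(toy_phrases, toy_eeg)
print(len(toy_vectors), toy_history[0] > toy_history[-1])
```

Because one prediction is compared against several different context signals, the error on real data may never reach a very small threshold, so the sketch also caps the number of epochs.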
In conclusion the present invention is through the above scheme, text corpus is collected, at the corpus in text corpus Reason, obtains the corpus of the continuous phrase format as unit of phrase;The corpus of continuous phrase format is presented to labeler, for mark Note person reads, and acquisition labeler reads EEG signals when each phrase;Using the corresponding EEG signals of collected phrase as Predict target, training term vector is predicted the EEG signals of its context characterized by current phrase, constructed based on EEG signals Term vector indicates model, improves the accuracy of term vector calculating.
Corresponding to the above term vector calculation method based on EEG signals, the present invention also provides a term vector computing device based on EEG signals.
Referring to Fig. 4, Fig. 4 is a functional block diagram of the first embodiment of the term vector computing device based on EEG signals according to the present invention. As shown in Fig. 4, the first embodiment of the device comprises a collection module 10, an acquisition module 20 and a construction module 30.
The collection module 10 is used to collect a text corpus and process the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit;
Specifically, corpus material is language material that has actually occurred in real language use. It is usually stored in a corpus, which is a database that holds corpus material on an electronic computer. Raw corpus material generally has to be analyzed and processed before it becomes a useful resource.
At present, Chinese corpora mainly include the general Modern Chinese corpus, the People's Daily annotated corpus, Modern Chinese corpora for language teaching and research, and Modern Chinese corpora for speech signal analysis. When corpus material is needed, it can be obtained directly from these established corpora. Of course, the invention can also be implemented with corpus material from other sources, for example from Internet web pages.
Because term vectors are trained with phrases as training data, whereas the corpus material in a corpus usually consists of sentences or articles, the corpus material must be processed to obtain corpus material in continuous phrase format, with the phrase as the unit. For example, the sentence "I love Beijing; Beijing is the capital of China" is processed into the continuous phrase sequence "I/love/Beijing/Beijing/is/China/'s/capital".
The acquisition module 20 is used to present the corpus material in continuous phrase format to a labeler to read, and to acquire the EEG signals produced as the labeler reads each phrase.
Specifically, the present invention represents term vectors by means of EEG signals. While reading the corpus material presented in continuous phrase format, the labeler wears an EEG signal acquisition device so that the EEG signals produced while reading each phrase can be obtained. After these EEG signals are obtained, each collected EEG signal is stored together with its corresponding phrase as a pair.
The construction module 30 is used to train term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, and to construct a term vector representation model based on EEG signals.
In this embodiment the context window may be three: using the term vector representation of the current phrase as the feature, the neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted EEG signals of the context are compared with the actual EEG signals to obtain a prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase, until the overall prediction error is less than the preset threshold. The preset error threshold can be set empirically to 10^-5.
With the above scheme, in this embodiment the collection module 10 collects a text corpus and processes the corpus material in it to obtain corpus material in continuous phrase format, with the phrase as the unit; the acquisition module 20 presents the corpus material in continuous phrase format to a labeler to read and acquires the EEG signals produced as the labeler reads each phrase; and the construction module 30 trains term vectors with the EEG signals corresponding to the collected phrases as the prediction target, using the current phrase as the feature to predict the EEG signals of its context, thereby constructing a term vector representation model based on EEG signals. This improves the accuracy of term vector calculation.
As the second embodiment of the present invention, refer to Fig. 5, which is a detailed functional block diagram of the collection module 10 in the EEG-based term vector computing device described with reference to Fig. 4. In the present embodiment, the collection module 10 may include: a collection unit 101, a preprocessing unit 102, a word segmentation unit 103 and a conversion unit 104.
The collection unit 101 is used to collect a text corpus, in which the corpus entries are at the sentence level or the document level;
The preprocessing unit 102 is used to remove from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, yielding a preprocessed corpus, where the first preset value and the second preset value can be set empirically.
The word segmentation unit 103 is used to perform word segmentation on the preprocessed corpus to obtain words;
The conversion unit 104 is used to convert the words into phrases by means of chunk parsing, yielding a corpus in continuous phrase format.
Specifically, the corpus entries in the text corpus collected by the collection module 10 are usually sentences or articles. Because a sentence may be too long or too short, a range of sentence lengths can be preset empirically, and the entries whose length exceeds the first preset value or is less than the second preset value are removed from the corpus, yielding the preprocessed corpus.
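The length-based preprocessing can be sketched as a simple filter. The function name `filter_by_length`, the parameter names and the choice of measuring length in characters are assumptions for this example; the patent only states that both limits are set empirically.

```python
def filter_by_length(corpus, max_length, min_length):
    """Drop entries longer than `max_length` (the first preset value)
    or shorter than `min_length` (the second preset value).

    Length is measured in characters here, which is an assumption.
    """
    return [text for text in corpus if min_length <= len(text) <= max_length]
```

Entries whose length equals one of the limits are kept, since the text removes only entries that exceed the first value or fall below the second.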
Because term vectors are trained with phrases as the training data, whereas the corpus entries are usually sentences or articles, the corpus must be processed to obtain a corpus in continuous phrase format with the phrase as the unit. For example, if the corpus entry is the sentence "I love Beijing, Beijing is the capital of China", it is processed into the continuous phrase sequence "I / love / Beijing / Beijing / is / China / 's / capital".
In the present embodiment, the word segmentation unit 103 may first perform word segmentation on the preprocessed corpus to obtain words, after which the conversion unit 104 uses chunk parsing to convert the words into phrases, yielding a corpus in continuous phrase format.
Word segmentation is realized mainly by relying on a segmentation dictionary, and the quality of the segmentation dictionary directly determines the quality of the segmentation. The segmentation dictionaries in common use today are built on the Xinhua Dictionary or on similar published reference works; the present embodiment may also rely on other segmentation dictionaries to perform word segmentation.
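As an illustration of dictionary-based word segmentation, the sketch below uses forward maximum matching, a common dictionary-driven technique. The patent does not name a specific segmentation algorithm, so the choice of forward maximum matching, the function name `segment` and the toy dictionary are all assumptions. An unspaced ASCII string stands in for unsegmented text.

```python
def segment(text, dictionary):
    """Segment `text` by forward maximum matching against `dictionary`.

    At each position the longest dictionary word is taken; if no word
    matches, a single character is emitted as its own token.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens
```

The sketch also shows why dictionary quality matters: any word missing from the dictionary is broken into single characters.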
Chunk parsing is a common technique in shallow syntactic analysis. It decomposes a sentence into components according to a predetermined model. These components are mainly phrases and longer word groups, so that the computer's understanding of a sentence can rise from the level of individual characters and words to that of phrases, which carry more information and are closer to natural language.
As the third embodiment of the present invention, refer to Fig. 6, which is a detailed functional block diagram of the construction module 30 in the EEG-based term vector computing device described with reference to Fig. 4. In the present embodiment, the construction module 30 may include: a noise reduction unit 301, a dimensionality reduction unit 302, an initialization unit 303 and a construction unit 304.
The noise reduction unit 301 is used to perform noise reduction on the collected EEG signals to obtain denoised EEG signals;
While the EEG signals are being acquired as the labeler reads the corpus presented in continuous phrase format, the recording is easily affected by factors such as equipment noise, electromyographic (EMG) signals and electrooculographic (EOG) signals. The EEG signals recorded while the labeler reads the corpus therefore need to be denoised to obtain EEG signals with a high signal-to-noise ratio.
The signal-to-noise ratio, abbreviated SNR or S/N, is the ratio of signal to noise in an electronic device or electronic system. Here the signal is the electronic signal from outside the device that needs to be processed by the device, while the noise is the irregular additional signal (or information) that is not present in the original signal and is produced as the signal passes through the device; this noise does not vary with the original signal. The SNR is measured in dB and is calculated as 10 · lg(PS/PN), where lg is the base-10 logarithm and PS and PN are the effective power of the signal and of the noise respectively. The higher the SNR, the smaller the noise.
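The SNR formula above can be written directly in code. The function name `snr_db` is chosen for this example only.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in dB: 10 * log10(PS / PN)."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)
```

For instance, a signal with 100 times the power of the noise has an SNR of 20 dB, and an SNR of 15 dB corresponds to a power ratio of about 31.6.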
In the present embodiment, the noise reduction unit 301 uses the FastICA algorithm to project the EEG signals, collected while the labeler reads the corpus presented in continuous phrase format, onto multiple independent components. Noise components are then identified using spectral features, higher-order crossing features or the like, and are removed from the collected EEG signals, yielding denoised EEG signals with a high signal-to-noise ratio. In the present embodiment, the denoised high-SNR EEG signals are preferably EEG signals whose signal-to-noise ratio is higher than 15 dB.
Independent component analysis (ICA) is a very effective data analysis tool, used mainly to extract the original independent signals from mixed data. It has received wide attention as an effective method of signal separation. Among the many ICA algorithms, the fixed-point algorithm (FastICA) is widely used in signal processing because of its fast convergence and good separation performance. From the observed signals, the algorithm can estimate well the original signals that are mutually statistically independent and have been mixed by unknown factors.
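The removal step that follows the decomposition can be sketched as below. The sketch assumes that an ICA routine has already produced a mixing matrix and the independent components, and that the noise components have already been identified; it performs neither FastICA itself nor the noise identification. The function name `reconstruct_without` and the nested-list matrix layout are assumptions for illustration.

```python
def reconstruct_without(mixing, sources, noise_components):
    """Rebuild the channel signals from all components except the noisy ones.

    mixing:  channels x components matrix (assumed output of an ICA step)
    sources: components x samples matrix of independent components
    noise_components: indices of components judged to be noise
    """
    keep = [k for k in range(len(sources)) if k not in noise_components]
    n_samples = len(sources[0]) if sources else 0
    cleaned = []
    for channel_weights in mixing:
        row = [
            sum(channel_weights[k] * sources[k][t] for k in keep)
            for t in range(n_samples)
        ]
        cleaned.append(row)
    return cleaned
```

Leaving a component out of the sum is equivalent to setting it to zero before mixing the components back into channel space.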
The dimensionality reduction unit 302 is used to perform spatial projection and dimensionality reduction on the denoised EEG signals;
Specifically, in the present embodiment the dimensionality reduction unit 302 uses the common spatial pattern (CSP) algorithm to project the denoised high-SNR EEG signals of the different channels according to their spatial positions and to reduce their dimensionality, yielding dimension-reduced EEG signals. In the present embodiment, the dimension-reduced EEG signals preferably have fewer than 300 dimensions.
The initialization unit 303 is used to initialize all phrases in the preprocessed corpus as term vector representations;
The construction unit 304 is used to traverse all phrases in the preprocessed corpus and, with the term vector representation of the current phrase as the feature, to predict the EEG signals of its context using a neural network regression model. The predicted EEG signals of the context are compared with the actual EEG signals to obtain the prediction error, and the term vector representation of the current phrase is adjusted according to the prediction error, where the actual EEG signals are the EEG signals produced while the labeler reads the context. This step is repeated until the prediction error is less than a preset threshold.
In the present embodiment the context window may be three. With the term vector representation of the current phrase as the feature, a neural network regression model predicts the EEG signals of the three preceding phrases and the three following phrases. The predicted EEG signals of the context are compared with the actual EEG signals to obtain the prediction error, and the error produced each time is back-propagated to adjust the vector representation of the current phrase until the overall error is less than a preset error threshold. The preset error threshold can be set empirically to 10^-5.
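The predict, compare and adjust loop can be illustrated with a deliberately simplified model. The sketch below replaces the neural network regression model with a single linear layer and uses a short toy vector as the target instead of real context EEG features; both simplifications, and the names `train_phrase_vector`, `learning_rate` and `max_steps`, are assumptions for illustration. It shows only how the prediction error is propagated back to the phrase vector until the error falls below a threshold.

```python
def train_phrase_vector(vector, weights, target,
                        learning_rate=0.05, threshold=1e-5, max_steps=10000):
    """Adjust `vector` (and `weights`) in place until the squared prediction
    error against `target` falls below `threshold`.

    vector:  term vector of the current phrase
    weights: rows of a linear layer standing in for the regression network
    target:  toy stand-in for the actual EEG features of the context
    """
    loss = float("inf")
    for _ in range(max_steps):
        # Forward pass: predict the context EEG features from the vector.
        predicted = [sum(w * v for w, v in zip(row, vector)) for row in weights]
        error = [p - t for p, t in zip(predicted, target)]
        loss = 0.5 * sum(e * e for e in error)
        if loss < threshold:
            break
        # Backward pass: gradient with respect to the phrase vector.
        grad_vector = [
            sum(weights[j][i] * error[j] for j in range(len(error)))
            for i in range(len(vector))
        ]
        # Update the layer weights, then the phrase vector itself.
        for j, row in enumerate(weights):
            for i in range(len(row)):
                row[i] -= learning_rate * error[j] * vector[i]
        for i in range(len(vector)):
            vector[i] -= learning_rate * grad_vector[i]
    return vector, loss
```

In a fuller implementation the same update would be applied for every phrase in the corpus and for every phrase in its context window, with the stopping test applied to the overall error.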
In conclusion the present invention is through the above scheme, collection module 10 collects text corpus, in text corpus Corpus is handled, and the corpus of the continuous phrase format as unit of phrase is obtained;Acquisition module 20 is by continuous phrase format Corpus is presented to labeler, reads for labeler, and acquisition labeler reads EEG signals when each phrase;Constructing module 30 will The corresponding EEG signals of collected phrase are predicted above and below it characterized by current phrase as prediction target, training term vector The EEG signals of text, constructing the term vector based on EEG signals indicates model, improves the accuracy of term vector calculating.
The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all in essence of the invention Made any modifications, equivalent replacements, and improvements etc., should all be included in the protection scope of the present invention within mind and principle.

Claims (10)

1. A term vector calculation method based on EEG signals, characterized in that the method comprises the following steps:
Step S1: collecting a text corpus and processing the corpus in the text corpus to obtain a corpus in continuous phrase format with the phrase as the unit;
Step S2: presenting the corpus in continuous phrase format to a labeler for reading, and acquiring the EEG signals produced while the labeler reads each phrase;
Step S3: taking the EEG signals corresponding to the collected phrases as the prediction target, training term vectors, predicting the EEG signals of the context of the current phrase with the term vector representation of the current phrase as the feature, and constructing an EEG-based term vector representation model.
2. The term vector calculation method based on EEG signals according to claim 1, characterized in that step S1 comprises the following sub-steps:
Step S11: collecting a text corpus, the corpus entries in the text corpus being at the sentence level or the document level;
Step S12: removing from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, to obtain a preprocessed corpus;
Step S13: performing word segmentation on the preprocessed corpus to obtain words;
Step S14: converting the words into phrases by means of chunk parsing, to obtain a corpus in continuous phrase format.
3. The term vector calculation method based on EEG signals according to claim 2, characterized in that step S3 comprises the following sub-steps:
Step S31: performing noise reduction on the collected EEG signals to obtain denoised EEG signals;
Step S32: performing spatial projection and dimensionality reduction on the denoised EEG signals;
Step S33: initializing all phrases in the preprocessed corpus as term vector representations;
Step S34: traversing all phrases in the preprocessed corpus; with the term vector representation of the current phrase as the feature, predicting the EEG signals of its context using a neural network regression model; comparing the predicted EEG signals of the context with the actual EEG signals to obtain a prediction error; and adjusting the term vector representation of the current phrase according to the prediction error, wherein the actual EEG signals are the EEG signals produced while the labeler reads the context; this step being repeated until the prediction error is less than a preset threshold.
4. The term vector calculation method based on EEG signals according to claim 3, characterized in that step S31 comprises:
processing the collected EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
and step S32 comprises:
performing spatial projection and dimensionality reduction, using the common spatial pattern algorithm, on the EEG signals whose signal-to-noise ratio is higher than the third preset value, to obtain EEG signals whose dimension is lower than a fourth preset value.
5. The term vector calculation method based on EEG signals according to claim 3, characterized in that the noise reduction performed on the collected EEG signals uses the FastICA algorithm.
6. A term vector computing device based on EEG signals, characterized in that the device comprises:
a collection module, for collecting a text corpus and processing the corpus in the text corpus to obtain a corpus in continuous phrase format with the phrase as the unit;
an acquisition module, for presenting the corpus in continuous phrase format to a labeler for reading and acquiring the EEG signals produced while the labeler reads each phrase;
a construction module, for taking the EEG signals corresponding to the collected phrases as the prediction target, training term vectors, predicting the EEG signals of the context of the current phrase with the term vector representation of the current phrase as the feature, and constructing an EEG-based term vector representation model.
7. The term vector computing device based on EEG signals according to claim 6, characterized in that the collection module comprises:
a collection unit, for collecting a text corpus, the corpus entries in the text corpus being at the sentence level or the document level;
a preprocessing unit, for removing from the text corpus the entries whose length exceeds a first preset value or is less than a second preset value, to obtain a preprocessed corpus;
a word segmentation unit, for performing word segmentation on the preprocessed corpus to obtain words;
a conversion unit, for converting the words into phrases by means of chunk parsing, to obtain a corpus in continuous phrase format.
8. The term vector computing device based on EEG signals according to claim 7, characterized in that the construction module comprises:
a noise reduction unit, for performing noise reduction on the collected EEG signals to obtain denoised EEG signals;
a dimensionality reduction unit, for performing spatial projection and dimensionality reduction on the denoised EEG signals;
an initialization unit, for initializing all phrases in the preprocessed corpus as term vector representations;
a construction unit, for traversing all phrases in the preprocessed corpus; with the term vector representation of the current phrase as the feature, predicting the EEG signals of its context using a neural network regression model; comparing the predicted EEG signals of the context with the actual EEG signals to obtain a prediction error; and adjusting the term vector representation of the current phrase according to the prediction error, wherein the actual EEG signals are the EEG signals produced while the labeler reads the context; this step being repeated until the prediction error is less than a preset threshold.
9. The term vector computing device based on EEG signals according to claim 8, characterized in that
the noise reduction unit is further used to process the collected EEG signals to obtain EEG signals whose signal-to-noise ratio is higher than a third preset value;
the dimensionality reduction unit is further used to perform spatial projection and dimensionality reduction, using the common spatial pattern algorithm, on the EEG signals whose signal-to-noise ratio is higher than the third preset value, to obtain EEG signals whose dimension is lower than a fourth preset value.
10. The term vector computing device based on EEG signals according to claim 8, characterized in that the noise reduction unit is further used to perform noise reduction on the collected EEG signals using the FastICA algorithm.
CN201610907518.2A 2016-10-18 2016-10-18 Term vector calculation method and device based on EEG signals Active CN106502394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610907518.2A CN106502394B (en) 2016-10-18 2016-10-18 Term vector calculation method and device based on EEG signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610907518.2A CN106502394B (en) 2016-10-18 2016-10-18 Term vector calculation method and device based on EEG signals

Publications (2)

Publication Number Publication Date
CN106502394A CN106502394A (en) 2017-03-15
CN106502394B true CN106502394B (en) 2019-06-25

Family

ID=58295164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610907518.2A Active CN106502394B (en) 2016-10-18 2016-10-18 Term vector calculation method and device based on EEG signals

Country Status (1)

Country Link
CN (1) CN106502394B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881665B (en) * 2020-09-27 2021-01-05 华南师范大学 Word embedding representation method, device and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881401A (en) * 2015-05-27 2015-09-02 大连理工大学 Patent literature clustering method
CN104965822A (en) * 2015-07-29 2015-10-07 中南大学 Emotion analysis method for Chinese texts based on computer information processing technology
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Characteristic extraction method of text classification on the basis of mutual information

Also Published As

Publication number Publication date
CN106502394A (en) 2017-03-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant