CN115565540B - Invasive brain-computer interface Chinese pronunciation decoding method - Google Patents


Info

Publication number
CN115565540B
CN115565540B (application CN202211545924.0A)
Authority
CN
China
Prior art keywords
hyperbolic
data
Chinese pronunciation
space
computer interface
Prior art date
Legal status
Active
Application number
CN202211545924.0A
Other languages
Chinese (zh)
Other versions
CN115565540A (en)
Inventor
祁玉
谭显瀚
王跃明
张建民
朱君明
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211545924.0A
Publication of CN115565540A
Application granted
Publication of CN115565540B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/04: Speech or audio analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an invasive brain-computer interface Chinese pronunciation decoding method, which comprises the following steps: screening effective neurons from the electroencephalogram data, removing neurons with high similarity, and, after standardization, labeling the electroencephalogram data with the synchronized audio data; projecting the electroencephalogram data into a hyperbolic space according to the characteristics of Chinese pronunciation electroencephalogram data; constructing a hyperbolic neural network and a hyperbolic multiple logistic regression classifier to classify the Chinese phonemes in the electroencephalogram data; during training, extracting a certain number of triples from the training data, calculating a hierarchical clustering loss on these triples from the output features of the network, and adding it to the total loss function to be optimized with a certain weight; and decoding with the trained hyperbolic neural network and hyperbolic multiple logistic regression classifier. By introducing the hyperbolic space and the hyperbolic decoding method, the method better exploits the structural characteristics of Chinese pronunciation electroencephalogram data and effectively improves their classification and decoding performance.

Description

Invasive brain-computer interface Chinese pronunciation decoding method
Technical Field
The invention relates to the field of electroencephalogram data decoding, in particular to an invasive brain-computer interface Chinese pronunciation decoding method.
Background
The invasive brain-computer interface uses high-resolution intracortical electrical signals recorded by implanted electrodes to identify the state and intent of the brain, thereby assisting clinical patients in performing a variety of tasks. In recent years, the application of invasive brain-computer interfaces to speech has developed rapidly. Advanced speech brain-computer interfaces can already synthesize speech directly, or decode phonemes, words and sentences, from brain electrical signals, which means that invasive speech brain-computer interfaces have great potential for restoring the communication ability of aphasic patients.
In general, a speech brain-computer interface treats pronunciation as a motor process and decodes neural signals into speech with oral articulation kinematics as an intermediate link. One approach converts the electroencephalogram signals recorded from the motor cortex into the oral articulation movements made during speaking, and then converts those movements into speech. With the help of machine learning methods such as deep networks, some speech brain-computer interfaces instead learn decoders end to end, generating speech waveforms directly from brain electrical signals.
For example, Chinese patent publication No. CN111681636A discloses a brain-computer-interface-based speech generation method, which collects electroencephalogram signals reflecting brain activity, external audio signals and video image signals, extracts features, performs nonlinear computation and learning through several neural networks, adds external context information and feedback input, directly decodes the intention and language content expressed by the brain from the brain signals, and finally completes speech generation through a neural network, thereby realizing brain-computer interface speech generation.
However, decoding speech directly from neural signals faces the problem of a limited vocabulary: before the speech brain-computer interface can be constructed, the subject must repeatedly speak the words of the vocabulary to train the decoder, which is time-consuming. On the other hand, phonemes are the basic sound units of pronunciation, and their number is typically much smaller than the number of words. With accurate phoneme recognition, words could in principle be decoded freely by combination. But it is difficult to decode speech phonemes accurately from neural signals. From a kinematic perspective, the kinematics of speech is a combination of orofacial movements involving the lips, tongue, jaw and other articulators. Phonemes with similar kinematics are therefore often confused and difficult to distinguish, which lowers the overall phoneme classification performance. How to accurately decode speech phonemes from neural signals remains a challenging problem.
More importantly, there has so far been no application of, or research on, brain-computer interfaces for Chinese pronunciation. How to design an algorithm tailored to the characteristics of Chinese pronunciation, achieve good classification and decoding performance, and thereby construct an efficient speech brain-computer interface, remains an open question.
Disclosure of Invention
The invention provides an invasive brain-computer interface Chinese pronunciation decoding method, which can effectively improve the classification decoding performance of Chinese pronunciation electroencephalogram data.
An invasive brain-computer interface Chinese pronunciation decoding method comprises the following steps:
(1) Collecting electroencephalogram data of Chinese pronunciation and synchronized audio data; screening effective neurons from the electroencephalogram data, removing neurons with high similarity, and standardizing the electroencephalogram data; marking the utterance onsets on the electroencephalogram data using the synchronized audio data, and intercepting data segments of a fixed window length, each data segment corresponding to one Chinese phoneme;
(2) Projecting the electroencephalogram data processed in the step (1) into a hyperbolic space, and forming training data by the electroencephalogram data and corresponding Chinese phonemes in the hyperbolic space;
(3) Constructing a hyperbolic neural network and a hyperbolic multiple logistic regression classifier; the hyperbolic neural network is used for extracting the features of the electroencephalogram data in the hyperbolic space, and the hyperbolic multiple logistic regression classifier is used for classifying Chinese phonemes for the features of the electroencephalogram data;
(4) Training a hyperbolic neural network and a hyperbolic multiple logistic regression classifier;
during training, a certain number of triples are extracted from the training data, a hierarchical clustering loss is calculated on the triples from the output features of the hyperbolic neural network, and this loss is added to the total loss function to be optimized with a certain weight;
(5) And projecting the electroencephalogram data to be decoded to a hyperbolic space, and then sequentially inputting the data to the trained hyperbolic neural network and hyperbolic multiple logistic regression classifier to obtain the decoded Chinese phoneme classification.
According to the hierarchical classification structure of phonemes in Chinese pronunciation, and the hierarchy of articulation position and articulation manner present in Chinese pronunciation electroencephalogram signals, the hyperbolic neural network is constructed to better learn the characteristics of the Chinese pronunciation electroencephalogram, and the logit vectors are obtained through the hyperbolic multiple logistic regression classifier. Meanwhile, a hierarchical clustering constraint is applied to the logit vectors, encouraging the model to better mine the hierarchical structure of the data and thus learn a better representation, which effectively improves the classification and decoding performance on Chinese pronunciation electroencephalogram data.
Preferably, in step (1), the neurons are screened off-line. Screening effective neurons from the electroencephalogram data and removing neurons with high similarity specifically comprises:
performing spike sorting, extracting the firing of all neurons in the electroencephalogram signal, and plotting the waveforms; visually inspecting the firing waveform of each neuron, and retaining neurons with a clear waveform and a total firing count greater than 100; and calculating the cosine similarity between the firing of different neurons, retaining only one neuron when the similarity between several neurons is greater than 0.7, so as to reduce the influence of crosstalk on data quality.
When the data are standardized, the mean is subtracted from each original value and the result is divided by the standard deviation, so that the obtained data follow a normal distribution with mean 0 and standard deviation 1.
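As an illustrative sketch only (not part of the claimed method), the screening and standardization above can be written as follows; the (units x time-bins) spike-count layout and the helper name `screen_and_normalize` are assumptions for illustration:

```python
import numpy as np

def screen_and_normalize(firing, min_spikes=100, sim_thresh=0.7):
    """Illustrative sketch of step (1): drop weakly firing units, drop
    near-duplicate units by cosine similarity, then z-score each unit.
    `firing` is assumed to be a (units, time_bins) spike-count matrix."""
    # Keep units whose total firing count exceeds the threshold.
    keep = [i for i in range(firing.shape[0]) if firing[i].sum() > min_spikes]
    firing = firing[keep]

    # Greedily drop near-duplicates: if a unit's cosine similarity with an
    # already-kept unit exceeds the threshold, discard it (reduces crosstalk).
    norms = np.clip(np.linalg.norm(firing, axis=1, keepdims=True), 1e-12, None)
    sim = (firing / norms) @ (firing / norms).T
    selected = []
    for i in range(firing.shape[0]):
        if all(sim[i, j] <= sim_thresh for j in selected):
            selected.append(i)
    firing = firing[selected]

    # Z-score: subtract the mean and divide by the standard deviation,
    # giving each unit mean 0 and standard deviation 1.
    mu = firing.mean(axis=1, keepdims=True)
    sd = np.clip(firing.std(axis=1, keepdims=True), 1e-12, None)
    return (firing - mu) / sd
```

In this sketch a duplicated unit is removed by the cosine test, and a silent unit by the firing-count test, before the z-scoring.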
Preferably, the utterance onsets are marked using the synchronized audio data, and data segments with a [-500 ms, +1500 ms] window centered on each utterance onset are intercepted for subsequent training and validation.
In the step (2), the Poincaré disc model $\mathbb{D}_c^d$ is adopted to project the electroencephalogram data into the hyperbolic space:

$$\mathbb{D}_c^d = \{\, x \in \mathbb{R}^d : c\,\|x\|^2 < 1 \,\}$$

$$g_x^{\mathbb{D}} = (\lambda_x^c)^2\, g^E$$

$$\lambda_x^c = \frac{2}{1 - c\,\|x\|^2}$$

wherein $\mathbb{D}_c^d$ represents a hyperbolic space with curvature $c$ and dimension $d$; $x$ represents a data point; $\mathbb{R}^d$ represents the Euclidean real space of dimension $d$; $\|x\|$ represents the norm of $x$; $g^E$ and $g^{\mathbb{D}}$ represent the Euclidean metric and the hyperbolic metric, respectively; and $\lambda_x^c$ represents the conformal factor relating the two metrics.
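The projection into the Poincaré disc is commonly realized with the exponential map at the origin; the following minimal sketch assumes this choice (the projection formula itself is an assumption consistent with the model above, not quoted from the patent):

```python
import numpy as np

def exp0(v, c=1.0):
    """Exponential map at the origin of the Poincaré ball with curvature c:
    exp_0^c(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    Maps a Euclidean feature vector into the ball {x : c * ||x||^2 < 1}."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    sc = np.sqrt(c)
    return np.tanh(sc * norm) * v / (sc * norm)
```

The map preserves the direction of the vector and compresses its length with tanh, so every Euclidean point lands strictly inside the disc.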
In the step (3), the hyperbolic neural network is expressed as:

$$f^{\otimes_c}(x) = \exp_0^c\!\left( f\left( \log_0^c(x) \right) \right)$$

$$\exp_0^c(v) = \tanh\!\left( \sqrt{c}\,\|v\| \right) \frac{v}{\sqrt{c}\,\|v\|}$$

$$\log_0^c(y) = \tanh^{-1}\!\left( \sqrt{c}\,\|y\| \right) \frac{y}{\sqrt{c}\,\|y\|}$$

wherein $f^{\otimes_c}$ and $f$ represent the hyperbolic neural network function and the Euclidean neural network function, respectively; $\exp_0^c$ and $\log_0^c$ represent the exponential map and the logarithmic map at the origin, respectively; $c$ represents the curvature of the hyperbolic space; $x$ represents a data point; and $\|x\|$ represents the norm of $x$.
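A minimal numeric sketch of one hyperbolic layer $f^{\otimes_c}$, using a fixed Euclidean affine map with a tanh nonlinearity as the inner function $f$ (an illustrative assumption; a trainable version would use a Riemannian optimization library):

```python
import numpy as np

def exp0(v, c=1.0):
    # Exponential map at the origin of the Poincaré ball.
    n = np.linalg.norm(v)
    if n == 0.0:
        return v
    sc = np.sqrt(c)
    return np.tanh(sc * n) * v / (sc * n)

def log0(y, c=1.0):
    # Logarithmic map at the origin (inverse of exp0).
    n = np.linalg.norm(y)
    if n == 0.0:
        return y
    sc = np.sqrt(c)
    return np.arctanh(sc * n) * y / (sc * n)

def hyperbolic_layer(x, W, b, c=1.0):
    """f^{otimes_c}(x) = exp_0^c(f(log_0^c(x))) with a Euclidean affine-plus-tanh f.
    Sketch only, not the patent's exact architecture."""
    v = log0(x, c)            # pull the point into the tangent space at 0
    h = np.tanh(W @ v + b)    # ordinary Euclidean computation in tangent space
    return exp0(h, c)         # push the result back into the hyperbolic space
```

The round trip exp0(log0(x)) recovers x, and the layer output always remains inside the ball.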
In step (3), when the hyperbolic multiple logistic regression classifier classifies the Chinese phonemes, given the classes $z$, the probability of each class is calculated as:

$$p(y = z \mid x) \propto \exp\left( \frac{\lambda_{p_z}^c \|a_z\|}{\sqrt{c}} \sinh^{-1}\!\left( \frac{2\sqrt{c}\,\langle -p_z \oplus_c x,\, a_z \rangle}{\left(1 - c\,\|{-p_z} \oplus_c x\|^2\right) \|a_z\|} \right) \right)$$

wherein $p_z$ and $a_z$ are the parameters of the hyperbolic multiple logistic regression; $\lambda_{p_z}^c$ represents the conformal factor at the classification boundary of the class $z$; $\sinh^{-1}$ represents the inverse hyperbolic sine function; $\exp$ represents the exponential function with the natural constant $e$ as base; $\|a_z\|$ represents the norm of $a_z$; $\oplus_c$ represents the Möbius addition operation; $c$ represents the curvature of the hyperbolic space; and $\langle \cdot, \cdot \rangle$ represents the inner product operation.
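A sketch of the class-$z$ logit computation for a single point (a Ganea-style hyperbolic multinomial logistic regression; the exact parameterization is an assumption consistent with the symbols $p_z$, $a_z$ and $\lambda_{p_z}^c$, not quoted from the patent):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    # Möbius addition x (+)_c y in the Poincaré ball.
    xy = float(x @ y); x2 = float(x @ x); y2 = float(y @ y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def hyperbolic_mlr_logit(x, p_z, a_z, c=1.0):
    """Logit of class z for a point x in the Poincaré ball:
    (lambda_{p_z}^c * ||a_z|| / sqrt(c)) *
      arcsinh(2*sqrt(c)*<(-p_z)(+)x, a_z> / ((1 - c*||(-p_z)(+)x||^2) * ||a_z||))."""
    sc = np.sqrt(c)
    q = mobius_add(-p_z, x, c)                   # (-p_z) (+)_c x
    lam = 2.0 / (1.0 - c * float(p_z @ p_z))     # conformal factor at p_z
    na = np.linalg.norm(a_z)
    arg = 2 * sc * float(q @ a_z) / ((1 - c * float(q @ q)) * na)
    return lam * na / sc * np.arcsinh(arg)
```

The logit changes sign with the side of the classification boundary on which the point lies.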
In the step (4), the formula of the total loss function is:

$$L = \alpha L_{cls} + \beta L_{HC}$$

wherein $L_{cls}$ represents the classification loss, $L_{HC}$ represents the hierarchical clustering loss, and $\alpha$ and $\beta$ are coefficients balancing the two parts of the loss function.

The classification loss is calculated as follows:

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{y}_{i,\, y_i}$$

wherein $y_i$ is the category label of sample $x_i$, $\hat{y}_i$ is the post-softmax probability vector of $x_i$ (so that $\log \hat{y}_{i,\, y_i}$ is the log probability of the true class), and $N$ is the amount of data in the mini-batch.
The hierarchical clustering loss is calculated as follows:

$$L_{HC} = \sum_{(i,j,k)} \left( w_{ij} + w_{ik} + w_{jk} - \left[ w_{ij}, w_{ik}, w_{jk} \right] \sigma\!\left( \left[ d_o(z_{ij}), d_o(z_{ik}), d_o(z_{jk}) \right] \right)^{\top} \right)$$

$$w_{uv} = 1 - \frac{d(x_u, x_v)}{d(x_i, x_j) + d(x_i, x_k) + d(x_j, x_k)}, \qquad uv \in \{ij, ik, jk\}$$

wherein $\sigma$ represents the normalized softmax function; $(i, j, k)$ represents a triple extracted from the training data; $z_{ij}$, $z_{ik}$ and $z_{jk}$ represent the lowest common ancestor nodes of the pairs $(i, j)$, $(i, k)$ and $(j, k)$ in the triple, respectively; $d_o$ represents the hyperbolic distance to the center of the hyperbolic space; $w_{ij}$, $w_{ik}$ and $w_{jk}$ represent the hyperbolic similarities between the corresponding pairs in the triple; and $\top$ represents matrix transposition.

When the hyperbolic similarities are calculated, a certain number of triples $(i, j, k)$ are sampled by a random sampling method; the pairwise hyperbolic distances $d_{ij}$, $d_{ik}$ and $d_{jk}$ are calculated, and each is divided by the sum of the three for normalization, giving $\tilde{d}$; the similarity is then expressed as $w = 1 - \tilde{d}$. When the hierarchical clustering loss is calculated, the logit layer of the hyperbolic multiple logistic regression classifier is selected for triple sampling and hierarchical clustering.
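The similarity normalization and one triple's loss term can be sketched as follows (illustrative only; the softmax over lowest-common-ancestor depths follows the HypHC-style formulation described above, and the $w = 1 - \tilde{d}$ similarity is the normalization just stated):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    # Möbius addition in the Poincaré ball with curvature c.
    xy = float(x @ y); x2 = float(x @ x); y2 = float(y @ y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)

def poincare_dist(u, v, c=1.0):
    # d_c(u, v) = (2/sqrt(c)) * artanh(sqrt(c) * ||(-u) (+)_c v||)
    sc = np.sqrt(c)
    return 2.0 / sc * np.arctanh(sc * np.linalg.norm(mobius_add(-u, v, c)))

def triple_similarities(xi, xj, xk, c=1.0):
    # Pairwise hyperbolic distances normalized by their sum; similarity w = 1 - d~.
    d = np.array([poincare_dist(xi, xj, c),
                  poincare_dist(xi, xk, c),
                  poincare_dist(xj, xk, c)])
    return 1.0 - d / d.sum()          # (w_ij, w_ik, w_jk)

def hc_loss_term(w, d_lca):
    # One triple's loss: sum(w) - [w] . softmax([d_o(z_ij), d_o(z_ik), d_o(z_jk)])
    s = np.exp(d_lca - d_lca.max())
    s /= s.sum()                      # normalized softmax over LCA depths
    return float(w.sum() - w @ s)
```

Because the three normalized distances sum to one, the three similarities of any triple always sum to two.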
Compared with the prior art, the invention has the following beneficial effects:
the hyperbolic neural network is applied to the classification and decoding of Chinese pronunciation electroencephalogram signals; the neural representation of Chinese pronunciation is classified in a hyperbolic space, the hierarchical characteristics of Chinese pronunciation and of the signal representation are taken into account, and the hierarchical structure of the neural representation of the phonemes is constrained by the hierarchical clustering loss. The results show that the model learns an interpretable hierarchical phoneme embedding from the electroencephalogram signal, and the phoneme decoding performance is significantly improved.
Drawings
FIG. 1 is a timing diagram illustrating an experimental paradigm of a data set in accordance with an embodiment of the present invention.
FIG. 2 is a visualization of spike firing after the different Chinese initials in the data set are grouped by articulation position, in the embodiment of the present invention.
FIG. 3 is a graph comparing the classification accuracy with and without the method of the present invention.
FIG. 4 is a comparison of the distributions obtained after visualizing the two-dimensional multiple logistic regression classification boundaries learned with and without the method of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
In the data collection phase, neural signals were recorded from the left primary motor cortex of a paralyzed patient through two implanted 96-channel Utah intracortical microelectrode arrays (Blackrock Microsystems, Salt Lake City, UT, USA). The neural signals were sampled at 30 kHz using the NeuroPort system (NSP, Blackrock Microsystems). During the experiment, the audio signal was recorded simultaneously with a microphone placed in front of the patient and digitized by the NeuroPort system through an analog input port at 30 kHz. The embodiment designs three tasks for Chinese pronunciation: a pronunciation task with 21 different initials, a pronunciation task with 24 different finals, and a pronunciation task with 20 different Chinese characters. The experimental paradigm for data acquisition is shown in FIG. 1. Specifically, in each trial, the subject was asked to view a red phoneme prompt on a computer screen one meter in front of him and to listen to an audio prompt of that phoneme. After one second, the phoneme on the screen turned green, indicating the start of the "start" phase, and the subject then spoke the prompted phoneme. To ensure that the subject had sufficient reaction time to complete the trial, the "start" phase lasted 3 seconds. After the "start" phase ended, the recording of one trial was complete, and the recording of the next trial then began.
The invention provides an invasive brain-computer interface Chinese pronunciation decoding method, which specifically comprises the following steps:
step 1, preprocessing electroencephalogram data.
An experimental paradigm for Chinese pronunciation is designed, and electroencephalogram data and synchronized audio data of Chinese pronunciation are collected; effective neurons are screened from the electroencephalogram data, neurons with high similarity are removed, the data are standardized, the electroencephalogram data are labeled using the synchronized audio data, and data segments of a suitable window length are then intercepted to obtain the preprocessed electroencephalogram data.
Specifically, spike sorting (spike sorting) is first performed to extract the firing of all neurons in the electroencephalogram signal and to plot the waveforms. The firing waveform of each neuron is visually inspected, and neurons with a clear waveform and a total firing count greater than 100 are retained. The cosine similarity between the firing of different neurons is then calculated, and when the similarity between several neurons is greater than 0.7, only one of them is retained.
When the data are standardized, the mean is subtracted from each original value and the result is divided by the standard deviation, so that the obtained data follow a normal distribution with mean 0 and standard deviation 1.
The utterance onsets are then marked using the synchronized audio data, and a data segment with a [-500 ms, +1500 ms] window centered on each utterance onset is intercepted for subsequent training and validation.
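The [-500 ms, +1500 ms] epoching can be sketched as follows; the 1 ms binning and the (units, time_bins) array layout are illustrative assumptions, not stated in the patent:

```python
import numpy as np

def extract_epochs(spikes, onsets_ms, pre_ms=500, post_ms=1500, bin_ms=1):
    """Cut fixed [-500 ms, +1500 ms] windows around each utterance onset.
    `spikes` is assumed to be a (units, time_bins) matrix binned at `bin_ms`,
    with onset times given in milliseconds."""
    pre, post = pre_ms // bin_ms, post_ms // bin_ms
    epochs = []
    for t in onsets_ms:
        i = t // bin_ms
        if i - pre >= 0 and i + post <= spikes.shape[1]:  # skip truncated windows
            epochs.append(spikes[:, i - pre:i + post])
    return np.stack(epochs)  # (trials, units, window_bins)
```

Each retained trial yields a window of 2000 bins at 1 ms resolution.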
And 2, projecting the electroencephalogram data into a hyperbolic space.
A hyperbolic space is a non-Euclidean space with negative curvature everywhere. In a hyperbolic space, volume grows exponentially with the distance from the center of the space. This means that hyperbolic spaces are well suited to modeling data with a tree structure or hierarchy, since the number of nodes of a tree also grows exponentially with its depth. Visual analysis of the Chinese pronunciation electroencephalogram signals, as shown in FIG. 2, reveals that they have a certain hierarchical structure, related to the articulation manner and the articulation position. This means that a hyperbolic space can be used to model Chinese pronunciation electroencephalogram signals.
This example uses the currently most common and best-performing hyperbolic space model, the Poincaré disc model $\mathbb{D}_c^d$, to project the electroencephalogram data into the hyperbolic space:

$$\mathbb{D}_c^d = \{\, x \in \mathbb{R}^d : c\,\|x\|^2 < 1 \,\}$$

$$g_x^{\mathbb{D}} = (\lambda_x^c)^2\, g^E$$

$$\lambda_x^c = \frac{2}{1 - c\,\|x\|^2}$$

wherein $c$ represents the curvature of the hyperbolic space, $d$ represents its dimension, $g^E$ and $g^{\mathbb{D}}$ represent the Euclidean metric and the hyperbolic metric, respectively, and $\lambda_x^c$ represents the conformal factor relating the two metrics.
And 3, constructing a hyperbolic neural network to extract features, and classifying the Chinese pronunciation by using a hyperbolic multiple logistic regression classifier.
A reasonable network structure is constructed according to the characteristics of Chinese pronunciation electroencephalogram data, namely a small data volume and a large dimensionality. The hyperbolic neural network is a version of the Euclidean neural network whose vector and matrix operations are executed in the hyperbolic space. Since vector and matrix operations are too complex to perform directly in a non-Euclidean space, the tangent space of the hyperbolic space is used for an approximate computation. The tangent space of a hyperbolic space has the properties of a Euclidean space, so the data only need to be projected onto the tangent space; after the vector and matrix operations are executed there, the data are projected back into the hyperbolic space. The conversion between the tangent space and the original space is accomplished using the exponential and logarithmic maps of the gyrovector space. In this way, the representation of the hyperbolic neural network is obtained:

$$f^{\otimes_c}(x) = \exp_0^c\!\left( f\left( \log_0^c(x) \right) \right)$$

$$\exp_0^c(v) = \tanh\!\left( \sqrt{c}\,\|v\| \right) \frac{v}{\sqrt{c}\,\|v\|}$$

$$\log_0^c(y) = \tanh^{-1}\!\left( \sqrt{c}\,\|y\| \right) \frac{y}{\sqrt{c}\,\|y\|}$$

wherein $f^{\otimes_c}$ and $f$ represent the hyperbolic neural network function and the Euclidean neural network function, respectively; $\exp_0^c$ and $\log_0^c$ represent the exponential map and the logarithmic map at the origin, respectively; and $c$ represents the curvature of the hyperbolic space.
Considering that the data volume of the Chinese pronunciation electroencephalogram signals is small, a 2-layer hyperbolic neural network structure is selected in the actual construction, with 256 and 128 neurons in the two layers, respectively.
Similar to the hyperbolic neural network, hyperbolic multiple logistic regression is a version of Euclidean multiple logistic regression that performs its operations in the hyperbolic space.
In particular, given a sample $x \in \mathbb{D}_c^d$, the logits of the different classes are obtained by the hyperbolic multiple logistic regression method, calculated as:

$$p(y = z \mid x) \propto \exp\left( \frac{\lambda_{p_z}^c \|a_z\|}{\sqrt{c}} \sinh^{-1}\!\left( \frac{2\sqrt{c}\,\langle -p_z \oplus_c x,\, a_z \rangle}{\left(1 - c\,\|{-p_z} \oplus_c x\|^2\right) \|a_z\|} \right) \right)$$

wherein $p_z$ and $a_z$ are the parameters of the hyperbolic multiple logistic regression; $\lambda_{p_z}^c$ represents the conformal factor at the classification boundary of the class $z$; $\sinh^{-1}$ represents the inverse hyperbolic sine function; $\exp$ represents the exponential function with the natural constant $e$ as base; $\|a_z\|$ represents the norm of $a_z$; $\oplus_c$ represents the Möbius addition operation; $c$ represents the curvature of the hyperbolic space; and $\langle \cdot, \cdot \rangle$ represents the inner product operation.
The Möbius addition is an operation of the gyrovector space, derived through the exponential and logarithmic maps, and is calculated as:

$$x \oplus_c y = \frac{\left(1 + 2c\langle x, y\rangle + c\|y\|^2\right) x + \left(1 - c\|x\|^2\right) y}{1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2}$$
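As an illustrative sketch, the Möbius addition can be implemented directly from the formula; the checks below verify two gyrogroup identities (x plus the origin gives x, and (-x) plus x gives the origin) that follow from it:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Möbius addition x (+)_c y in the Poincaré ball with curvature c,
    computed term by term from the closed-form expression."""
    xy = float(x @ y); x2 = float(x @ x); y2 = float(y @ y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c**2 * x2 * y2)
```

The operation also keeps points inside the ball, which is what makes it usable as the "addition" of the hyperbolic classifier.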
and 4, training the hyperbolic neural network and the hyperbolic multiple logistic regression classifier.
In the optimization process, the hyperbolic RSGD (Riemannian stochastic gradient descent) method is used for parameter optimization and updating. Considering the small data volume of the electroencephalogram signals, the model is trained and tested with the leave-one-out method: each time, a single trial is used as the test set and all remaining trials are used as the training set. In the training process, a hierarchical clustering constraint on the feature representation, based on triple similarity, is added to the hyperbolic neural network.
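The leave-one-out protocol described above can be sketched as a simple index generator (an illustrative helper, not quoted from the patent):

```python
def leave_one_out_splits(n_trials):
    """Leave-one-out protocol of step (4): each trial serves once as the
    test set while all the other trials form the training set."""
    for test_idx in range(n_trials):
        yield [i for i in range(n_trials) if i != test_idx], test_idx
```

With N trials this yields N train/test splits, so every trial is tested exactly once.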
(4-1) Selecting a suitable similarity calculation method: since the hyperbolic model is used to extract the features, the hyperbolic distance can be adopted directly to calculate the similarity.
Given two points $u$ and $v$ on the Poincaré disc, the hyperbolic distance between them is calculated as:

$$d_c(u, v) = \frac{2}{\sqrt{c}}\, \tanh^{-1}\!\left( \sqrt{c}\, \|{-u} \oplus_c v\| \right)$$

wherein $\|\cdot\|$ represents the Euclidean norm of a vector.
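A sketch of the hyperbolic distance, built on the Möbius addition (`mobius_add` is repeated here so the snippet is self-contained):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    # Möbius addition in the Poincaré ball with curvature c.
    xy = float(x @ y); x2 = float(x @ x); y2 = float(y @ y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / \
           (1 + 2 * c * xy + c**2 * x2 * y2)

def poincare_dist(u, v, c=1.0):
    """Hyperbolic distance on the Poincaré disc:
    d_c(u, v) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-u) (+)_c v||)."""
    sc = np.sqrt(c)
    return 2.0 / sc * np.arctanh(sc * np.linalg.norm(mobius_add(-u, v, c)))
```

For a point at the origin this reduces to 2 artanh(||v||) when c = 1, and the distance is symmetric in its arguments.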
Triples $(i, j, k)$, 20 to 50 in number, are sampled by a random sampling method; the pairwise hyperbolic distances $d_{ij}$, $d_{ik}$ and $d_{jk}$ are calculated, and each is divided by the sum of the three for normalization, giving $\tilde{d}_{ij}$, $\tilde{d}_{ik}$ and $\tilde{d}_{jk}$; the similarity can then be expressed as $w = 1 - \tilde{d}$.
(4-2) Selecting a suitable clustering position: the hierarchical clustering loss is calculated directly on the logit vectors and added, with a certain weight, to the overall loss to be optimized, so that the classification and clustering objectives are optimized simultaneously. The following overall loss function is obtained:

$$L = \alpha L_{cls} + \beta L_{HC}$$

wherein $L_{cls}$ represents the classification loss, $L_{HC}$ represents the hierarchical clustering loss, and $\alpha$ and $\beta$ are coefficients balancing the two parts of the loss function.
For the multi-class classification task, given $N$ samples $x_i$, each belonging to one of $K$ categories with corresponding label $y_i$, where $y_i \in \{1, \dots, K\}$, the classification loss $\mathcal{L}_{cls}$ can be expressed as:

$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)$$

wherein $y_i$ is the category label of $x_i$ and $\log p(y_i \mid x_i)$ is the log-probability of $y_i$ after softmax.
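This is the usual cross-entropy over softmax probabilities; a self-contained sketch (logits of shape `(N, K)`, integer labels):

```python
import numpy as np

def classification_loss(logits, labels):
    """Mean negative log-probability of the true class after softmax,
    matching the L_cls term above."""
    logits = np.asarray(logits, dtype=float)
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = len(labels)
    return -log_probs[np.arange(n), labels].mean()
```

With uniform logits over two classes the loss is $\log 2$; a confidently correct prediction drives it toward zero.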
For the hierarchical clustering loss, a certain number of triples are randomly sampled from the data and the loss is computed on them; its goal is to make nodes with higher similarity merge earlier in the hierarchical clustering tree. It is computed as follows:

$$\mathcal{L}_{hc} = \sum_{(i,j,k)} \left[ w_{ij} + w_{ik} + w_{jk} - w_{(i,j,k)} \right]$$

$$w_{(i,j,k)} = (w_{ij}, w_{ik}, w_{jk}) \cdot \sigma_\tau\left(d_o(z_{i\vee j}),\ d_o(z_{i\vee k}),\ d_o(z_{j\vee k})\right)^{\top}$$

wherein $\sigma_\tau$ denotes a normalized softmax function, $(i, j, k)$ denotes a triple extracted from the data, $z_{i\vee j}$ denotes the smallest common ancestor node of $i$ and $j$ in the clustering tree (and analogously for $z_{i\vee k}$ and $z_{j\vee k}$), $d_o$ denotes the hyperbolic distance to the center of the hyperbolic space, and $w_{ij}$ denotes the hyperbolic similarity between $i$ and $j$.
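A sketch of the per-triple objective, assuming the three similarities and the three ancestor-to-origin distances are already computed (deriving the ancestor embeddings themselves is omitted). The temperature `tau` is an assumption; the scaled softmax softly selects the pair whose common ancestor lies farthest from the origin, i.e. the pair that merges earliest, so the loss is minimized when that pair is also the most similar one:

```python
import numpy as np

def softmax_scaled(v, tau=0.1):
    """Temperature-scaled softmax sigma_tau over a small vector."""
    v = np.asarray(v, dtype=float) / tau
    e = np.exp(v - v.max())
    return e / e.sum()

def triple_hc_loss(w, d_o, tau=0.1):
    """Hierarchical clustering loss for one triple (i, j, k).

    w   = (w_ij, w_ik, w_jk): hyperbolic similarities within the triple.
    d_o = distances of the three smallest-common-ancestor embeddings
          (z_{i v j}, z_{i v k}, z_{j v k}) to the origin.
    """
    w = np.asarray(w, dtype=float)
    w_triple = float(np.dot(w, softmax_scaled(d_o, tau)))
    return float(w.sum() - w_triple)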
To optimize clustering and classification simultaneously, triple sampling and hierarchical clustering are performed at the logit layer.
Step 5: testing and applying the hyperbolic neural network and the hyperbolic multiple logistic regression classifier.
After training, whether each test sample is classified correctly is recorded. When all tests are finished, the number of correctly classified samples is divided by the total amount of data to obtain the classification accuracy.
For comparison, experiments were conducted on the same data set with the same network structure under three different spatial metrics; the results, shown in fig. 3, confirm that the proposed feature learning framework performs best in hyperbolic space. The three sub-graphs show the classification results for 21 Chinese initial pronunciations, 24 Chinese final pronunciations and 20 Chinese character pronunciations, respectively; in each case the performance of the framework in hyperbolic space is clearly superior to that in Euclidean space and spherical space.
To show that the learning framework can mine the latent hierarchy of the data and learn features with stronger phonetic characteristics, the multivariate logistic regression classification boundaries learned by the network were analyzed visually, as shown in fig. 4. The left sub-graph shows the classification boundaries learned with hierarchical clustering optimization and the right sub-graph those learned without it; different colors denote different classes of Chinese initials. After hierarchical clustering optimization is added, the learned classification boundaries are more dispersed overall, while the boundaries of initials sharing the same place of articulation cluster together.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. An invasive brain-computer interface Chinese pronunciation decoding method is characterized by comprising the following steps:
(1) Collecting electroencephalogram data of Chinese pronunciation and synchronous audio data, screening effective neurons from the electroencephalogram data, removing the neurons with high similarity, and standardizing the electroencephalogram data; marking the time node of the sound production on the electroencephalogram data by using the synchronous audio data, and intercepting data segments with fixed window length, wherein each data segment corresponds to a Chinese phoneme;
(2) Projecting the electroencephalogram data processed in the step (1) into a hyperbolic space, and forming training data by the electroencephalogram data and corresponding Chinese phonemes in the hyperbolic space;
(3) Constructing a hyperbolic neural network and a hyperbolic multiple logistic regression classifier; the hyperbolic neural network is used for extracting the features of the electroencephalogram data in the hyperbolic space, and the hyperbolic multiple logistic regression classifier is used for classifying Chinese phonemes for the features of the electroencephalogram data;
(4) Training a hyperbolic neural network and a hyperbolic multiple logistic regression classifier;
in the training process, a certain number of triples are extracted from the training data, the hierarchical clustering loss is computed on these triples based on the output features of the hyperbolic neural network, and this loss is added, with a certain weight, to the total loss function to be optimized;
(5) And projecting the electroencephalogram data to be decoded to a hyperbolic space, and then sequentially inputting the data to the trained hyperbolic neural network and hyperbolic multiple logistic regression classifier to obtain the decoded Chinese phoneme classification.
2. The invasive brain-computer interface Chinese pronunciation decoding method of claim 1, wherein in step (1), screening effective neurons from the electroencephalogram data and removing highly similar neurons specifically comprises:
spike sorting is performed first, the firing of every neuron in the electroencephalogram signals is extracted, and the waveforms are plotted; the firing waveform of each neuron is inspected, and neurons with a distinct waveform and a total firing count greater than 100 are kept;
cosine similarity is calculated between the firings of different neurons, and when the similarity of several neurons is greater than 0.7, only one of them is retained, so as to reduce the influence of crosstalk on data quality.
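The deduplication step can be sketched as a greedy pass over per-neuron firing vectors; the `(n_neurons, n_bins)` layout and the keep-the-first strategy are assumptions of this sketch:

```python
import numpy as np

def deduplicate_neurons(firing, threshold=0.7):
    """Keep only one neuron from any group whose firing vectors have
    cosine similarity above `threshold` (greedy, first-kept wins)."""
    firing = np.asarray(firing, dtype=float)
    norms = np.linalg.norm(firing, axis=1, keepdims=True)
    unit = firing / np.where(norms == 0, 1, norms)  # unit vectors
    kept = []
    for i in range(len(firing)):
        # keep neuron i only if it is not too similar to any kept neuron
        if all(float(np.dot(unit[i], unit[j])) <= threshold for j in kept):
            kept.append(i)
    return kept
```

Two near-identical firing vectors collapse to one kept index, while an orthogonal neuron survives.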
3. The invasive brain-computer interface Chinese pronunciation decoding method of claim 1, wherein in step (2), a Poincare disc model $\mathbb{D}_c^d$ is used to project the electroencephalogram data into the hyperbolic space:

$$\mathbb{D}_c^d = \{\, x \in \mathbb{R}^d : c\|x\|^2 < 1 \,\}$$

$$g_x^{\mathbb{D}} = (\lambda_x^c)^2\, g^E$$

$$\lambda_x^c = \frac{2}{1 - c\|x\|^2}$$

wherein $\mathbb{D}_c^d$ denotes the hyperbolic space with curvature $c$ and dimension $d$; $x$ denotes a data point; $\mathbb{R}^d$ denotes the Euclidean real space of dimension $d$; $\lambda_x^c$ denotes the conformal factor of $x$; $g^E$ and $g^{\mathbb{D}}$ denote the Euclidean metric and the hyperbolic metric, respectively; and $\lambda_x^c$ expresses the conformality between the two metrics.
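A minimal sketch of the conformal factor from claim 3, valid only for points inside the ball ($c\|x\|^2 < 1$); the default curvature is an assumption:

```python
import numpy as np

def conformal_factor(x, c=1.0):
    """lambda_x^c = 2 / (1 - c * ||x||^2): the factor relating the
    Euclidean and hyperbolic metrics at point x of the Poincare ball."""
    x = np.asarray(x, dtype=float)
    sq = c * float(np.dot(x, x))
    if sq >= 1.0:
        raise ValueError("point lies outside the Poincare ball")
    return 2.0 / (1.0 - sq)
```

At the origin the factor is exactly 2, and it grows without bound as the point approaches the boundary of the disc.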
4. The invasive brain-computer interface Chinese pronunciation decoding method according to claim 1, wherein in step (3), the hyperbolic neural network is represented as:

$$f^{\otimes_c}(x) = \exp_0^c\left(f\left(\log_0^c(x)\right)\right)$$

$$\exp_0^c(v) = \tanh\left(\sqrt{c}\,\|v\|\right)\frac{v}{\sqrt{c}\,\|v\|}$$

$$\log_0^c(y) = \operatorname{artanh}\left(\sqrt{c}\,\|y\|\right)\frac{y}{\sqrt{c}\,\|y\|}$$

wherein $f^{\otimes_c}$ and $f$ denote the hyperbolic neural network function and the Euclidean neural network function, respectively; $\exp_0^c$ and $\log_0^c$ denote the exponential map and the logarithmic map at the origin; $c$ denotes the curvature of the hyperbolic space; $x$ denotes a data point; and $\|x\|$ denotes the norm of $x$.
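A sketch of the two maps and the lifting construction from claim 4; curvature handling and the zero-vector guard are assumptions of this sketch:

```python
import numpy as np

def exp0(v, c=1.0):
    """Exponential map at the origin: tangent vector -> Poincare ball."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    if n == 0:
        return v
    sc = np.sqrt(c)
    return np.tanh(sc * n) * v / (sc * n)

def log0(y, c=1.0):
    """Logarithmic map at the origin: Poincare ball -> tangent space."""
    y = np.asarray(y, dtype=float)
    n = np.linalg.norm(y)
    if n == 0:
        return y
    sc = np.sqrt(c)
    return np.arctanh(sc * n) * y / (sc * n)

def hyperbolic_layer(f, x, c=1.0):
    """Lift a Euclidean function f to the ball: exp0(f(log0(x)))."""
    return exp0(f(log0(x, c)), c)
```

The two maps are mutual inverses at the origin, the exponential map always lands inside the unit ball for $c = 1$, and lifting the identity function leaves ball points unchanged.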
5. The invasive brain-computer interface Chinese pronunciation decoding method of claim 1, wherein in step (3), when the hyperbolic multiple logistic regression classifier performs Chinese phoneme classification, given class $z$, the probability of each class is computed as:

$$p(y = z \mid x) \propto \exp\left(\frac{\lambda_{p_z}^c \|a_z\|}{\sqrt{c}} \operatorname{arcsinh}\left(\frac{2\sqrt{c}\,\langle -p_z \oplus_c x,\ a_z\rangle}{\left(1 - c\|{-p_z \oplus_c x}\|^2\right)\|a_z\|}\right)\right)$$

wherein $p_z$ and $a_z$ are the parameters of the hyperbolic multivariate logistic regression; $\lambda_{p_z}^c$ denotes the conformal factor of the classification boundary of class $z$; $\operatorname{arcsinh}$ denotes the inverse hyperbolic sine function; $\exp$ denotes the exponential function with base $e$; $\|\cdot\|$ denotes the norm; $\oplus_c$ denotes the Mobius addition operation; $c$ denotes the curvature of the hyperbolic space; and $\langle\cdot,\cdot\rangle$ denotes the inner product operation.
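A sketch of the per-class score inside the softmax of claim 5; the parameter names `p_z` (ball point) and `a_z` (direction) follow the claim's symbols, while the self-contained Mobius helper and default curvature are assumptions:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y on the Poincare ball of curvature c."""
    xy = float(np.dot(x, y))
    x2 = float(np.dot(x, x))
    y2 = float(np.dot(y, y))
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def hyperbolic_mlr_logit(x, p_z, a_z, c=1.0):
    """Unnormalized score of class z for a point x in the Poincare ball,
    following the claim-5 formula; softmax over classes gives p(y=z|x)."""
    x = np.asarray(x, dtype=float)
    p_z = np.asarray(p_z, dtype=float)
    a_z = np.asarray(a_z, dtype=float)
    d = mobius_add(-p_z, x, c)                         # -p_z (+)_c x
    lam = 2.0 / (1.0 - c * float(np.dot(p_z, p_z)))    # conformal factor at p_z
    na = np.linalg.norm(a_z)
    sc = np.sqrt(c)
    arg = 2 * sc * float(np.dot(d, a_z)) / ((1 - c * float(np.dot(d, d))) * na)
    return (lam * na / sc) * np.arcsinh(arg)
```

With $p_z$ at the origin the score vanishes for $x = 0$ and is positive for points on the same side of the boundary as $a_z$.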
6. The invasive brain-computer interface Chinese pronunciation decoding method of claim 1, wherein in step (4), the formula of the total loss function is:

$$\mathcal{L} = \alpha\,\mathcal{L}_{cls} + \beta\,\mathcal{L}_{hc}$$

wherein $\mathcal{L}_{cls}$ denotes the classification loss; $\mathcal{L}_{hc}$ denotes the hierarchical clustering loss; and $\alpha$ and $\beta$ are coefficients that balance the two parts of the loss function.
7. The invasive brain-computer interface Chinese pronunciation decoding method of claim 6, wherein the classification loss is computed as follows:

$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i)$$

wherein $y_i$ is the category label of sample $x_i$; $\log p(y_i \mid x_i)$ is the log-probability of $y_i$ after softmax; and $N$ denotes the amount of data in a mini-batch.
8. The invasive brain-computer interface Chinese pronunciation decoding method of claim 6, wherein the hierarchical clustering loss is computed as follows:

$$\mathcal{L}_{hc} = \sum_{(i,j,k)} \left[ w_{ij} + w_{ik} + w_{jk} - w_{(i,j,k)} \right]$$

$$w_{(i,j,k)} = (w_{ij}, w_{ik}, w_{jk}) \cdot \sigma_\tau\left(d_o(z_{i\vee j}),\ d_o(z_{i\vee k}),\ d_o(z_{j\vee k})\right)^{\top}$$

wherein $\sigma_\tau$ denotes a normalized softmax function; $(i, j, k)$ denotes a triple extracted from the training data; $z_{i\vee j}$ denotes the smallest common ancestor node of $i$ and $j$ in the cluster tree, $z_{i\vee k}$ that of $i$ and $k$, and $z_{j\vee k}$ that of $j$ and $k$; $d_o$ denotes the hyperbolic distance to the center of the hyperbolic space; $w_{ij}$, $w_{ik}$ and $w_{jk}$ denote the hyperbolic similarities between the corresponding pairs in the triple; and $\top$ denotes matrix transposition.
9. The invasive brain-computer interface Chinese pronunciation decoding method of claim 8, wherein, in performing the hyperbolic similarity calculation, a random sampling method is used to sample a certain number of triples $(x_i, x_j, x_k)$; the pairwise hyperbolic distances $d_{ij}$, $d_{ik}$, $d_{jk}$ are computed and normalized by dividing each by the sum of the three, giving $\tilde d_{ij}$, $\tilde d_{ik}$, $\tilde d_{jk}$; the similarity is then expressed as $w_{ij} = 1 - \tilde d_{ij}$ (and analogously for $w_{ik}$ and $w_{jk}$).
10. The invasive brain-computer interface chinese pronunciation decoding method of claim 8, wherein during hierarchical clustering loss calculation, triple sampling and hierarchical clustering are performed on a logit layer of a hyperbolic multiple logistic regression classifier.
CN202211545924.0A 2022-12-05 2022-12-05 Invasive brain-computer interface Chinese pronunciation decoding method Active CN115565540B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545924.0A CN115565540B (en) 2022-12-05 2022-12-05 Invasive brain-computer interface Chinese pronunciation decoding method


Publications (2)

Publication Number Publication Date
CN115565540A CN115565540A (en) 2023-01-03
CN115565540B (en) 2023-04-07

Family

ID=84770115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545924.0A Active CN115565540B (en) 2022-12-05 2022-12-05 Invasive brain-computer interface Chinese pronunciation decoding method

Country Status (1)

Country Link
CN (1) CN115565540B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117851769B (en) * 2023-11-30 2024-06-21 浙江大学 Chinese character writing decoding method for invasive brain-computer interface
CN117958765B (en) * 2024-04-01 2024-06-21 华南理工大学 Multi-mode voice viscera organ recognition method based on hyperbolic space alignment

Citations (2)

Publication number Priority date Publication date Assignee Title
CN113031766A (en) * 2021-03-15 2021-06-25 哈尔滨工业大学 Method for decoding Chinese pronunciation through electroencephalogram
CN113589937A (en) * 2021-08-04 2021-11-02 浙江大学 Invasive brain-computer interface decoding method based on twin network kernel regression

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
JPH0993135A (en) * 1995-09-26 1997-04-04 Victor Co Of Japan Ltd Coder and decoder for sound data
CN102789594B (en) * 2012-06-28 2014-08-13 南京邮电大学 Voice generation method based on DIVA neural network model
CN111681636B (en) * 2020-06-16 2022-02-18 深圳市华创技术有限公司 Technical term sound generation method based on brain-computer interface, medical system and terminal

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN113031766A (en) * 2021-03-15 2021-06-25 哈尔滨工业大学 Method for decoding Chinese pronunciation through electroencephalogram
CN113589937A (en) * 2021-08-04 2021-11-02 浙江大学 Invasive brain-computer interface decoding method based on twin network kernel regression

Also Published As

Publication number Publication date
CN115565540A (en) 2023-01-03

Similar Documents

Publication Publication Date Title
Jahangir et al. Deep learning approaches for speech emotion recognition: State of the art and research challenges
CN110516696B (en) Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression
CN115565540B (en) Invasive brain-computer interface Chinese pronunciation decoding method
US20170358306A1 (en) Neural network-based voiceprint information extraction method and apparatus
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN109979436A (en) A kind of BP neural network speech recognition system and method based on frequency spectrum adaptive method
Sahu et al. Modeling feature representations for affective speech using generative adversarial networks
Ling An acoustic model for English speech recognition based on deep learning
Rybicka et al. End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors.
CN112466284B (en) Mask voice identification method
Anjos et al. Detection of voicing and place of articulation of fricatives with deep learning in a virtual speech and language therapy tutor
Wu et al. Speech synthesis with face embeddings
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Yang et al. Speech emotion analysis of netizens based on bidirectional lstm and pgcdbn
CN113282718B (en) Language identification method and system based on self-adaptive center anchor
Shome et al. Speaker Recognition through Deep Learning Techniques: A Comprehensive Review and Research Challenges
CN114882888A (en) Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN115145402A (en) Intelligent toy system with network interaction function and control method
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN112951270B (en) Voice fluency detection method and device and electronic equipment
Singh Speaker emotion Recognition System using Artificial neural network classification method for brain-inspired application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant