CN112541541A - Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion - Google Patents

Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion

Info

Publication number
CN112541541A
CN112541541A (application CN202011452285.4A)
Authority
CN
China
Prior art keywords
matrix
modal
matching
vector
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011452285.4A
Other languages
Chinese (zh)
Other versions
CN112541541B (en)
Inventor
李康
孔万增
金宣妤
唐佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011452285.4A priority Critical patent/CN112541541B/en
Publication of CN112541541A publication Critical patent/CN112541541A/en
Application granted granted Critical
Publication of CN112541541B publication Critical patent/CN112541541B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 18/00 Pattern recognition > G06F 18/20 Analysing > G06F 18/25 Fusion techniques
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions > G06F 17/10 Complex mathematical operations > G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F 40/00 Handling natural language data > G06F 40/20 Natural language analysis > G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/04 Architecture, e.g. interconnection topology > G06N 3/045 Combinations of networks
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Biophysics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Optimization (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion. The invention establishes direct correlations among multi-modal elements in a layered manner and can capture short-term and long-term dependencies between different modalities. To avoid reducing resolution and to preserve the original spatial structure information of each modality, the invention applies the corresponding attention weights in the form of a broadcast when selecting and emphasizing multi-modal information interactions. In addition, the invention provides a new tensor operator, called Ex-KR addition, which fuses multi-modal element information using shared information to obtain a global tensor representation. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field focus only on modeling in a local temporal-multimodal space and cannot explicitly learn a complete representation of all the modalities participating in the fusion.

Description

Lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion
Technical Field
The invention belongs to the field of multi-modal emotion analysis at the intersection of natural language processing, computer vision and speech signal processing, and particularly relates to a cross-modal multi-element hierarchical deep fusion technique.
Background
With recent progress in machine learning research, the analysis of multimodal time-series data has become an increasingly important research field. Multimodal learning aims at building neural networks that can process and integrate information from multiple modalities, and multimodal emotion analysis is a research sub-field of multimodal learning. In real life, when people express their emotions (negative or positive), several categories of information take part in the communication, including the spoken words (text modality), facial expressions and gestures (visual modality), and the prosodic features of the voice (acoustic modality). Video is an important source of multimodal data and can provide visual, acoustic and textual data simultaneously. For uniform presentation and ease of understanding, we refer to each word in the speech as a text element, each frame in the video as a visual element, and the prosodic features corresponding to each word as acoustic elements. A study of these three types of modality data shows that two types of element interactions are present at the same time: intra-modality element interactions and inter-modality element interactions. Moreover, over time, the inter-modality element interactions provide more complex complementary information across the time domain and the multimodal data. If this complementary information among the multimodal data is taken into account during analysis, a person's emotional state can be analyzed more reliably and accurately.
However, the heterogeneous properties of multimodal data, such as the proper alignment between elements and the short-term and long-term dependencies between different modalities, carry important complementary information and are also the main challenges researchers face when analyzing cross-modal information. Existing multimodal fusion methods typically focus on modeling in a local temporal-multimodal space and do not explicitly learn a complete multi-dimensional representation of all the modalities involved in the fusion. For example, some methods compress the time-series data of the three modalities into three vectors before multimodal fusion, so that the subsequent steps cannot perceive critical cross-modal timing information. Other approaches translate from a source modality to a target modality to learn a joint representation of the two modalities. However, such a "translation" method operates mainly between two modalities and must involve translation in both directions, so the joint representation has strong local features but lacks important global multimodal properties.
Disclosure of Invention
Aiming at the defects of the prior art, the invention introduces a lightweight multi-modal fusion method based on a multi-element hierarchical deep fusion technique. The method propagates and integrates local associations into a common global space through two different types of matching mechanisms. For the multi-modal element matching and fusion, a new tensor operator without additional parameters, called Ex-KR addition, is proposed; it combines local matrix representations using the common information of a specific modality to obtain a global tensor, enabling more lightweight modeling of multi-modal data.
The method comprises the following specific steps:
Step 1, collecting n modalities of data from a subject and converting each modality's data into matrix form; n ≥ 3.
Step 2, performing interactive modeling on each modality's data to obtain, for each modality, a matrix containing context information.
Step 3, taking the matrix of one modality as the central matrix, and performing element-interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices.
Step 4, generating weight vectors for the n-1 matching matrices obtained in step 3.
Step 5, updating the n-1 matching matrices with the n-1 weight vectors obtained in step 4.
Step 6, performing multi-modal element matching with the n-1 matching matrices to obtain an n-dimensional tensor P.
Step 7, computing the implicit matrix and weight vector of the central matrix, and updating the tensor P with the central matrix's weight vector to obtain the tensor P'.
Step 8, predicting the emotion distribution vector label from the tensor P'. The i-th element label[i] of the emotion distribution vector is the probability that the subject is in the i-th emotion; the subject's emotion during the multi-modal data acquisition is determined accordingly.
Preferably, in step 2, the matrix containing context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully-connected layer transformation; the matrix X is the matrix of a single modality obtained in step 1.
Preferably, in step 3 the matching matrix between a matrix H_x and the central matrix H_y is M_xy = H_x·W_1·H_y^T, where W_1 is the weight matrix obtained by adapting the matrices H_x and H_y.
Preferably, in step 4 the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy are expressed as follows:
R_x = tanh(W_x·M_xy^T + b_x)
α_x = softmax(w_x·R_x)
where tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to determine the implicit matrix R_x; w_x is the parameter matrix used to determine the weight vector α_x; b_x is the bias used to determine the implicit matrix R_x.
Preferably, in step 5 the matching matrix M_xy is updated to obtain the matching matrix M'_xy, whose i-th row is M'_xy[i,:] = M_xy[i,:]·α_x[i], where M_xy[i,:] is the i-th row of M_xy and α_x[i] is the i-th element of the vector α_x.
Preferably, in step 6 the tensor P obtained by multi-modal element matching of the matching matrices M'_xy and M'_zy has sub-vectors P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is the j-th column of M'_xy and M'_zy[k,j] is an element of M'_zy.
Preferably, in step 7 the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix by equation (11):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y.
The sub-matrices of the tensor P' are P'[:,:,i] = P[:,:,i]·α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
Preferably, the specific process of step 8 is as follows:
the vector p corresponding to the tensor P' is computed as in equation (13):
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector;
the emotion distribution vector label is then predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a weight matrix and b is a bias.
Preferably, in step 1 the modality data are of three types, namely a text modality, a visual modality and an acoustic modality.
The invention has the beneficial effects that:
1. the invention establishes direct correlation among multi-modal elements in a layered manner, and can capture short-term and long-term dependencies among different modalities. To avoid reducing resolution and preserve the original spatial structure information corresponding to each modality, the present invention applies corresponding attention weights in the form of broadcast when selecting and emphasizing multi-modal information interactions.
2. The invention also provides a new tensor operator, called Ex-KR addition, which fuses multi-modal element information using shared information to obtain a global tensor representation. This effectively addresses the problem that most methods in the current multi-modal emotion recognition field focus only on modeling in a local temporal-multimodal space and cannot explicitly learn a complete representation of all the modalities participating in the fusion.
Drawings
FIG. 1 is a schematic diagram of a lightweight multi-modal sentiment analysis method based on multi-element hierarchical depth fusion;
FIG. 2 is a schematic diagram of two different types of matching mechanisms in the present invention;
FIG. 3 is a schematic view of an attention generating module;
Detailed Description
The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is described in detail below with reference to the accompanying drawings.
Example 1
As shown in fig. 1, the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion specifically includes the following steps:
Step 1, data matrix representation of a single modality
The data of the modalities involved in the invention are denoted X, Y and Z. When the framework is used in practice, three types of modality data are used to sense the subject's emotional state: a text modality, a visual modality and an acoustic modality. We denote the text modality by Y, the visual modality by X and the acoustic modality by Z. The data of each modality is organized in the form of a two-dimensional matrix:
X ∈ R^(t_x×d_x), Y ∈ R^(t_y×d_y), Z ∈ R^(t_z×d_z)
where t_x, t_y and t_z are the numbers of elements of the three modalities and d_x, d_y and d_z are the feature lengths of the corresponding elements. Taking the text modality as an example, if the subject says excitedly "This movie is very appealing." (the subject is expressing a positive emotion), then the value of t_y is 8.
Step 2, interactive modeling between the elements of a single modality
Before multi-modal fusion, the two-dimensional matrix representation of each modality's raw data is transformed so that interactions between the elements within a single modality are established, i.e. the feature representation of each element contains the context of its neighboring elements. Here, the context of neighboring elements within a single modality is modeled with a bidirectional long short-term memory network layer (LSTM), which concatenates the forward and backward hidden-state representations. To further enrich the single-modality representation, the encoded features are transformed by a nonlinear feed-forward fully-connected mapping for the subsequent multi-modal fusion, yielding the matrices H_x, H_y and H_z containing the context information of the three modalities:
H_x = bi_FCN(E_x),  H_y = bi_FCN(E_y),  H_z = bi_FCN(E_z)    (1)
E_x = LSTM(X),  E_y = LSTM(Y),  E_z = LSTM(Z)    (2)
where LSTM(·) denotes the bidirectional long short-term memory network layer transformation and bi_FCN(·) denotes the nonlinear feed-forward fully-connected layer transformation.
To illustrate the computation of LSTM(·) more clearly, take the matrix M = [m_1, m_2, ..., m_T] ∈ R^(T×D) (where m_t ∈ R^D is a vector, t = 1, 2, ..., T) as an example:
i_t = σ(W_ii·m_t + b_ii + W_hi·h_(t-1) + b_hi)
f_t = σ(W_if·m_t + b_if + W_hf·h_(t-1) + b_hf)
g_t = tanh(W_ig·m_t + b_ig + W_hg·h_(t-1) + b_hg)
o_t = σ(W_io·m_t + b_io + W_ho·h_(t-1) + b_ho)
c_t = f_t ∗ c_(t-1) + i_t ∗ g_t
h_t = o_t ∗ tanh(c_t)
where h_t is the output at time t, c_t is the cell state at time t, m_t is the input at time t, and i_t, f_t, g_t and o_t are the input gate, forget gate, cell gate and output gate respectively. σ is the sigmoid function, ∗ is the Hadamard product, W_ii, W_hi, W_if, W_hf, W_ig, W_hg, W_io and W_ho are weight matrices, and b_ii, b_hi, b_if, b_hf, b_ig, b_hg, b_io and b_ho are bias vectors. We therefore have [h_1, h_2, ..., h_T] = LSTM(M).
To illustrate the computation of bi_FCN(·), again take the matrix M ∈ R^(T×D) as an example:
H = bi_FCN(M) = tanh(M·W_f1 + b_f1)·W_f2 + b_f2 ∈ R^(T×D')
where W_f1 and W_f2 are weight matrices and b_f1 and b_f2 are bias vectors.
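As an illustration of step 2, the following is a minimal PyTorch sketch of the single-modality encoder, a bidirectional LSTM followed by the two-layer fully-connected mapping bi_FCN; the class name ModalityEncoder and all layer sizes are illustrative assumptions rather than values from the patent.

```python
# Minimal sketch of the step-2 encoder (assumed PyTorch implementation).
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim, out_dim):
        super().__init__()
        # bidirectional LSTM: forward and backward hidden states are concatenated
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # bi_FCN: H = tanh(E * W_f1 + b_f1) * W_f2 + b_f2
        self.fc1 = nn.Linear(2 * hidden_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)

    def forward(self, x):                          # x: (batch, t, feat_dim)
        e, _ = self.lstm(x)                        # E = LSTM(X), shape (batch, t, 2*hidden_dim)
        return self.fc2(torch.tanh(self.fc1(e)))   # H = bi_FCN(E)

# example usage with random data standing in for the text modality Y
Y = torch.randn(1, 8, 300)                         # 8 words, 300-dim features (illustrative)
H_y = ModalityEncoder(300, 64, 50)(Y)              # (1, 8, 50) context-aware representation
```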
Step 3, element matching and fusion between two modalities
As shown in Fig. 2(a), the two single-modality feature representations H_x and H_y, each containing intra-modality element interactions, are joined to model the relationship between the two modalities. Applying a bilinear transformation to the vector feature representations from the two modalities achieves this element-matching fusion. The weight matrices W_1 and W_2 in the bilinear transformations represent context information between the modalities and ensure a more flexible coupling between two different modalities. By matching the vector feature representations of the elements of two modalities, the inter-modal element interactions are modeled to obtain the corresponding matching matrices M_xy and M_zy:
M_xy = H_x·W_1·H_y^T    (3)
M_zy = H_z·W_2·H_y^T    (4)
where the weight matrix W_1 is obtained by adapting the context information of the matrices H_x and H_y, and the weight matrix W_2 is obtained by adapting the context information of the matrices H_y and H_z.
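A minimal numpy sketch of this bilinear element matching, using the form M_xy = H_x·W_1·H_y^T reconstructed above; all sizes and the random stand-ins for H_x, H_y, H_z, W_1 and W_2 are illustrative assumptions.

```python
# Bilinear matching of step 3 for one sample (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_z, t_y, d = 20, 30, 8, 50          # element counts and feature length (illustrative)
H_x = rng.standard_normal((t_x, d))       # visual context matrix
H_y = rng.standard_normal((t_y, d))       # text (central) context matrix
H_z = rng.standard_normal((t_z, d))       # acoustic context matrix
W_1 = rng.standard_normal((d, d))         # learned in practice; random here
W_2 = rng.standard_normal((d, d))

M_xy = H_x @ W_1 @ H_y.T                  # (t_x, t_y) visual-text matching matrix
M_zy = H_z @ W_2 @ H_y.T                  # (t_z, t_y) acoustic-text matching matrix
```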
Step 4, generating the attention weight vectors of the matching matrices M_xy and M_zy
Based on the matching matrices M_xy and M_zy describing the interactions between two modalities, an attention mechanism computes the relative importance of each of their elements for a specific modality, so as to optimize the joint feature representation between the modalities. The computed importance values are expressed as a vector whose length equals the number of elements of the specific modality, as shown in Fig. 3. The implicit matrix R_x of the matching matrix M_xy and the implicit matrix R_z of the matching matrix M_zy are obtained as follows:
R_x = tanh(W_x·M_xy^T + b_x),  R_z = tanh(W_z·M_zy^T + b_z)    (5)
α_x = softmax(w_x·R_x),  α_z = softmax(w_z·R_z)    (6)
where α_x and α_z are the attention weight vectors for the specific modalities X and Z respectively, and their lengths equal the numbers of elements of modalities X and Z. tanh(·) denotes the hyperbolic tangent function and softmax(·) denotes the logistic regression function. W_x and W_z are the parameter matrices used to determine the implicit matrices R_x and R_z; w_x and w_z are the parameter matrices used to determine the weight vectors α_x and α_z; b_x and b_z are the biases used to determine the implicit matrices R_x and R_z.
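A minimal numpy sketch of the attention-generation module for M_xy (the computation for M_zy is analogous); the inner width k and the random parameter values are illustrative assumptions.

```python
# Attention generation of step 4 (numpy sketch).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
t_x, t_y, k = 20, 8, 16
M_xy = rng.standard_normal((t_x, t_y))    # matching matrix from step 3

W_x = rng.standard_normal((k, t_y))       # parameters; learned in practice
b_x = rng.standard_normal((k, 1))
w_x = rng.standard_normal(k)

R_x = np.tanh(W_x @ M_xy.T + b_x)         # implicit matrix, shape (k, t_x)
alpha_x = softmax(w_x @ R_x)              # attention weights, one per element of modality X
```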
Step 5, broadcasting the attention weights onto the matching matrices
Assigning attention weights to the corresponding values of the bimodal joint matching matrices lets the model focus on the important inter-modal interactions. However, the conventional way of averaging attention weights over different positions can reduce the resolution of the feature representation. To avoid this, the proposed fusion framework applies the attention weights in the form of a broadcast: the weight vectors α_x and α_z update the matching matrices M_xy and M_zy to yield the new matching matrices M'_xy and M'_zy:
M'_xy = attention_broadcast(M_xy, α_x), where M'_xy[i,:] = M_xy[i,:]·α_x[i]    (7)
M'_zy = attention_broadcast(M_zy, α_z), where M'_zy[i,:] = M_zy[i,:]·α_z[i]    (8)
where M'_xy[i,:] is the i-th row of M'_xy, M_xy[i,:] is the i-th row of M_xy, and α_x[i] is the i-th element of α_x; M_xy[i,:]·α_x[i] denotes the multiplication of a vector by a scalar. Similarly, M'_zy[i,:] is the i-th row of M'_zy, M_zy[i,:] is the i-th row of M_zy, and α_z[i] is the i-th element of α_z.
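A minimal numpy sketch of the broadcast of equations (7) and (8): each row of the matching matrix is scaled by its attention weight, so the matrix keeps its original resolution; the sizes and the stand-in weights are illustrative assumptions.

```python
# Attention broadcast of step 5 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_y = 20, 8
M_xy = rng.standard_normal((t_x, t_y))
alpha_x = rng.random(t_x)
alpha_x /= alpha_x.sum()                      # stand-in for the softmax attention weights

M_xy_prime = M_xy * alpha_x[:, None]          # M'_xy[i, :] = M_xy[i, :] * alpha_x[i]
```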
Step 6, exploiting multi-modal common information with Ex-KR addition
Because the softmax function is used for normalization in equation (6), some values in the attention weight vectors are very close to 0. When the attention vectors are broadcast onto the matching matrices, some values of the resulting matrices are therefore close to 0, which introduces a certain sparsity into the matching matrices. On the other hand, the two matching matrices M'_xy and M'_zy both contain information common to modality Y, i.e. each column of the two matching matrices corresponds to an element from modality Y, and this common information facilitates the deep fusion of multi-modal elements. To make better use of these matching properties, we propose a new tensor operator, Ex-KR addition. Through Ex-KR addition, multi-modal element matching centered on modality Y is realized using the common information in the two matching matrices, so that the multi-dimensional tensor P characterizing the multi-modal data fusion can be computed, as shown in Fig. 2(b):
P[:,k,j] = M'_xy[:,j] + M'_zy[k,j]    (9)
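Because each column of M'_xy and M'_zy corresponds to the same element of modality Y, Ex-KR addition can be written as a broadcast sum over that shared axis; a minimal numpy sketch with illustrative sizes follows. The operator introduces no learnable parameters, which is what keeps this fusion step lightweight.

```python
# Ex-KR addition of step 6 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t_x, t_z, t_y = 20, 30, 8
M_xy_prime = rng.standard_normal((t_x, t_y))   # attended visual-text matching matrix
M_zy_prime = rng.standard_normal((t_z, t_y))   # attended acoustic-text matching matrix

# P[:, k, j] = M'_xy[:, j] + M'_zy[k, j]  ->  global tensor of shape (t_x, t_z, t_y)
P = M_xy_prime[:, None, :] + M_zy_prime[None, :, :]
```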
Step 7, broadcasting the attention weight onto the multi-modal tensor
Using the same attention-generation module as in step 4, the implicit matrix R_y is obtained as in equation (10) and the attention weight vector α_y for the multi-modal multi-dimensional tensor as in equation (11). Applying this vector to the multi-dimensional tensor P in the form of a broadcast further increases the expressive capacity of the tensor; the updated tensor P' is given by equation (12):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
P' = attention_broadcast(P, α_y), where P'[:,:,i] = P[:,:,i]·α_y[i]    (12)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P ∈ R^(t_x×t_z×t_y); W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y.
Step 8, classification
Sum pooling is applied to the multi-dimensional tensor P' obtained through the broadcast attention computation, summarizing the information of the three modalities into a corresponding vector p as in equation (13). The vector p, as a multi-modal high-level representation, is used for the final emotion classification task:
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector.
The emotion distribution vector label is then predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a learnable weight matrix and b is a bias; the i-th element label[i] of the emotion distribution vector is the probability of the i-th predicted emotion.
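A minimal numpy sketch of step 8, where the mode-3 product with an all-ones vector reduces to a sum over the text-modality axis; the number of emotion classes d and the random classifier parameters W and b are illustrative assumptions.

```python
# Sum pooling and classification of step 8 (numpy sketch).
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
t_x, t_z, t_y, d = 20, 30, 8, 7
P_prime = rng.standard_normal((t_x, t_z, t_y))   # attended tensor from step 7

c = np.ones(t_y)                                 # all-ones vector
p = (P_prime @ c).reshape(-1)                    # vec(P' x_3 c), length t_x * t_z

W = rng.standard_normal((d, p.size))             # classifier parameters; learned in practice
b = rng.standard_normal(d)
label = softmax(W @ p + b)                       # label[i] = probability of the i-th emotion
```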
As shown in Table 1, the proposed multi-modal emotion fusion method is evaluated on the emotion-state discrimination task on two multi-modal emotion databases, CMU-MOSI and YouTube. MAE is the mean absolute error, CORR is the Pearson correlation coefficient, ACC-7 is the 7-class accuracy and ACC-3 is the 3-class accuracy. Comparing these metrics of the discrimination task, the proposed method outperforms the four compared multi-modal fusion methods on most indicators.
TABLE 1 Comparison of results (the table content is reproduced as an image in the original publication and is not available in text form)
Example 2
This example differs from Example 1 in the following way. When the lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is applied to data of four modalities (for example a text modality, a visual modality, an acoustic modality and electroencephalogram data), the matching matrices between pairs of modalities are computed first, and the matching matrices representing the local fusion of two modalities are then integrated with the proposed Ex-KR addition to obtain the multi-modal global tensor representation. To illustrate the computational flow, the four modalities involved in the fusion are denoted m1, m2, m3 and m4, and the data of each modality is organized in the form of a two-dimensional matrix.
As in equations (1) and (2), the bidirectional long short-term memory network layer transformation LSTM(·) and the nonlinear feed-forward fully-connected layer transformation bi_FCN(·) are used to obtain the context-information matrix representation of each single modality.
With modality m1 as the common modality, the same procedures as in steps 3, 4 and 5 are used to perform element matching between modalities and to apply the generated attention weight vectors to the matching matrices through the broadcast operation, yielding three matching matrices that all share modality m1.
Since these three matching matrices all contain the common information of modality m1, the multi-modal global tensor representation can be obtained with Ex-KR addition using this common information. The specific computation flow, sketched in the code after this example, is as follows:
First, the matching matrices of modalities m2 and m3 against m1 are fused by Ex-KR addition, analogously to equation (9), into a three-dimensional tensor P fusing the three modalities (m1, m2 and m3).
The tensor P is then unfolded along its third (m1) dimension into a matrix. Because this matrix and the matching matrix of modality m4 against m1 both contain the common information of modality m1, a further Ex-KR addition fuses them into a four-dimensional tensor Q.
The tensor Q computed in this way fuses the information of the four modalities by means of the common information of modality m1, and because Ex-KR addition contains no weight parameters that need to be learned, a more lightweight multi-modal fusion modeling is achieved.
Following the computation procedure of step 7, the attention vector of the tensor Q can be computed and applied to Q through the broadcast operation.
Finally, the sum-pooling operation described in step 8 and equation (14) are used to compute the corresponding emotion distribution vector label, realizing multi-modal emotion analysis based on multi-element hierarchical deep fusion of the four modalities' data.
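A minimal numpy sketch of this four-modality fusion with m1 as the shared modality; the matrix names M21, M31 and M41, the element counts, and the reshape-based unfolding are illustrative assumptions, not notation from the patent.

```python
# Four-modality Ex-KR fusion of Example 2 (numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
t1, t2, t3, t4 = 8, 20, 30, 12
M21 = rng.standard_normal((t2, t1))   # attended matching matrix of m2 against m1
M31 = rng.standard_normal((t3, t1))   # attended matching matrix of m3 against m1
M41 = rng.standard_normal((t4, t1))   # attended matching matrix of m4 against m1

# first Ex-KR addition: fuse m1, m2, m3 into a 3-D tensor
P = M21[:, None, :] + M31[None, :, :]            # (t2, t3, t1)

# unfold P while keeping the shared m1 axis as columns, then fuse with M41
P_mat = P.reshape(t2 * t3, t1)                   # (t2*t3, t1)
Q = P_mat[:, None, :] + M41[None, :, :]          # second Ex-KR addition, (t2*t3, t4, t1)
Q = Q.reshape(t2, t3, t4, t1)                    # 4-D global tensor over all four modalities
```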

Claims (9)

1. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion is characterized by comprising the following steps:
step 1, collecting n modalities of data from a subject and converting each modality's data into matrix form, with n ≥ 3;
step 2, performing interactive modeling on each modality's data to obtain, for each modality, a matrix containing context information;
step 3, taking the matrix of one modality as the central matrix, and performing element-interaction matching between each of the remaining n-1 modality matrices and the central matrix to obtain n-1 matching matrices;
step 4, generating weight vectors for the n-1 matching matrices obtained in step 3;
step 5, updating the n-1 matching matrices with the n-1 weight vectors obtained in step 4;
step 6, performing multi-modal element matching with the n-1 matching matrices to obtain an n-dimensional tensor P;
step 7, computing the implicit matrix and weight vector of the central matrix, and updating the tensor P with the central matrix's weight vector to obtain the tensor P';
step 8, predicting the emotion distribution vector label from the tensor P', wherein the i-th element label[i] of the emotion distribution vector is the probability that the subject is in the i-th emotion, and the subject's emotion during the multi-modal data acquisition is determined accordingly.
2. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 2, the matrix containing context information is H = bi_FCN(E), where the intermediate matrix E = LSTM(X); LSTM(·) denotes a bidirectional long short-term memory network layer transformation; bi_FCN(·) denotes a nonlinear feed-forward fully-connected layer transformation; the matrix X is the matrix of a single modality obtained in step 1.
3. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 3, the matching matrix between a matrix H_x and the central matrix H_y is M_xy = H_x·W_1·H_y^T, where W_1 is the weight matrix obtained by adapting the matrices H_x and H_y.
4. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 4, the implicit matrix R_x and the weight vector α_x of the matching matrix M_xy are expressed as follows:
R_x = tanh(W_x·M_xy^T + b_x)
α_x = softmax(w_x·R_x)
where tanh(·) denotes the hyperbolic tangent function; softmax(·) denotes the logistic regression function; W_x is the parameter matrix used to determine the implicit matrix R_x; w_x is the parameter matrix used to determine the weight vector α_x; b_x is the bias used to determine the implicit matrix R_x.
5. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 5, the matching matrix M_xy is updated to obtain the matching matrix M'_xy, whose i-th row is M'_xy[i,:] = M_xy[i,:]·α_x[i], where M_xy[i,:] is the i-th row of M_xy and α_x[i] is the i-th element of the vector α_x.
6. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 6, the tensor P obtained by multi-modal element matching of the matching matrices M'_xy and M'_zy has sub-vectors P[:,k,j] = M'_xy[:,j] + M'_zy[k,j], where M'_xy[:,j] is the j-th column of M'_xy and M'_zy[k,j] is an element of M'_zy.
7. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 7, the implicit matrix R_y of the central matrix is given by equation (10) and the weight vector α_y of the central matrix by equation (11):
R_y = tanh(W_y·P^T + b_y)    (10)
α_y = softmax(w_y·R_y)    (11)
where the matrix P is the unfolding matrix of the multi-dimensional tensor P; W_y is the parameter matrix used to determine the implicit matrix R_y; w_y is the parameter matrix used to determine the weight vector α_y; b_y is the bias used to determine the implicit matrix R_y;
the sub-matrices of the tensor P' are P'[:,:,i] = P[:,:,i]·α_y[i], where P[:,:,i] is a sub-matrix of the tensor P and α_y[i] is an element of the weight vector α_y.
8. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: the specific process of step 8 is as follows:
the vector p corresponding to the tensor P' is computed as in equation (13):
p = vec(P' ×_3 c)    (13)
where vec(·) denotes the vectorization of a matrix, ×_3 denotes the mode-3 product, and c is an all-ones vector;
the emotion distribution vector label is predicted from the vector p as in equation (14):
label = softmax(W·p + b) ∈ R^d    (14)
where W is a weight matrix and b is a bias.
9. The lightweight multi-modal emotion analysis method based on multi-element hierarchical depth fusion according to claim 1, characterized in that: in step 1, the modality data are of three types, namely a text modality, a visual modality and an acoustic modality.
CN202011452285.4A 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion Active CN112541541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011452285.4A CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Publications (2)

Publication Number Publication Date
CN112541541A (en) 2021-03-23
CN112541541B (en) 2024-03-22

Family

ID=75018379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011452285.4A Active CN112541541B (en) 2020-12-10 2020-12-10 Lightweight multi-modal emotion analysis method based on multi-element layering depth fusion

Country Status (1)

Country Link
CN (1) CN112541541B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968A (en) * 2021-04-06 2021-07-02 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325120A (en) * 2018-09-14 2019-02-12 江苏师范大学 A kind of text sentiment classification method separating user and product attention mechanism
CN111012307A (en) * 2019-11-26 2020-04-17 清华大学 Method and device for evaluating training input degree of patient based on multi-mode information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064968A (en) * 2021-04-06 2021-07-02 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN113064968B (en) * 2021-04-06 2022-04-19 齐鲁工业大学 Social media emotion analysis method and system based on tensor fusion network
CN115019237A (en) * 2022-06-30 2022-09-06 中国电信股份有限公司 Multi-modal emotion analysis method and device, electronic equipment and storage medium
CN115019237B (en) * 2022-06-30 2023-12-08 中国电信股份有限公司 Multi-mode emotion analysis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112541541B (en) 2024-03-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant