CN117958765A

CN117958765A - Multi-mode voice viscera organ recognition method based on hyperbolic space alignment

Info

Publication number: CN117958765A
Application number: CN202410386135.XA
Authority: CN
Inventors: 文贵华; 王昶
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2024-04-01
Filing date: 2024-04-01
Publication date: 2024-05-03
Anticipated expiration: 2044-04-01
Also published as: CN117958765B

Abstract

The invention relates to a multi-mode voice viscera organ recognition method based on hyperbolic space alignment, which comprises the steps of obtaining human voice signals and corresponding text signals; respectively extracting characteristics of the voice signal and the text signal to obtain audio characteristics and text characteristics; mapping the audio features and the text features into a hyperbolic geometric space, determining the hyperbolic distance between the mapped audio features and text features, and aligning the audio features and the text features by taking the hyperbolic distance as similarity; sequentially carrying out cross attention fusion and feature stitching on the aligned audio features and text features to obtain human voice features; and the viscera and organs are identified according to the voice characteristics of the human body. The invention combines the multi-mode and hyperbolic space feature alignment, so that the features are aligned in the hyperbolic space before fusion, and the accuracy of multi-mode voice viscera organ recognition is improved.

Description

Multi-mode voice viscera organ recognition method based on hyperbolic space alignment

Technical Field

The invention relates to the technical field of viscera and organs identification, in particular to a multi-mode voice viscera and organs identification method based on hyperbolic space alignment.

Background

In the theory of traditional Chinese medicine, the diagnosis is a very important means, but voice plays a very important role as an important medical information source.

According to the theory of traditional Chinese medicine, the diseases of viscera and organs of a human body often show changes in the aspects of qi, blood, channels and collaterals and the like of the body, so that corresponding sound expression is generated; secondly, text information in the voice can indirectly reflect the state and symptomatic description of the patient, and additional information is provided; in addition, voice is a non-invasive way of examination without causing any pain or discomfort to the patient, so it is safe and harmless to select a voice way to obtain medical information.

However, at present, recognition of viscera by speech requires abundant experience of Chinese medicine specialists, and accurate recognition is difficult for doctors and non-professionals with insufficient experience, so that assistance of intelligent tools is required.

However, there are currently very few studies using speech for automatic visceral recognition. Therefore, the application provides a multi-mode voice viscera organ recognition method based on hyperbolic space alignment.

Disclosure of Invention

In view of the above, in order to overcome the shortcomings of the prior art, the present invention provides a multi-modal voice viscera organ recognition method, specifically, the multi-modal voice and the text are mutually aligned and supplemented to obtain more accurate and comprehensive results.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a multi-modal voice viscera organ recognition method based on hyperbolic space alignment comprises,

Acquiring a human voice signal and a corresponding text signal;

respectively extracting characteristics of the voice signal and the text signal to obtain audio characteristics and text characteristics;

Mapping the audio features and the text features into a hyperbolic geometric space, determining the hyperbolic distance between the mapped audio features and text features, and aligning the audio features and the text features by taking the hyperbolic distance as similarity;

Sequentially carrying out cross attention fusion and feature stitching on the aligned audio features and text features to obtain human voice features;

The viscera and organs are identified according to the voice characteristics of the human body.

Preferably, preprocessing is performed before feature extraction of the voice signal and the text signal, including:

Resampling the speech signal at 16 kHz and fixing the audio length to 30 seconds, calculating a log mel-spectrogram using a window of 25 milliseconds and a step size of 10 milliseconds to generate an 80-dimensional audio frame feature representation,

And word segmentation is carried out on the text signals, so that each input text is limited to 128 words, and after word segmentation, the marks are added with a beginning and a space before and after marking, so that text characteristic representation taking words as units is obtained.

Preferably, the audio frame feature representation is feature extracted with a low-language encoder and the text feature representation is feature extracted with a bi-directional encoder; wherein,

The low-language encoder is formed by overlapping 6 layers of transformer networks, and the bidirectional encoder is formed by overlapping 12 layers of transformer networks. Each layer of transformer is obtained by a multi-head attention mechanism and a feedforward neural network in series, namely each layer of transformer network comprises a self-attention layer and a multi-layer perceptron which are serially connected in sequence;

The low-language encoder and the bi-directional encoder ensure that the audio features and the text features extracted by the module are more robust through upstream training of a large amount of audio data and text data. Experiments prove that through the pre-trained transformer network, compared with manual characteristics of manual design, abundant high-dimensional semantic characteristics can be obtained.

Preferably, the audio features mapped into the hyperbolic geometry space are represented as follows:

the text features mapped into the hyperbolic geometry are represented as follows:

Where X _hyai and X _hyti represent audio and text features, respectively, in hyperbolic space, a represents audio feature, t represents text feature, tanh (·) represents hyperbolic tangent function, and c represents sphere negative curvature.

Preferably, the hyperbolic distance between the mapped audio feature and the text feature is obtained by the following formula;

Where S (a, t) represents the hyperbolic distance of the mapped audio feature X _hyai and text feature X _hyti, and S (t, a) represents the hyperbolic distance of the mapped text feature X _hyti and audio feature X _hyai.

Preferably, the process of aligning the audio feature and the text feature with the hyperbolic distance as the similarity includes: taking the hyperbolic distance as the similarity of the characteristics, carrying out similarity normalization by using the following formula:

in the method, in the process of the invention, Representing the similarity of the ith audio to the text,/>Representing similarity of the ith text audio, N representing batch size, ɛ being temperature coefficient, a learnable parameter, and m representing similarity interval;

determining audio-text contrast loss according to cross entropy, and optimizing hyperbolic distance, wherein the calculation is as follows:

Where y ^a2t (a) and y ^t2a (t) represent the single thermal labels generated within a batch, the label for the positive pair of samples is 1 and the label for the negative pair of samples is 0.

The comparative losses are expressed as follows:

Preferably, the process of performing cross attention fusion and feature stitching on the aligned audio features and text features sequentially includes:

cross attention fusion is carried out on the aligned audio features and text features, and a first fusion feature is obtained;

cross attention fusion is carried out on the aligned text features and the aligned audio features, and a second fusion feature is obtained;

And splicing the first fusion characteristic and the second fusion characteristic to obtain the human voice characteristic.

Preferably, the human voice features are input into a classifier for viscera organ recognition, wherein the loss function of the classifier is as follows:

in the method, in the process of the invention, Y _i represents the probability of the ith viscera, representing the actual label of the input viscera.

Further, the classification loss and the alignment loss are used as a loss function to optimize the multi-modal voice viscera organ recognition method based on hyperbolic space alignment.

According to the technical scheme, the invention discloses a multi-mode voice viscera organ recognition method based on hyperbolic space alignment, and compared with the prior art, the multi-mode voice viscera organ recognition method combines multi-mode and hyperbolic space characteristics, so that the characteristics are aligned in the hyperbolic space before fusion, and the accuracy of multi-mode voice viscera organ recognition can be effectively improved; the invention provides a new thought for the recognition of viscera and organs by combining deep learning with knowledge in the field of traditional Chinese medicine, and can stably and rapidly recognize viscera and organ types corresponding to voice through a computer program, thereby overcoming the limitation brought by relying on the experience of expert traditional Chinese medicine.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a multi-modal voice viscera organ recognition method based on hyperbolic spatial alignment in the invention;

FIG. 2 is a schematic diagram of a multi-modal speech visceral organ recognition process based on hyperbolic spatial alignment in accordance with the present invention;

FIG. 3 is a schematic diagram of a speech feature extraction model according to the present invention;

fig. 4 is a schematic diagram of a text feature extraction model structure according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, the multi-modal voice viscera organ recognition method based on hyperbolic space alignment disclosed in the embodiment of the invention comprises the following steps,

Acquiring a human voice signal and a corresponding text signal;

Mapping the audio features and the text features into hyperbolic geometric spaces respectively, determining hyperbolic distances between the mapped audio features and the mapped text features, and aligning the audio features and the mapped text features by taking the hyperbolic distances as similarity;

In one embodiment, as shown in figure 2,

Step one, collecting human voice signals by using recording equipment, and converting the voice signals into texts by using the existing voice-to-text technology.

Step two, preprocessing is carried out before the characteristic extraction of the voice signal and the text signal preferentially, and the method comprises the following steps:

resampling the speech signal at 16 kHz and fixing the audio length to 30 seconds, calculating a logarithmic mel-spectrogram using a window of 25 milliseconds and a step size of 10 milliseconds to generate an 80-dimensional audio frame feature representation;

Then, the feature extraction is carried out on the audio frame feature representation by utilizing a pre-trained low-language encoder, and the feature extraction is carried out on the text feature representation by utilizing a pre-trained bidirectional encoder; in the application, the low-language encoder and the bidirectional encoder models are pre-trained based on transformers under a large amount of data, so that the models obtain more robust semantic information compared with manual design and neural networks.

Wherein the encoder structure is shown in figures 3 and 4,

The low-language encoder is formed by overlapping 6 layers of transformer networks, the two-way encoder is formed by overlapping 12 layers of transformer networks, and each layer of transformer network comprises a self-attention layer and a multi-layer perceptron which are serially connected in sequence.

Further, the dimension of the speech feature vector extracted through the pre-training network is N (length) ×768, denoted by T ₆ (x), and the dimension of the text feature vector extracted through the pre-training network is N (length) ×768, denoted by T ₁₂ (x).

Step three, comparing and aligning, namely mapping the audio features and the text features into hyperbolic geometric spaces respectively, determining hyperbolic distances between the mapped audio features and text features, and aligning the audio features and the text features by taking the hyperbolic distances as similarity;

The method specifically comprises the following steps:

3.1, mapping the audio features and the text features into a hyperbolic geometric space according to the following formulas respectively;

wherein, X _hyai and X _hyti respectively represent audio and text features in hyperbolic space, and are feature vectors of 768 dimensions, a represents audio features, t represents text features, tanh (·) represents hyperbolic tangent function, and c represents sphere negative curvature.

3.2, After the hyperbolic mapping operation, calculating the hyperbolic distance between the mapped audio feature and the text feature;

Wherein S (a, t) represents the hyperbolic distance of the mapped audio feature X _hyai and text feature X _hyti, and S (t, a) represents the hyperbolic distance of the mapped text feature X _hyti and audio feature X _hyai; wherein the method comprises the steps of Representing an operator in hyperbolic space, the operator is represented as follows:

3.3, taking the hyperbolic distance as the similarity, and normalizing according to the following formula:

Where N represents the batch size, ɛ is the temperature coefficient, which is a learnable parameter, and m represents the similar interval.

3.4, Determining audio-text contrast loss according to cross entropy, and optimizing hyperbolic distance to further realize alignment of audio features and text features; the comparative losses are expressed as follows:

Wherein L _t2a is the alignment loss of the audio feature, L _a2t is the alignment loss of the text feature, an

Y ^a2t (a) and y ^t2a (t) are the generated one-hot labels, where the probability of the negative pair is 0 and the probability of the positive pair is 1.

Step four, carrying out cross attention fusion and feature splicing on the aligned audio features and text features in sequence, wherein the method comprises the following steps:

The invention uses two cross attention modules to fuse audio and text features, wherein, cross attention obtains q, k and v from different modal representations, for k different head attention, attention weight is calculated, output vectors of k heads are connected to obtain a first/second fusion feature, and dimension of the fusion feature is 1536.

And step six, recognizing viscera and organs according to the voice characteristics of the human body.

In this embodiment, the human voice features obtained by stitching are input into the classifier to perform viscera organ recognition, which is expressed as:

Wherein, Representing the organ category predicted by the model.

Finally, the predicted viscera organ category Y is obtained, and y= { intestine, lung, liver, spleen, kidney, stomach, heart, other, health }.

Visceral organ recognition belongs to a classification task, so that a back propagation optimization algorithm is required to optimize the classification loss to minimize the classification loss as much as possible. The training penalty of this embodiment is mainly realized by a class cross entropy penalty, which is defined as:

in the method, in the process of the invention, Y _i represents the probability of the ith viscera, which is 1 instead of 0, representing the true label of the input viscera.

In a preferred embodiment, classification loss and alignment loss are used together as a loss function to optimize the network used in the multi-modal voice viscera recognition method based on hyperbolic spatial alignment. I.e.

In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-modal speech visceral organ recognition method based on hyperbolic spatial alignment, comprising:

Acquiring a human voice signal and a corresponding text signal;

2. The method for recognition of multi-modal voice viscera based on hyperbolic spatial alignment as defined in claim 1, wherein preprocessing is performed before feature extraction of the voice signal and the text signal, comprising:

Resampling the speech signal at 16kHz with an audio length fixed at 30 seconds, calculating a logarithmic mel-spectrogram using a window of 25 milliseconds and a step size of 10 milliseconds to generate an 80-dimensional audio frame feature representation;

3. The multi-modal speech visceral organ recognition method based on hyperbolic spatial alignment according to claim 2, wherein the audio frame feature representation is feature extracted with a low-speech encoder and the text feature representation is feature extracted with a bi-directional encoder; wherein,

The low-language encoder is formed by overlapping 6 layers of transformer networks, and the bidirectional encoder is formed by overlapping 12 layers of transformer networks.

4. The method for recognition of multi-modal voice viscera based on hyperbolic spatial alignment of claim 3, wherein each layer of transformer network comprises a self-attention layer and a multi-layer perceptron connected in series.

5. The multi-modal speech visceral organ recognition method based on hyperbolic spatial alignment according to claim 1, wherein the audio features mapped into the hyperbolic geometric space are represented as follows:

；

6. The multi-modal voice visceral organ recognition method based on hyperbolic spatial alignment according to claim 5, wherein the hyperbolic distance between the mapped audio feature and the text feature is obtained by the following formula;

；

7. The method for recognition of multi-modal voice viscera based on hyperbolic spatial alignment of claim 6, wherein the process of aligning the audio feature and the text feature with hyperbolic distance as similarity comprises:

The similarity is normalized according to the following formula:

；

where N represents the batch size, ɛ is the temperature coefficient, is a learnable parameter, and m represents the similar interval;

determining audio-text contrast loss according to the cross entropy, and optimizing the hyperbolic distance; the comparative losses are expressed as follows:

；

8. The method for identifying multi-modal voice viscera based on hyperbolic spatial alignment according to claim 1, wherein the process of sequentially performing cross-attention fusion and feature stitching on the aligned audio features and text features comprises the following steps:

9. The multi-modal voice viscera recognition method based on hyperbolic spatial alignment in accordance with claim 1, wherein human voice features are input into a classifier for viscera recognition, wherein a loss function of the classifier is:

；