CN117958765A - Multi-modal voice visceral organ recognition method based on hyperbolic space alignment

Multi-modal voice visceral organ recognition method based on hyperbolic space alignment

Info

Publication number
CN117958765A
Authority
CN
China
Prior art keywords: text, hyperbolic, feature, audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410386135.XA
Other languages
Chinese (zh)
Other versions
CN117958765B (en)
Inventor
文贵华 (Wen Guihua)
王昶 (Wang Chang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202410386135.XA
Publication of CN117958765A
Application granted
Publication of CN117958765B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a multi-modal voice visceral organ recognition method based on hyperbolic space alignment, which comprises: acquiring a human voice signal and a corresponding text signal; extracting features from the voice signal and the text signal respectively to obtain audio features and text features; mapping the audio features and the text features into a hyperbolic geometric space, determining the hyperbolic distance between the mapped audio and text features, and aligning the audio and text features using the hyperbolic distance as similarity; sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features to obtain human voice features; and recognizing the visceral organ according to the human voice features. By combining multi-modal fusion with hyperbolic-space feature alignment, the invention aligns the features in hyperbolic space before fusion, which improves the accuracy of multi-modal voice visceral organ recognition.

Description

Multi-modal voice visceral organ recognition method based on hyperbolic space alignment
Technical Field
The invention relates to the technical field of visceral organ recognition, and in particular to a multi-modal voice visceral organ recognition method based on hyperbolic space alignment.
Background
In traditional Chinese medicine, listening diagnosis is a very important means, and voice, as an important source of medical information, plays a key role.
According to traditional Chinese medicine theory, diseases of the visceral organs often manifest as changes in the body's qi, blood, channels and collaterals, which in turn produce corresponding vocal expressions. Second, the textual content of speech indirectly reflects the patient's state and symptom descriptions, providing additional information. Moreover, voice collection is a non-invasive form of examination that causes the patient no pain or discomfort, so obtaining medical information through voice is safe and harmless.
At present, however, recognizing visceral conditions from speech requires the rich experience of traditional Chinese medicine specialists; accurate recognition is difficult for inexperienced doctors and for non-professionals, so the assistance of intelligent tools is needed.
Yet there are currently very few studies that use speech for automatic visceral organ recognition. This application therefore provides a multi-modal voice visceral organ recognition method based on hyperbolic space alignment.
Disclosure of Invention
In view of the above, and in order to overcome the shortcomings of the prior art, the present invention provides a multi-modal voice visceral organ recognition method in which the voice and text modalities are mutually aligned and complement each other to obtain more accurate and comprehensive results.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multi-modal voice visceral organ recognition method based on hyperbolic space alignment comprises:
acquiring a human voice signal and a corresponding text signal;
extracting features from the voice signal and the text signal respectively to obtain audio features and text features;
mapping the audio features and the text features into a hyperbolic geometric space, determining the hyperbolic distance between the mapped audio and text features, and aligning the audio and text features using the hyperbolic distance as similarity;
sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features to obtain human voice features;
and recognizing the visceral organ according to the human voice features.
Preferably, preprocessing is performed before feature extraction of the voice signal and the text signal, comprising:
resampling the voice signal to 16 kHz and fixing the audio length to 30 seconds, then computing a log-mel spectrogram with a 25 ms window and a 10 ms step to generate an 80-dimensional audio frame feature representation;
and tokenizing the text signal, limiting each input text to 128 tokens; after tokenization, special start and separator markers are added at the beginning and end of the sequence to obtain a token-level text feature representation.
Preferably, features are extracted from the audio frame feature representation with a Whisper speech encoder and from the text feature representation with a bidirectional encoder; wherein
the Whisper encoder is a stack of 6 Transformer layers and the bidirectional encoder is a stack of 12 Transformer layers. Each Transformer layer is obtained by connecting a multi-head attention mechanism and a feed-forward neural network in series, i.e., each Transformer layer comprises a self-attention sublayer and a multi-layer perceptron connected sequentially in series;
through upstream pre-training on large amounts of audio and text data, the Whisper encoder and the bidirectional encoder ensure that the extracted audio and text features are robust. Experiments show that these pre-trained Transformer networks yield rich high-dimensional semantic features compared with manually designed hand-crafted features.
Preferably, the audio features mapped into the hyperbolic geometric space are represented as follows:

$$X_{hya_i} = \tanh\!\left(\sqrt{c}\,\lVert a_i\rVert\right)\frac{a_i}{\sqrt{c}\,\lVert a_i\rVert}$$

and the text features mapped into the hyperbolic geometry are represented as follows:

$$X_{hyt_i} = \tanh\!\left(\sqrt{c}\,\lVert t_i\rVert\right)\frac{t_i}{\sqrt{c}\,\lVert t_i\rVert}$$

where $X_{hya_i}$ and $X_{hyt_i}$ denote the audio and text features in hyperbolic space respectively, $a$ denotes the audio feature, $t$ denotes the text feature, $\tanh(\cdot)$ denotes the hyperbolic tangent function, and $c$ denotes the negative curvature of the ball.
Preferably, the hyperbolic distance between the mapped audio and text features is obtained by the following formula:

$$S(a,t)=\frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\bigl\lVert\left(-X_{hya_i}\right)\oplus_c X_{hyt_i}\bigr\rVert\right)$$

where $S(a,t)$ denotes the hyperbolic distance between the mapped audio feature $X_{hya_i}$ and text feature $X_{hyt_i}$, $S(t,a)$ denotes the hyperbolic distance between the mapped text feature $X_{hyt_i}$ and audio feature $X_{hya_i}$, and $\oplus_c$ is the Möbius addition operator given in the detailed description.
Preferably, the process of aligning the audio and text features using the hyperbolic distance as similarity comprises: taking the hyperbolic distance as the feature similarity and normalizing the similarity with the following formula:

$$p_i^{a2t}=\frac{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)}{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(a_i,t_j)/\varepsilon\big)},\qquad p_i^{t2a}=\frac{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)}{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(t_i,a_j)/\varepsilon\big)}$$

where $p_i^{a2t}$ denotes the normalized similarity of the $i$-th audio to the texts, $p_i^{t2a}$ denotes the normalized similarity of the $i$-th text to the audios, $N$ denotes the batch size (the index $j$ runs over the batch), $\varepsilon$ is the temperature coefficient, a learnable parameter, and $m$ denotes the similarity margin;

determining the audio-text contrastive loss according to the cross entropy and optimizing the hyperbolic distance, calculated as follows:

$$L_{a2t}=-\frac{1}{N}\sum_{i=1}^{N}y^{a2t}(a_i)\log p_i^{a2t},\qquad L_{t2a}=-\frac{1}{N}\sum_{i=1}^{N}y^{t2a}(t_i)\log p_i^{t2a}$$

where $y^{a2t}(a)$ and $y^{t2a}(t)$ denote the one-hot labels generated within a batch: the label of a positive sample pair is 1 and the label of a negative sample pair is 0.

The contrastive loss is expressed as follows:

$$L_{con}=\tfrac{1}{2}\left(L_{a2t}+L_{t2a}\right)$$
Preferably, the process of sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features comprises:
performing cross-attention fusion from the aligned audio features to the text features to obtain a first fusion feature;
performing cross-attention fusion from the aligned text features to the audio features to obtain a second fusion feature;
and concatenating the first fusion feature and the second fusion feature to obtain the human voice features.
Preferably, the human voice features are input into a classifier for visceral organ recognition, where the loss function of the classifier is the categorical cross entropy:

$$L_{cls}=-\sum_{i} y_i\log\hat{y}_i$$

where $\hat{y}_i$ denotes the predicted probability of the $i$-th organ category and $y_i$ denotes the true label of the input organ.
Further, the classification loss and the alignment loss are used together as the loss function to optimize the multi-modal voice visceral organ recognition method based on hyperbolic space alignment.
According to the above technical scheme, the invention discloses a multi-modal voice visceral organ recognition method based on hyperbolic space alignment. Compared with the prior art, the method combines multi-modal fusion with hyperbolic-space feature alignment so that the features are aligned in hyperbolic space before fusion, which effectively improves the accuracy of multi-modal voice visceral organ recognition. By combining deep learning with domain knowledge of traditional Chinese medicine, the invention provides a new approach to visceral organ recognition: a computer program can stably and rapidly recognize the organ category corresponding to a voice, overcoming the limitation of relying on the experience of expert practitioners.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the multi-modal voice visceral organ recognition method based on hyperbolic space alignment of the present invention;
FIG. 2 is a schematic diagram of the multi-modal voice visceral organ recognition process based on hyperbolic space alignment of the present invention;
FIG. 3 is a schematic diagram of the speech feature extraction model structure of the present invention;
FIG. 4 is a schematic diagram of the text feature extraction model structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, the multi-modal voice visceral organ recognition method based on hyperbolic space alignment disclosed in this embodiment of the invention comprises the following steps:
acquiring a human voice signal and a corresponding text signal;
extracting features from the voice signal and the text signal respectively to obtain audio features and text features;
mapping the audio features and the text features into the hyperbolic geometric space respectively, determining the hyperbolic distance between the mapped audio and text features, and aligning the audio and text features using the hyperbolic distance as similarity;
sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features to obtain human voice features;
and recognizing the visceral organ according to the human voice features.
In one embodiment, as shown in FIG. 2:
Step one, human voice signals are collected with a recording device, and the voice signals are converted into text using existing speech-to-text technology.
Step two, preprocessing is performed before feature extraction of the voice signal and the text signal, including:
resampling the voice signal to 16 kHz and fixing the audio length to 30 seconds, then computing a log-mel spectrogram with a 25 ms window and a 10 ms step to generate an 80-dimensional audio frame feature representation;
and tokenizing the text signal, limiting each input text to 128 tokens; after tokenization, special start and separator markers are added at the beginning and end of the sequence to obtain a token-level text feature representation.
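A minimal sketch of this audio preprocessing follows, assuming PyTorch/torchaudio; the window and hop sizes come from the text (400 and 160 samples at 16 kHz), while the simple additive-epsilon log compression is an assumption rather than the patented embodiment.

```python
import torch
import torchaudio

def log_mel_features(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Resample to 16 kHz, pad/trim to 30 s, return an (n_frames, 80) log-mel matrix."""
    target_sr = 16_000
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    n_samples = target_sr * 30                                  # fixed 30-second length
    wav = torch.nn.functional.pad(wav, (0, max(0, n_samples - wav.shape[-1])))
    wav = wav[..., :n_samples]
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=target_sr,
        n_fft=400,        # 25 ms window at 16 kHz
        hop_length=160,   # 10 ms step
        n_mels=80,        # 80-dimensional frame features
    )(wav)
    return torch.log(mel + 1e-6).transpose(-1, -2)              # (frames, 80)
```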
Then, features are extracted from the audio frame feature representation with a pre-trained Whisper speech encoder, and from the text feature representation with a pre-trained bidirectional encoder. In this application, the Whisper encoder and the bidirectional encoder are Transformer-based models pre-trained on large amounts of data, so they capture more robust semantic information than hand-crafted features or networks trained from scratch.
The encoder structures are shown in FIG. 3 and FIG. 4.
The Whisper encoder is a stack of 6 Transformer layers and the bidirectional encoder is a stack of 12 Transformer layers; each Transformer layer comprises a self-attention sublayer and a multi-layer perceptron connected sequentially in series.
Further, the speech feature matrix extracted by the pre-trained network has dimension $N_{\text{length}}\times 768$, denoted $T_6(x)$, and the text feature matrix extracted by the pre-trained network has dimension $N_{\text{length}}\times 768$, denoted $T_{12}(x)$.
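As a structural illustration only, a sketch of the two stacks in plain PyTorch follows; a real embodiment would load pre-trained weights (e.g., Whisper and BERT checkpoints) rather than train from scratch, and the head count of 8 is an assumption not stated in the text.

```python
import torch.nn as nn

def make_encoder(num_layers: int, d_model: int = 768, n_heads: int = 8) -> nn.TransformerEncoder:
    """A stack of Transformer layers, each a self-attention sublayer plus an MLP in series."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads,
        dim_feedforward=4 * d_model, batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

audio_encoder = make_encoder(num_layers=6)   # 6-layer stack  -> T6(x),  N x 768
text_encoder = make_encoder(num_layers=12)   # 12-layer stack -> T12(x), N x 768
```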
Step three, contrastive alignment: the audio features and the text features are mapped into the hyperbolic geometric space respectively, the hyperbolic distance between the mapped audio and text features is determined, and the audio and text features are aligned using the hyperbolic distance as similarity.
The method specifically comprises the following steps:
3.1, mapping the audio features and the text features into the hyperbolic geometric space according to the following formulas respectively:

$$X_{hya_i} = \tanh\!\left(\sqrt{c}\,\lVert a_i\rVert\right)\frac{a_i}{\sqrt{c}\,\lVert a_i\rVert},\qquad X_{hyt_i} = \tanh\!\left(\sqrt{c}\,\lVert t_i\rVert\right)\frac{t_i}{\sqrt{c}\,\lVert t_i\rVert}$$

where $X_{hya_i}$ and $X_{hyt_i}$ denote the audio and text features in hyperbolic space respectively, each a 768-dimensional feature vector; $a$ denotes the audio feature, $t$ denotes the text feature, $\tanh(\cdot)$ denotes the hyperbolic tangent function, and $c$ denotes the negative curvature of the ball.
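A minimal sketch of this mapping follows, assuming it is the standard exponential map at the origin of a Poincaré ball with curvature parameter $c>0$, which matches the tanh form above; the norm clamp is a numerical-stability assumption.

```python
import torch

def expmap0(x: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Project a Euclidean feature onto the Poincare ball:
    tanh(sqrt(c) * ||x||) * x / (sqrt(c) * ||x||)."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)  # guard against division by zero
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

X_hya = expmap0(torch.randn(4, 768))  # mapped audio features
X_hyt = expmap0(torch.randn(4, 768))  # mapped text features
```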
3.2, after the hyperbolic mapping operation, the hyperbolic distance between the mapped audio and text features is calculated:

$$S(a,t)=\frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\bigl\lVert\left(-X_{hya_i}\right)\oplus_c X_{hyt_i}\bigr\rVert\right)$$

where $S(a,t)$ denotes the hyperbolic distance between the mapped audio feature $X_{hya_i}$ and text feature $X_{hyt_i}$, and $S(t,a)$ denotes the hyperbolic distance between the mapped text feature $X_{hyt_i}$ and audio feature $X_{hya_i}$; $\oplus_c$ denotes the Möbius addition operator in hyperbolic space, which is expressed as follows:

$$x\oplus_c y=\frac{\left(1+2c\langle x,y\rangle+c\lVert y\rVert^{2}\right)x+\left(1-c\lVert x\rVert^{2}\right)y}{1+2c\langle x,y\rangle+c^{2}\lVert x\rVert^{2}\lVert y\rVert^{2}}$$
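The sketch below implements the operator and the distance, following the standard Poincaré-ball formulas (as in Ganea et al., "Hyperbolic Neural Networks", 2018); the clamps are stability assumptions and the patented embodiment may differ in detail.

```python
import torch

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Mobius addition, the hyperbolic-space operator referenced above."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp_min(1e-6)
    return num / den

def hyperbolic_distance(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) mobius_add y||)."""
    sqrt_c = c ** 0.5
    norm = mobius_add(-x, y, c).norm(dim=-1)
    return (2.0 / sqrt_c) * torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5))
```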
3.3, the hyperbolic distance is taken as the similarity and normalized according to the following formula:

$$p_i^{a2t}=\frac{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)}{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(a_i,t_j)/\varepsilon\big)},\qquad p_i^{t2a}=\frac{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)}{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(t_i,a_j)/\varepsilon\big)}$$

where $N$ denotes the batch size (the index $j$ runs over the batch), $\varepsilon$ is the temperature coefficient, a learnable parameter, and $m$ denotes the similarity margin.
3.4, the audio-text contrastive loss is determined according to the cross entropy and the hyperbolic distance is optimized, thereby aligning the audio and text features; the contrastive loss is expressed as follows:

$$L_{con}=\tfrac{1}{2}\left(L_{t2a}+L_{a2t}\right),\qquad L_{a2t}=-\frac{1}{N}\sum_{i=1}^{N}y^{a2t}(a_i)\log p_i^{a2t},\qquad L_{t2a}=-\frac{1}{N}\sum_{i=1}^{N}y^{t2a}(t_i)\log p_i^{t2a}$$

where $L_{t2a}$ is the text-to-audio alignment loss, $L_{a2t}$ is the audio-to-text alignment loss, and $y^{a2t}(a)$ and $y^{t2a}(t)$ are the generated one-hot labels, in which negative pairs have probability 0 and positive pairs have probability 1.
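A hedged sketch of this margin- and temperature-normalized contrastive alignment follows, reusing `hyperbolic_distance` from the sketch above. Negated distance is used as similarity so that closer points score higher; the fixed `eps` and `m` values are placeholders (the text makes the temperature learnable), and the exact normalization of the patented embodiment may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(audio_h: torch.Tensor, text_h: torch.Tensor,
                   c: float = 1.0, eps: float = 0.07, m: float = 0.2) -> torch.Tensor:
    """audio_h, text_h: (N, d) batches already mapped onto the Poincare ball."""
    N = audio_h.shape[0]
    # pairwise similarities: sim[i, j] = -d_c(a_i, t_j), so matched pairs score higher
    sim = -hyperbolic_distance(audio_h.unsqueeze(1), text_h.unsqueeze(0), c)
    sim = sim - m * torch.eye(N, device=sim.device)    # margin m on the positive diagonal
    target = torch.arange(N, device=sim.device)        # one-hot labels: pair (i, i) is positive
    loss_a2t = F.cross_entropy(sim / eps, target)      # audio -> text direction
    loss_t2a = F.cross_entropy(sim.t() / eps, target)  # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```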
Step four, sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features, comprising:
performing cross-attention fusion from the aligned audio features to the text features to obtain a first fusion feature;
performing cross-attention fusion from the aligned text features to the audio features to obtain a second fusion feature;
and concatenating the first fusion feature and the second fusion feature to obtain the human voice features.
The invention uses two cross-attention modules to fuse the audio and text features. In cross attention, the query q is taken from one modality and the key k and value v from the other; attention weights are computed for each of the k attention heads, and the output vectors of the k heads are concatenated to obtain the first or second fusion feature. The dimension of the final fused feature is 1536.
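A sketch of this bidirectional fusion follows, assuming `nn.MultiheadAttention`; the mean pooling over the time dimension before concatenation is an assumption, since the text does not specify how the sequence outputs are reduced to the 1536-dimensional vector.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.a2t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (B, La, 768), text: (B, Lt, 768)
        fused_a, _ = self.a2t(query=audio, key=text, value=text)   # first fusion feature
        fused_t, _ = self.t2a(query=text, key=audio, value=audio)  # second fusion feature
        # pool over time and concatenate -> (B, 1536) human voice feature
        return torch.cat([fused_a.mean(dim=1), fused_t.mean(dim=1)], dim=-1)
```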
Step five, the visceral organ is recognized according to the human voice features.
In this embodiment, the human voice features obtained by concatenation are input into the classifier for visceral organ recognition, expressed as:

$$\hat{Y}=\operatorname{softmax}\!\left(W\,h+b\right)$$

where $\hat{Y}$ denotes the organ category predicted by the model, $h$ is the concatenated human voice feature, and $W$ and $b$ are the classifier parameters.
Finally, the predicted visceral organ category $Y$ is obtained, where $Y\in\{$intestine, lung, liver, spleen, kidney, stomach, heart, other, health$\}$.
Visceral organ recognition is a classification task, so a back-propagation optimization algorithm is used to minimize the classification loss. The training loss of this embodiment is mainly the categorical cross-entropy loss, defined as:

$$L_{cls}=-\sum_{i} y_i\log\hat{y}_i$$

where $\hat{y}_i$ denotes the predicted probability of the $i$-th organ category and $y_i$ denotes the true label of the input organ, which is 1 for the true category and 0 otherwise.
In a preferred embodiment, the classification loss and the alignment loss are used together as the loss function to optimize the network used in the multi-modal voice visceral organ recognition method based on hyperbolic space alignment, i.e.:

$$L=L_{cls}+L_{con}$$
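A minimal sketch of the classifier head and the joint objective follows; the single linear layer over the 1536-dimensional fused feature and the unweighted sum of the two losses are both assumptions, as the text does not state the head architecture or loss weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ORGANS = ["intestine", "lung", "liver", "spleen", "kidney",
          "stomach", "heart", "other", "health"]

classifier = nn.Linear(2 * 768, len(ORGANS))   # takes the 1536-dim fused feature

def total_loss(fused: torch.Tensor, target: torch.Tensor,
               align_loss: torch.Tensor) -> torch.Tensor:
    """Joint objective L = L_cls + L_con, minimised by back-propagation."""
    logits = classifier(fused)                 # (B, 9) organ scores
    return F.cross_entropy(logits, target) + align_loss
```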
The embodiments in this specification are described in a progressive manner, each focusing on its differences from the others; for identical or similar parts, reference may be made between the embodiments. Since the device disclosed in an embodiment corresponds to the method disclosed therein, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A multi-modal voice visceral organ recognition method based on hyperbolic space alignment, comprising:
acquiring a human voice signal and a corresponding text signal;
extracting features from the voice signal and the text signal respectively to obtain audio features and text features;
mapping the audio features and the text features into a hyperbolic geometric space, determining the hyperbolic distance between the mapped audio and text features, and aligning the audio and text features using the hyperbolic distance as similarity;
sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features to obtain human voice features;
and recognizing the visceral organ according to the human voice features.
2. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 1, wherein preprocessing is performed before feature extraction of the voice signal and the text signal, comprising:
resampling the voice signal to 16 kHz with the audio length fixed at 30 seconds, and computing a log-mel spectrogram with a 25 ms window and a 10 ms step to generate an 80-dimensional audio frame feature representation;
and tokenizing the text signal, limiting each input text to 128 tokens, wherein after tokenization special start and separator markers are added at the beginning and end of the sequence to obtain a token-level text feature representation.
3. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 2, wherein features are extracted from the audio frame feature representation with a Whisper speech encoder and from the text feature representation with a bidirectional encoder; wherein
the Whisper encoder is a stack of 6 Transformer layers and the bidirectional encoder is a stack of 12 Transformer layers.
4. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 3, wherein each Transformer layer comprises a self-attention sublayer and a multi-layer perceptron connected in series.
5. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 1, wherein the audio features mapped into the hyperbolic geometric space are represented as follows:

$$X_{hya_i} = \tanh\!\left(\sqrt{c}\,\lVert a_i\rVert\right)\frac{a_i}{\sqrt{c}\,\lVert a_i\rVert}$$

and the text features mapped into the hyperbolic geometry are represented as follows:

$$X_{hyt_i} = \tanh\!\left(\sqrt{c}\,\lVert t_i\rVert\right)\frac{t_i}{\sqrt{c}\,\lVert t_i\rVert}$$

where $X_{hya_i}$ and $X_{hyt_i}$ denote the audio and text features in hyperbolic space respectively, $a$ denotes the audio feature, $t$ denotes the text feature, $\tanh(\cdot)$ denotes the hyperbolic tangent function, and $c$ denotes the negative curvature of the ball.
6. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 5, wherein the hyperbolic distance between the mapped audio and text features is obtained by the following formula:

$$S(a,t)=\frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\bigl\lVert\left(-X_{hya_i}\right)\oplus_c X_{hyt_i}\bigr\rVert\right)$$

where $S(a,t)$ denotes the hyperbolic distance between the mapped audio feature $X_{hya_i}$ and text feature $X_{hyt_i}$, $S(t,a)$ denotes the hyperbolic distance between the mapped text feature $X_{hyt_i}$ and audio feature $X_{hya_i}$, and $\oplus_c$ denotes the Möbius addition operator.
7. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 6, wherein the process of aligning the audio and text features using the hyperbolic distance as similarity comprises:
normalizing the similarity according to the following formula:

$$p_i^{a2t}=\frac{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)}{\exp\!\big((S(a_i,t_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(a_i,t_j)/\varepsilon\big)},\qquad p_i^{t2a}=\frac{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)}{\exp\!\big((S(t_i,a_i)-m)/\varepsilon\big)+\sum_{j\neq i}\exp\!\big(S(t_i,a_j)/\varepsilon\big)}$$

where $N$ denotes the batch size (the index $j$ runs over the batch), $\varepsilon$ is the temperature coefficient, a learnable parameter, and $m$ denotes the similarity margin;
and determining the audio-text contrastive loss according to the cross entropy and optimizing the hyperbolic distance, the contrastive loss being expressed as follows:

$$L_{con}=\tfrac{1}{2}\left(L_{t2a}+L_{a2t}\right),\qquad L_{a2t}=-\frac{1}{N}\sum_{i=1}^{N}y^{a2t}(a_i)\log p_i^{a2t},\qquad L_{t2a}=-\frac{1}{N}\sum_{i=1}^{N}y^{t2a}(t_i)\log p_i^{t2a}$$

where $L_{t2a}$ is the text-to-audio alignment loss, $L_{a2t}$ is the audio-to-text alignment loss, and $y^{a2t}(a)$ and $y^{t2a}(t)$ are the generated one-hot labels, in which negative pairs have probability 0 and positive pairs have probability 1.
8. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 1, wherein the process of sequentially performing cross-attention fusion and feature concatenation on the aligned audio and text features comprises:
performing cross-attention fusion from the aligned audio features to the text features to obtain a first fusion feature;
performing cross-attention fusion from the aligned text features to the audio features to obtain a second fusion feature;
and concatenating the first fusion feature and the second fusion feature to obtain the human voice features.
9. The multi-modal voice visceral organ recognition method based on hyperbolic space alignment according to claim 1, wherein the human voice features are input into a classifier for visceral organ recognition, and the loss function of the classifier is:

$$L_{cls}=-\sum_{i} y_i\log\hat{y}_i$$

where $\hat{y}_i$ denotes the predicted probability of the $i$-th organ category and $y_i$ denotes the true label of the input organ.
CN202410386135.XA 2024-04-01 2024-04-01 Multi-modal voice visceral organ recognition method based on hyperbolic space alignment Active CN117958765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410386135.XA CN117958765B (en) 2024-04-01 2024-04-01 Multi-modal voice visceral organ recognition method based on hyperbolic space alignment

Publications (2)

Publication Number Publication Date
CN117958765A 2024-05-03
CN117958765B 2024-06-21

Family

ID=90846446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410386135.XA Active CN117958765B (en) 2024-04-01 2024-04-01 Multi-mode voice viscera organ recognition method based on hyperbolic space alignment

Country Status (1)

Country Link
CN (1) CN117958765B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11488586B1 (en) * 2021-07-19 2022-11-01 Institute Of Automation, Chinese Academy Of Sciences System for speech recognition text enhancement fusing multi-modal semantic invariance
CN115565540A (en) * 2022-12-05 2023-01-03 浙江大学 Invasive brain-computer interface Chinese pronunciation decoding method
WO2023003856A1 (en) * 2021-07-21 2023-01-26 Utech Products, Inc. Ai platform for processing speech and video information collected during a medical procedure
CN116075891A (en) * 2020-07-10 2023-05-05 诺沃斯有限公司 Speech analysis for monitoring or diagnosing health conditions
CN116467675A (en) * 2023-04-17 2023-07-21 华南理工大学 Viscera attribute coding method and system integrating multi-modal characteristics
CN116487031A (en) * 2023-04-17 2023-07-25 莆田市数字集团有限公司 Multi-mode fusion type auxiliary diagnosis method and system for pneumonia
CN117238019A (en) * 2023-09-26 2023-12-15 华南理工大学 Video facial expression category identification method and system based on space-time relative transformation
CN117476215A (en) * 2023-11-17 2024-01-30 上海触脉数字医疗科技有限公司 Medical auxiliary judging method and system based on AI
CN117672268A (en) * 2023-11-21 2024-03-08 重庆邮电大学 Multi-mode voice emotion recognition method based on relative entropy alignment fusion


Also Published As

Publication number Publication date
CN117958765B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN110364251B (en) Intelligent interactive diagnosis guide consultation system based on machine reading understanding
CN108899050A (en) Speech signal analysis subsystem based on multi-modal Emotion identification system
CN105739688A (en) Man-machine interaction method and device based on emotion system, and man-machine interaction system
Sönmez et al. A speech emotion recognition model based on multi-level local binary and local ternary patterns
CN116049743B (en) Cognitive recognition method based on multi-modal data, computer equipment and storage medium
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN112307975A (en) Multi-modal emotion recognition method and system integrating voice and micro-expressions
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
Zhang et al. Intelligent speech technologies for transcription, disease diagnosis, and medical equipment interactive control in smart hospitals: A review
CN107437090A (en) The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal
CN117672268A (en) Multi-mode voice emotion recognition method based on relative entropy alignment fusion
CN117877660A (en) Medical report acquisition method and system based on voice recognition
Siriwardena et al. The secret source: Incorporating source features to improve acoustic-to-articulatory speech inversion
CN117958765B (en) Multi-mode voice viscera organ recognition method based on hyperbolic space alignment
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN117457162A (en) Emergency call sub-diagnosis method and system based on multi-encoder and multi-mode information fusion
CN116701996A (en) Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions
Liu et al. Respiratory sounds feature learning with deep convolutional neural networks
CN116341546A (en) Medical natural language processing method based on pre-training model
CN115620370A (en) Emotion recognition method based on multi-mode clustering federal learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant