CN116467675A - Viscera attribute coding method and system integrating multi-modal characteristics - Google Patents

Viscera attribute coding method and system integrating multi-modal characteristics Download PDF

Info

Publication number
CN116467675A
Authority
CN
China
Prior art keywords
space
tongue
attribute
model
organ
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310404163.5A
Other languages
Chinese (zh)
Inventor
陈家炜
文贵华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310404163.5A priority Critical patent/CN116467675A/en
Publication of CN116467675A publication Critical patent/CN116467675A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Epidemiology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for encoding visceral organ attributes by fusing multi-modal features. The system comprises a data acquisition module, a data processing module, a multi-modal fusion feature coding model and a model construction training module. Tongue images and patient sounds are collected and labeled to obtain the corresponding visceral organ attribute labels, and the two kinds of data are processed separately; deep neural network models are used to extract the individual features of the tongue-image modality and of the sound modality. Taking the consistency and complementarity of the representation spaces as constraints, the individual features of the tongue image data and of the sound data are fused, and supervised learning is performed with the visceral organ labels and organ attribute labels, so that prior guiding knowledge of the visceral organ attributes is embedded and a multi-modal fusion feature coding model is obtained. Newly collected tongue images and patient sounds are processed and fed to the multi-modal fusion feature coding model to obtain the corresponding visceral organ attribute labels, which improves the accuracy and objectivity of visceral organ attribute coding.

Description

Viscera attribute coding method and system integrating multi-modal characteristics
Technical Field
The invention relates to the technical field of machine learning, and in particular to a method and a system for encoding visceral organ attributes by fusing multi-modal features.
Background
In recent years, artificial intelligence has developed rapidly; in particular, deep neural network machine learning based on big data has advanced greatly and been widely applied. In Western medicine, artificial intelligence has already been used extensively and has produced a series of breakthrough results, such as deep-learning-based image diagnosis and automatic medical record processing; these techniques have greatly improved the efficiency and accuracy of medical diagnosis, and applying deep learning to medical diagnosis has become a trend.
At present, the application of traditional Chinese medicine (TCM) in the field of artificial intelligence is receiving increasing attention. TCM diagnosis is based on a holistic concept and dialectical thinking, and its diagnostic methods center on syndrome differentiation and treatment, comprising the four examinations of inspection, listening and smelling, inquiry, and palpation; a patient's condition is determined by comprehensively judging the face, eyes, tongue, voice, pulse and other signs. Among these, the tongue image and the voice belong to inspection and to listening and smelling, respectively, and they also serve as a basis for identifying the visceral organs and their attributes.
However, research on deep-learning-based coding of the visceral organs and their attributes from tongue images and voice in multiple modalities still has many shortcomings. On the one hand, most current diagnostic models encode the visceral organs and their attributes with machine learning, and these algorithms cannot account for the inherent characteristics of TCM features; the classification and diagnostic standards of TCM for organs such as the viscera differ from those of Western medicine, so the data and algorithms need to be adjusted and optimized accordingly. For example, factors such as tongue morphology, color and coating, as well as the pitch, timbre and the time-domain and frequency-domain characteristics of the voice, must be considered, and such features cannot be handled well by traditional machine learning algorithms. On the other hand, according to TCM thinking and diagnostic theory, combining multiple examinations is a diagnostic method specific to TCM, yet there is currently no study of multi-modal combined diagnosis of the visceral organs and their attributes; although existing single-modality coding and diagnosis models perform well, a certain degree of subjectivity and inconsistency remains in practice, and comprehensive diagnosis from different angles by combining different modalities is needed.
Therefore, from the point of view of multi-modal diagnosis, combining tongue image data and voice data to encode the visceral organ attributes is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and a system for encoding the properties of internal organs by fusing multi-modal features to solve the problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an internal organ attribute coding method integrating multi-modal characteristics comprises the following steps:
s1, collecting tongue images and patient sounds, and labeling and obtaining visceral organ attribute tags corresponding to the tongue images and the sounds, wherein the tags comprise visceral organ category tags and organ attribute tags corresponding to each visceral organ;
s2, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
s3, taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and respectively extracting the individual characteristics of tongue image modes and the individual characteristics of sound modes by using a deep neural network model;
s4, taking consistency and complementarity of the representation space as constraints, fusing modal features of individual features of tongue image data and individual features of sound data, and performing supervised learning by using the internal organ labels and the organ attribute labels so as to embed priori guiding knowledge of internal organ attributes and obtain a multi-modal fusion feature coding model of the internal organ attributes embedded with the priori knowledge.
Preferably, the method for encoding the attribute of the internal organs by fusing the multi-modal features further comprises the following steps:
s5, collecting tongue images and patient sounds, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms, and inputting the processed batch tongue image data and sound data spectrograms into a multi-mode fusion feature coding model of the internal organ attributes embedded with priori knowledge to obtain internal organ attribute tags corresponding to the tongue images and the sounds.
Preferably, the specific content of step S2 includes:
performing tongue coating target detection and target region cutting on tongue image data by adopting a target detection model, expanding an image by a bilinear interpolation mode, performing random cutting according to the original size to obtain an output image copy with the same size as the original size, horizontally overturning the output image copy, and respectively performing normalization processing on three basic color channels of red, green and blue of the image;
and (3) adopting a voice denoising model to perform audio denoising, using an audio and music signal processing tool to reject a mute frame, randomly intercepting a sound fragment, performing pre-emphasis, framing and windowing processing, and converting a time domain sound signal into a spectrogram through time-frequency transformation.
Preferably, the deep neural network models in S3 include a convolutional neural network combined with an MLP-like module (CNN + MLP-like model) and a recurrent neural network combined with an MLP-like module (RNN + MLP-like model);
the batch tongue images after data processing are taken as input image data, and the individual features of the image data are extracted with the CNN + MLP-like model;
the batch spectrograms are taken as input sound data, and the individual features of the sound data are extracted with the RNN + MLP-like model.
Preferably, the CNN + MLP-like model comprises a plurality of convolutional layers, a normalization layer, a downsampling layer and a fully connected layer;
the RNN + MLP-like model comprises a plurality of recurrent units, a normalization layer, a downsampling layer and a fully connected layer.
Preferably, the specific content of step S4 is:
s41, calculating Euclidean distances of individual features of tongue image modes and individual features of sound modes in Euclidean space; calculating hyperbolic distances of tongue image characteristic mode individual features and voice mode individual features in the hyperbolic space; the cosine similarity of the Euclidean distance and the hyperbolic distance is calculated, namely the consistency of the representation space is shown;
s42, fusing individual characteristics of tongue image modes and individual characteristics of sound modes by using a cross-mode bridging fusion strategy, and outputting viscera attribute characteristics by adopting a Sigmoid activation function;
s43, mapping the output viscera attribute characteristics to European space and hyperbolic space respectively, and calculating cross entropy loss in each representing space, namely representing space complementarity;
s44, obtaining a loss function of the internal organ attribute by combining consistency and complementarity of the representation space, and obtaining the multi-mode fusion feature coding model by updating the loss function through model training parameters after repeated iterative training.
Preferably, the specific content indicating the consistency of the space in S41 is:
where Z_t^e is the mapping of the tongue-image modality's individual features in Euclidean space, Z_s^e is the mapping of the sound modality's individual features in Euclidean space, Z_t^h is the mapping of the tongue-image modality's individual features in hyperbolic space, and Z_s^h is the mapping of the sound modality's individual features in hyperbolic space;
d_e is the Euclidean space distance metric:
d_e(Z_t^e, Z_s^e) = ‖Z_t^e − Z_s^e‖_2
d_h is the hyperbolic space distance metric:
d_h(Z_t^h, Z_s^h) = (2/√|c|) · arctanh(√|c| · ‖(−Z_t^h) ⊕_c Z_s^h‖)
where ⊕_c is the Möbius addition and c is the space curvature constant, c < 0;
d_1 takes the cosine distance loss:
d_1(u, v) = 1 − ⟨u, v⟩ / (‖u‖ · ‖v‖)
preferably, the specific content indicating the complementarity of the space is:
the Euclidean space distance similarity measure is d_2(Ŷ^e, Y^e), computed between the organ attribute prediction and the organ attribute label mapped into Euclidean space;
the hyperbolic space structural similarity measure is d_3(Ŷ^h, Y^h), computed between the organ attribute prediction and the organ attribute label mapped into hyperbolic space;
d_2 and d_3 take the cross-entropy loss d_ce and the hyperbolic space distance d_h respectively,
where c is the space curvature constant, c < 0.
Preferably, the specific content of S44 is:
the consistency constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_consis = d_1( d_e(Z_t^e, Z_s^e), d_h(Z_t^h, Z_s^h) )
the complementarity constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_compl = d_2(Ŷ^e, Y^e) + d_3(Ŷ^h, Y^h)
the loss function of the multi-modal fusion feature coding model is:
L = W_ce · L_ce + W_consis · L_consis + W_compl · L_compl
where W is the weight of each sub-term; specifically, W_ce is the weight of the supervised cross-entropy term L_ce, W_consis is the weight of the consistency term L_consis, and W_compl is the weight of the complementarity term L_compl.
A viscera attribute coding system integrating multi-modal features comprises a data acquisition module, a data processing module, a multi-modal fusion feature coding model and a model construction training module;
the model building training module comprises a labeling unit, a feature extraction unit, a modal feature fusion unit and a supervised learning unit;
the data acquisition module is used for acquiring tongue images and patient sounds;
the data processing module is used for respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
the multi-mode fusion feature coding model is used for obtaining visceral organ attribute labels corresponding to tongue images and voice according to the processed batch tongue image data and voice data spectrograms;
the model construction training module is used for constructing and training to obtain a multi-mode fusion feature coding model;
the labeling unit is used for labeling and acquiring visceral organ attribute labels corresponding to tongue images and sounds, wherein the labels comprise visceral organ category labels and organ attribute labels corresponding to each visceral organ;
the feature extraction unit is used for taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and extracting the individual features of the tongue image mode and the individual features of the sound mode respectively by using the deep neural network model;
the modal feature fusion unit is used for fusing the individual features of the tongue image data and the individual features of the sound data by taking consistency and complementarity of the representation space as constraints;
and the supervised learning unit performs supervised learning by using the internal organ labels and the organ attribute labels so as to embed priori guiding knowledge of the internal organ attributes and obtain a multimodal fusion feature coding model of the internal organ attributes embedded with the priori knowledge.
Compared with the prior art, the invention discloses a method and a system for encoding visceral organ attributes by fusing multi-modal features. By combining tongue image and voice data and adopting multi-modal fusion, the information contained in each modality is understood more comprehensively from different views and modalities, so that the latent patterns and rules behind the data are mined and the classification performance is improved. The correlation between the consistency and the complementarity of the representation spaces is introduced and applied to the multi-modal model, and the consistent and complementary components between data of different modalities are exploited to improve the multi-modal feature fusion capability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a multi-modal organ attribute encoding method according to the present invention;
FIG. 2 is a schematic diagram of a neural network combined MLP model structure provided by the invention;
FIG. 3 is a schematic diagram of structural parameters of a multi-mode fusion feature coding model provided by the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an internal organ attribute coding method integrating multi-mode characteristics, which comprises the following steps:
s1, collecting tongue images and patient sounds, and labeling and obtaining visceral organ attribute tags corresponding to the tongue images and the sounds, wherein the tags comprise visceral organ category tags and organ attribute tags corresponding to each visceral organ;
s2, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
s3, taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and respectively extracting the individual characteristics of tongue image modes and the individual characteristics of sound modes by using a deep neural network model;
s4, taking consistency and complementarity of the representation space as constraints, carrying out multi-mode feature fusion on individual features of tongue image data and individual features of sound data, and carrying out supervised learning by using the internal organ labels and organ attribute labels so as to embed priori guiding knowledge of internal organ attributes and obtain a multi-mode fusion feature coding model of the internal organ attributes embedded with priori knowledge.
In this embodiment, the visceral organ category labels include large intestine, gall bladder, lung, liver, bladder, spleen, kidney, stomach, small intestine, heart and unknown; the organ attribute labels include qi deficiency, blood deficiency, yin deficiency, yang deficiency, qi stagnation, blood stasis, phlegm, wind, heat, cold, dryness, dampness and unknown.
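For illustration only, the following is a minimal sketch of how the organ category labels and organ attribute labels listed above could be encoded as multi-hot supervision targets; the concrete label format is not specified in this embodiment, so the vocabulary ordering and the combined 24-dimensional target vector are assumptions.

import torch

# Label vocabularies taken from the embodiment above; each list ends with "unknown".
ORGANS = ["large intestine", "gall bladder", "lung", "liver", "bladder", "spleen",
          "kidney", "stomach", "small intestine", "heart", "unknown"]
ATTRIBUTES = ["qi deficiency", "blood deficiency", "yin deficiency", "yang deficiency",
              "qi stagnation", "blood stasis", "phlegm", "wind", "heat", "cold",
              "dryness", "dampness", "unknown"]

def encode_labels(organs, attributes):
    """Return a multi-hot target of length len(ORGANS) + len(ATTRIBUTES) = 24."""
    target = torch.zeros(len(ORGANS) + len(ATTRIBUTES))
    for o in organs:
        target[ORGANS.index(o)] = 1.0
    for a in attributes:
        target[len(ORGANS) + ATTRIBUTES.index(a)] = 1.0
    return target

# Example: a sample annotated as "spleen" with "qi deficiency" and "dampness".
y = encode_labels(["spleen"], ["qi deficiency", "dampness"])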
In order to further implement the above technical solution, an internal organ attribute coding method integrating multi-modal features further includes:
s5, collecting tongue images and patient sounds, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms, and inputting the processed batch tongue image data and sound data spectrograms into a multi-mode fusion feature coding model of the internal organ attributes embedded with priori knowledge to obtain internal organ attribute tags corresponding to the tongue images and the sounds.
In order to further implement the above technical solution, the specific content of step S2 includes:
performing tongue coating target detection by using a target detection model Faster R-CNN, cutting a target area to form an image with the size of 224 multiplied by 224, scaling the image into the size of 256 multiplied by 256 by a bilinear interpolation mode, randomly cutting according to the original size of 224 multiplied by 224 to obtain an output image copy with the same size as the original size, horizontally turning the output image copy according to the probability of 0.5, and respectively carrying out normalization processing on three basic color channels of red, green and blue of the image;
and (3) adopting a voice denoising model PHASEN to perform audio denoising, using an audio and music signal processing tool Librosa to reject a mute frame, randomly intercepting sound fragments with the length of not less than 10s and not more than 90s, performing pre-emphasis, framing and windowing processing, and converting a time domain sound signal into a spectrogram through time-frequency transformation.
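As a non-limiting illustration, the two preprocessing pipelines described above could be sketched as follows with torchvision and Librosa; the crop, flip and segment-length settings follow this embodiment, while the normalization statistics and the STFT parameters (n_fft, hop length, window) are illustrative assumptions, and the PHASEN denoising step is omitted.

import numpy as np
import librosa
import torch
from PIL import Image
from torchvision import transforms

# Tongue-image pipeline: the detected 224x224 tongue region is enlarged to 256x256 by
# bilinear interpolation, randomly cropped back to 224x224, horizontally flipped with
# probability 0.5, and normalized per RGB channel (ImageNet statistics assumed).
tongue_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def preprocess_tongue(crop: Image.Image) -> torch.Tensor:
    """crop: tongue region already extracted by the detector (e.g. Faster R-CNN)."""
    return tongue_transform(crop)

def preprocess_voice(wav_path: str, sr: int = 16000,
                     min_len: float = 10.0, max_len: float = 90.0) -> np.ndarray:
    """Silence removal -> random 10-90 s segment -> pre-emphasis -> windowed STFT."""
    y, _ = librosa.load(wav_path, sr=sr)             # denoised audio assumed as input
    intervals = librosa.effects.split(y, top_db=30)  # drop silent frames
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])
    seg_len = int(np.random.uniform(min_len, max_len) * sr)
    if len(y) > seg_len:                             # randomly intercept a segment
        start = np.random.randint(0, len(y) - seg_len)
        y = y[start:start + seg_len]
    y = librosa.effects.preemphasis(y)               # pre-emphasis
    spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, window="hann"))
    return librosa.amplitude_to_db(spec, ref=np.max)  # log spectrogram (freq x time)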
In order to further implement the above technical solution, the deep neural network models in S3 include a convolutional neural network combined with an MLP-like module (CNN + MLP-like model) and a recurrent neural network combined with an MLP-like module (RNN + MLP-like model);
the batch of tongue images after data processing is taken as input image data, and the individual features of the image data are extracted with the CNN + MLP-like model:
Z_t = MLP(CNN(X_t))
the batch of spectrograms is taken as input sound data, and the individual features of the sound data are extracted with the RNN + MLP-like model:
Z_s = MLP(RNN(X_s))
where X = {X_t, X_s} is the input sample set, X_t is the tongue image, X_s is the sound, Z_t is the extracted individual feature of the tongue image, and Z_s is the extracted individual feature of the voice.
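A minimal PyTorch sketch of the two extractors Z_t = MLP(CNN(X_t)) and Z_s = MLP(RNN(X_s)) follows; the backbone depth, hidden sizes and the choice of a GRU as the recurrent unit are illustrative assumptions rather than the exact configuration of this embodiment.

import torch
import torch.nn as nn

class TongueEncoder(nn.Module):
    """CNN backbone followed by an MLP head: Z_t = MLP(CNN(X_t))."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # downsampling to a 64-d vector
        )
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(64, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, x):                             # x: (B, 3, 224, 224)
        return self.mlp(self.cnn(x))                  # Z_t: (B, feat_dim)

class VoiceEncoder(nn.Module):
    """RNN over spectrogram frames followed by an MLP head: Z_s = MLP(RNN(X_s))."""
    def __init__(self, n_freq: int = 513, feat_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_freq, hidden_size=128,
                          num_layers=2, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, spec):                          # spec: (B, T, n_freq), frames over time
        _, h = self.rnn(spec)                         # h: (num_layers, B, 128)
        return self.mlp(h[-1])                        # Z_s: (B, feat_dim)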
In order to further implement the technical scheme, the CNN + MLP-like model comprises a plurality of convolutional layers, a normalization layer, a downsampling layer and a fully connected layer;
the RNN + MLP-like model comprises a plurality of recurrent units, a normalization layer, a downsampling layer and a fully connected layer.
In this embodiment, the MLP-like module receives a three-dimensional input and consists of three branches, each responsible for encoding information along the height, width or channel dimension; the input X is encoded by the three branches into X_H, X_W and X_C respectively. Specifically:
the channel information is encoded by projecting the input X with a learnable weight matrix to obtain X_C;
the input is split into S segments along the channel dimension; each segment undergoes a height-channel permutation, the permuted segments are concatenated along the channel dimension as the output of the permutation, a learnable weight matrix is then applied to mix the height information, and finally the height-channel permutation is applied again to restore the original dimensions, yielding X_H. Similarly, to encode spatial information along the width, the second branch performs the same operations while permuting the width and channel dimensions of X, producing X_W;
three weights (A_H, A_W, A_C) are used to weight and sum the three encoded hidden variables (X_H, X_W, X_C) to obtain the output:
A = FC(X_H + X_W + X_C)
[A_H, A_W, A_C] = softmax(A)
Output = A_H ⊙ X_H + A_W ⊙ X_W + A_C ⊙ X_C
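A sketch of such a three-branch MLP-like block is given below; it follows the height/width/channel permutation and the softmax re-weighting described above, while the input resolution, the number of segments S and the channel dimension are illustrative assumptions.

import torch
import torch.nn as nn

class WeightedPermuteMLP(nn.Module):
    """Three branches mix height, width and channel information; their outputs
    X_H, X_W, X_C are recombined with per-channel softmax weights A_H, A_W, A_C."""
    def __init__(self, dim: int = 256, height: int = 14, width: int = 14, segments: int = 8):
        super().__init__()
        assert dim % segments == 0
        self.segments = segments
        s = dim // segments
        self.mlp_c = nn.Linear(dim, dim)                 # channel branch -> X_C
        self.mlp_h = nn.Linear(height * s, height * s)   # height mixing  -> X_H
        self.mlp_w = nn.Linear(width * s, width * s)     # width mixing   -> X_W
        self.reweight = nn.Linear(dim, 3 * dim)          # produces A for the softmax

    def forward(self, x):                                # x: (B, H, W, C)
        B, H, W, C = x.shape
        S = C // self.segments

        x_c = self.mlp_c(x)                              # X_C

        # Height branch: height-channel permutation per segment, mix, permute back.
        xh = x.reshape(B, H, W, self.segments, S).permute(0, 3, 2, 1, 4).reshape(B, self.segments, W, H * S)
        xh = self.mlp_h(xh).reshape(B, self.segments, W, H, S).permute(0, 3, 2, 1, 4).reshape(B, H, W, C)

        # Width branch: the same operations with the width and channel dimensions swapped.
        xw = x.reshape(B, H, W, self.segments, S).permute(0, 1, 3, 2, 4).reshape(B, H, self.segments, W * S)
        xw = self.mlp_w(xw).reshape(B, H, self.segments, W, S).permute(0, 1, 3, 2, 4).reshape(B, H, W, C)

        # A = FC(X_H + X_W + X_C); [A_H, A_W, A_C] = softmax(A); weighted sum as output.
        a = self.reweight((xh + xw + x_c).mean(dim=(1, 2)))      # (B, 3*C)
        a = a.reshape(B, C, 3).softmax(dim=-1).permute(2, 0, 1)  # (3, B, C)
        a = a.unsqueeze(2).unsqueeze(2)                          # (3, B, 1, 1, C)
        return a[0] * xh + a[1] * xw + a[2] * x_c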
In order to further implement the above technical solution, the specific content of step S4 is:
s41, calculating Euclidean distances of individual features of tongue image modes and individual features of sound modes in Euclidean space; calculating hyperbolic distances of tongue image characteristic mode individual features and voice mode individual features in the hyperbolic space; the cosine similarity of the Euclidean distance and the hyperbolic distance is calculated, namely the consistency of the representation space is shown;
s42, fusing individual characteristics of tongue image modes and individual characteristics of sound modes by using a cross-mode bridging fusion strategy, and outputting viscera attribute characteristics by adopting a Sigmoid activation function;
s43, mapping the output viscera attribute characteristics to European space and hyperbolic space respectively, and calculating cross entropy loss in each representing space, namely representing space complementarity;
s44, obtaining a loss function of the internal organ attribute by combining consistency and complementarity of the representation space, and obtaining the multi-mode fusion feature coding model by updating the loss function through model training parameters after repeated iterative training.
In order to further implement the above technical solution, the specific content indicating the consistency of the space in S41 is:
where Z_t^e is the mapping of the tongue-image modality's individual features in Euclidean space, Z_s^e is the mapping of the sound modality's individual features in Euclidean space, Z_t^h is the mapping of the tongue-image modality's individual features in hyperbolic space, and Z_s^h is the mapping of the sound modality's individual features in hyperbolic space;
d_e is the Euclidean space distance metric:
d_e(Z_t^e, Z_s^e) = ‖Z_t^e − Z_s^e‖_2
d_h is the hyperbolic space distance metric:
d_h(Z_t^h, Z_s^h) = (2/√|c|) · arctanh(√|c| · ‖(−Z_t^h) ⊕_c Z_s^h‖)
where ⊕_c is the Möbius addition and c is the space curvature constant, c < 0;
d_1 takes the cosine distance loss:
d_1(u, v) = 1 − ⟨u, v⟩ / (‖u‖ · ‖v‖)
In this embodiment, the individual features of the tongue image are mapped back and forth between the Euclidean space and the hyperbolic space, and the individual features of the sound are mapped between the two spaces in the same way;
where ⊕_c is the Möbius addition, with c < 0; in particular, when c = 0 it degenerates into ordinary addition in Euclidean space.
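The exact formulas of these mutual mappings are not reproduced above; a common choice in hyperbolic representation learning, sketched here purely as an assumption, maps Euclidean features into the Poincaré ball with the exponential map at the origin and measures distances with the Möbius-addition-based hyperbolic metric (note that the curvature is parameterized below by c > 0, whereas the text writes the curvature constant as c < 0).

import torch

def mobius_add(x, y, c: float):
    """Möbius addition on the Poincaré ball of curvature -c (c > 0 here)."""
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-7)

def expmap0(v, c: float):
    """Exponential map at the origin: Euclidean feature -> hyperbolic (Poincaré) feature."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def poincare_distance(x, y, c: float):
    """d_h(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    diff = mobius_add(-x, y, c)
    norm = diff.norm(dim=-1).clamp(max=(1 - 1e-5) / c ** 0.5)
    return 2.0 / c ** 0.5 * torch.atanh(c ** 0.5 * norm)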
In order to further implement the above technical solution, the specific content of the complementarity of the expression space in S43 is:
the Euclidean space distance similarity measure is d_2(Ŷ^e, Y^e);
the hyperbolic space structural similarity measure is d_3(Ŷ^h, Y^h);
where Ŷ^e is the mapping of the organ attribute prediction in Euclidean space and Y^e is the mapping of the organ attribute label in Euclidean space; Ŷ^h is the mapping of the organ attribute prediction in hyperbolic space and Y^h is the mapping of the organ attribute label in hyperbolic space;
d_2 takes the cross-entropy loss d_ce;
d_3 takes the hyperbolic space distance d_h,
where c is the space curvature constant, c < 0.
In order to further implement the above technical solution, the specific content of S44 is:
the consistency constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_consis = d_1( d_e(Z_t^e, Z_s^e), d_h(Z_t^h, Z_s^h) )
the complementarity constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_compl = d_2(Ŷ^e, Y^e) + d_3(Ŷ^h, Y^h)
the loss function of the multi-modal fusion feature coding model is:
L = W_ce · L_ce + W_consis · L_consis + W_compl · L_compl
where W is the weight of each sub-term; specifically, W_ce is the weight of the supervised cross-entropy term L_ce, W_consis is the weight of the consistency term L_consis, and W_compl is the weight of the complementarity term L_compl.
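Under the same caveat, the fused prediction and the combined loss could be assembled as sketched below; the cross-modal bridging fusion is approximated by concatenation followed by a fully connected layer with a Sigmoid output, the loss-term forms follow the descriptions of S41-S44, and the weights, label dimension and batch-wise cosine consistency are assumptions (expmap0 and poincare_distance are the helpers from the sketch above).

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuses tongue and voice features and outputs organ-attribute scores with Sigmoid.
    Concatenation + FC is only a stand-in for the cross-modal bridging fusion strategy."""
    def __init__(self, feat_dim: int = 256, n_labels: int = 24):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, n_labels)

    def forward(self, z_t, z_s):
        return torch.sigmoid(self.fc(torch.cat([z_t, z_s], dim=-1)))

def total_loss(pred, target, z_t, z_s, c: float = 1.0,
               w_ce: float = 1.0, w_consis: float = 0.1, w_compl: float = 0.1):
    """L = W_ce*L_ce + W_consis*L_consis + W_compl*L_compl (term forms assumed)."""
    # Supervised multi-label cross-entropy on the fused Sigmoid output.
    l_ce = F.binary_cross_entropy(pred, target)
    # Consistency: cosine distance between the batch of Euclidean inter-modal
    # distances and the batch of hyperbolic inter-modal distances.
    z_t_h, z_s_h = expmap0(z_t, c), expmap0(z_s, c)
    d_e = (z_t - z_s).norm(dim=-1)
    d_h = poincare_distance(z_t_h, z_s_h, c)
    l_consis = 1.0 - F.cosine_similarity(d_e.unsqueeze(0), d_h.unsqueeze(0)).squeeze()
    # Complementarity: supervise the prediction in both representation spaces.
    pred_h, target_h = expmap0(pred, c), expmap0(target, c)
    l_compl = F.binary_cross_entropy(pred, target) + poincare_distance(pred_h, target_h, c).mean()
    return w_ce * l_ce + w_consis * l_consis + w_compl * l_compl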
In practical application, the deep learning framework PyTorch and the model library timm are used; all experiments are run on a server with 2 NVIDIA GeForce GTX 1080 GPUs with about 12 GB of memory each, and the operating system is Ubuntu 16.04. Training uses the stochastic gradient descent algorithm SGD with the following settings: weight decay 5e-4, momentum 0.9, batch size 64, 100 training epochs in total, initial learning rate 0.01, with cosine-annealing decay of the learning rate starting from epoch 60 down to a minimum learning rate of 2e-4.
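The optimizer and learning-rate schedule above could be wired up as follows; keeping the learning rate flat until epoch 60 and then applying cosine annealing down to 2e-4 is an assumed interpretation of the stated schedule.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: torch.nn.Module):
    # Hyperparameters from this embodiment: SGD, weight decay 5e-4, momentum 0.9,
    # batch size 64, 100 epochs, initial lr 0.01, cosine annealing to 2e-4 from epoch 60.
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=40, eta_min=2e-4)  # epochs 60..100
    return optimizer, scheduler

def run_training(model, loader, criterion, epochs: int = 100, warm_epochs: int = 60):
    optimizer, scheduler = build_optimizer(model)
    for epoch in range(epochs):
        for images, specs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images, specs), targets)
            loss.backward()
            optimizer.step()
        if epoch >= warm_epochs:          # constant lr before epoch 60, then anneal
            scheduler.step()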
A viscera attribute coding system integrating multi-modal features comprises a data acquisition module, a data processing module, a multi-modal fusion feature coding model and a model construction training module;
the model building training module comprises a labeling unit, a feature extraction unit, a modal feature fusion unit and a supervised learning unit;
the data acquisition module is used for acquiring tongue images and patient sounds;
the data processing module is used for respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
the multi-mode fusion feature coding model is used for obtaining visceral organ attribute labels corresponding to tongue images and voice according to the processed batch tongue image data and voice data spectrograms;
the model construction training module is used for constructing and training to obtain a multi-mode fusion feature coding model;
the labeling unit is used for labeling and acquiring visceral organ attribute labels corresponding to tongue images and sounds, wherein the labels comprise visceral organ category labels and organ attribute labels corresponding to each visceral organ;
the feature extraction unit is used for taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and extracting the individual features of the tongue image mode and the individual features of the sound mode respectively by using the deep neural network model;
the modal feature fusion unit is used for fusing the individual features of the tongue image data and the individual features of the sound data by taking consistency and complementarity of the representation space as constraints;
and the supervised learning unit performs supervised learning by using the internal organ labels and the organ attribute labels so as to embed priori guiding knowledge of the internal organ attributes and obtain a multimodal fusion feature coding model of the internal organ attributes embedded with the priori knowledge.
In this specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts between the embodiments, reference may be made to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A viscera attribute coding method fusing multi-modal features, characterized by comprising the following steps:
s1, collecting tongue images and patient sounds, and labeling and obtaining visceral organ attribute tags corresponding to the tongue images and the sounds, wherein the tags comprise visceral organ category tags and organ attribute tags corresponding to each visceral organ;
s2, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
s3, taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and respectively extracting the individual characteristics of tongue image modes and the individual characteristics of sound modes by using a deep neural network model;
s4, taking consistency and complementarity of the representation space as constraints, carrying out multi-mode feature fusion on individual features of tongue image data and individual features of sound data, and carrying out supervised learning by using the internal organ labels and organ attribute labels so as to embed priori guiding knowledge of internal organ attributes and obtain a multi-mode fusion feature coding model of the internal organ attributes embedded with priori knowledge.
2. The method for encoding an organ attribute that incorporates multi-modal features of claim 1, further comprising:
s5, collecting tongue images and patient sounds, respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms, and inputting the processed batch tongue image data and sound data spectrograms into a multi-mode fusion feature coding model of the internal organ attributes embedded with priori knowledge to obtain internal organ attribute tags corresponding to the tongue images and the sounds.
3. The method for encoding the visceral organ attributes with the multi-modal feature fused as set forth in claim 1, wherein the specific contents of step S2 include:
performing tongue coating target detection and target region cutting on tongue image data by adopting a target detection model, expanding an image by a bilinear interpolation mode, performing random cutting according to the original size to obtain an output image copy with the same size as the original size, horizontally overturning the output image copy, and respectively performing normalization processing on three basic color channels of red, green and blue of the image;
and (3) adopting a voice denoising model to perform audio denoising, using an audio and music signal processing tool to reject a mute frame, randomly intercepting a sound fragment, performing pre-emphasis, framing and windowing processing, and converting a time domain sound signal into a spectrogram through time-frequency transformation.
4. The method for encoding the visceral organ attributes with the multi-modal feature fused according to claim 1, wherein the deep neural network model in S3 includes a convolutional neural network combination-like MLP model and a cyclic neural network combination-like MLP model;
taking the batch tongue images after data processing as input image data, and extracting the individual features of the image data by using a convolutional neural network combination type MLP model;
and taking the batch spectrograms as input sound data, and extracting the individual characteristics of the sound data by using a cyclic neural network combined class MLP model.
5. The method for encoding the visceral organ attributes with the multi-modal feature fused as claimed in claim 4, wherein the convolutional neural network combined MLP-like model comprises a plurality of convolutional layers, a normalization layer, a downsampling layer and a full connection layer;
the cyclic neural network combined MLP-like model comprises a plurality of cyclic units, a normalization layer, a downsampling layer and a full-connection layer.
6. The method for encoding the visceral organ attributes with the multi-modal feature fused as set forth in claim 1, wherein the specific contents of step S4 are:
s41, calculating Euclidean distances of individual features of tongue image modes and individual features of sound modes in Euclidean space; calculating hyperbolic distances of tongue image characteristic mode individual features and voice mode individual features in the hyperbolic space; the cosine similarity of the Euclidean distance and the hyperbolic distance is calculated, namely the consistency of the representation space is shown;
s42, fusing individual characteristics of tongue image modes and individual characteristics of sound modes by using a cross-mode bridging fusion strategy, and outputting viscera attribute characteristics by adopting a Sigmoid activation function;
s43, mapping the output viscera attribute characteristics to European space and hyperbolic space respectively, and calculating cross entropy loss in each representing space, namely representing space complementarity;
s44, obtaining a loss function of the internal organ attribute by combining consistency and complementarity of the representation space, and obtaining the multi-mode fusion feature coding model by updating the loss function through model training parameters after repeated iterative training.
7. The method for encoding an internal organ attribute with multi-modal feature fusion according to claim 6, wherein the specific contents of S41 indicating spatial consistency are:
where Z_t^e is the mapping of the tongue-image modality's individual features in Euclidean space, Z_s^e is the mapping of the sound modality's individual features in Euclidean space, Z_t^h is the mapping of the tongue-image modality's individual features in hyperbolic space, and Z_s^h is the mapping of the sound modality's individual features in hyperbolic space;
d_e is the Euclidean space distance metric:
d_e(Z_t^e, Z_s^e) = ‖Z_t^e − Z_s^e‖_2
d_h is the hyperbolic space distance metric:
d_h(Z_t^h, Z_s^h) = (2/√|c|) · arctanh(√|c| · ‖(−Z_t^h) ⊕_c Z_s^h‖)
where ⊕_c is the Möbius addition and c is the space curvature constant, c < 0;
d_1 takes the cosine distance loss:
d_1(u, v) = 1 − ⟨u, v⟩ / (‖u‖ · ‖v‖)
8. the method for encoding the attributes of the internal organs fused with the multi-modal characteristics according to claim 7, wherein the specific contents representing the complementarity of the space are:
the Euclidean space distance similarity measure is d_2(Ŷ^e, Y^e);
the hyperbolic space structural similarity measure is d_3(Ŷ^h, Y^h);
where Ŷ^e is the mapping of the organ attribute prediction in Euclidean space and Y^e is the mapping of the organ attribute label in Euclidean space; Ŷ^h is the mapping of the organ attribute prediction in hyperbolic space and Y^h is the mapping of the organ attribute label in hyperbolic space;
d_2 takes the cross-entropy loss d_ce, and d_3 takes the hyperbolic space distance d_h,
where c is the space curvature constant, c < 0.
9. The method for encoding an internal organ attribute with multi-modal feature fusion as claimed in claim 8, wherein the specific contents of S44 are:
the consistency constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_consis = d_1( d_e(Z_t^e, Z_s^e), d_h(Z_t^h, Z_s^h) )
the complementarity constraint of the modal individual features in the Euclidean space and the hyperbolic space is:
L_compl = d_2(Ŷ^e, Y^e) + d_3(Ŷ^h, Y^h)
the loss function of the multi-modal fusion feature coding model is:
L = W_ce · L_ce + W_consis · L_consis + W_compl · L_compl
where W is the weight of each sub-term; specifically, W_ce is the weight of the supervised cross-entropy term L_ce, W_consis is the weight of the consistency term L_consis, and W_compl is the weight of the complementarity term L_compl.
10. A viscera attribute coding system fusing multi-modal features, based on the viscera attribute coding method fusing multi-modal features as defined in any one of claims 1-9, characterized by comprising a data acquisition module, a data processing module, a multi-modal fusion feature coding model and a model construction training module;
the model building training module comprises a labeling unit, a feature extraction unit, a modal feature fusion unit and a supervised learning unit;
the data acquisition module is used for acquiring tongue images and patient sounds;
the data processing module is used for respectively carrying out data processing on tongue image data and patient sound data to obtain processed batch tongue image data and sound data spectrograms;
the multi-mode fusion feature coding model is used for obtaining visceral organ attribute labels corresponding to tongue images and voice according to the processed batch tongue image data and voice data spectrograms;
the model construction training module is used for constructing and training to obtain a multi-mode fusion feature coding model;
the labeling unit is used for labeling and acquiring visceral organ attribute labels corresponding to tongue images and sounds, wherein the labels comprise visceral organ category labels and organ attribute labels corresponding to each visceral organ;
the feature extraction unit is used for taking the processed batch tongue images as input image data, taking the converted batch spectrograms as input sound data, and extracting the individual features of the tongue image mode and the individual features of the sound mode respectively by using the deep neural network model;
the modal feature fusion unit is used for fusing the individual features of the tongue image data and the individual features of the sound data by taking consistency and complementarity of the representation space as constraints;
and the supervised learning unit performs supervised learning by using the internal organ labels and the organ attribute labels so as to embed priori guiding knowledge of the internal organ attributes and obtain a multimodal fusion feature coding model of the internal organ attributes embedded with the priori knowledge.
CN202310404163.5A 2023-04-17 2023-04-17 Viscera attribute coding method and system integrating multi-modal characteristics Pending CN116467675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310404163.5A CN116467675A (en) 2023-04-17 2023-04-17 Viscera attribute coding method and system integrating multi-modal characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310404163.5A CN116467675A (en) 2023-04-17 2023-04-17 Viscera attribute coding method and system integrating multi-modal characteristics

Publications (1)

Publication Number Publication Date
CN116467675A true CN116467675A (en) 2023-07-21

Family

ID=87173014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310404163.5A Pending CN116467675A (en) 2023-04-17 2023-04-17 Viscera attribute coding method and system integrating multi-modal characteristics

Country Status (1)

Country Link
CN (1) CN116467675A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117958765A (en) * 2024-04-01 2024-05-03 华南理工大学 Multi-mode voice viscera organ recognition method based on hyperbolic space alignment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination