CN117116289B - Medical intercom management system for ward and method thereof


Info

Publication number
CN117116289B
Authority
CN
China
Prior art keywords
feature
time
map
frequency
sound
Prior art date
Legal status
Active
Application number
CN202311376286.9A
Other languages
Chinese (zh)
Other versions
CN117116289A (en)
Inventor
Li Yingying (李莹莹)
Wang Li (王丽)
Gu Yue (谷玥)
Zhao Qiuyue (赵秋月)
Li Junqi (李俊琪)
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University filed Critical Jilin University
Priority to CN202311376286.9A
Publication of CN117116289A
Application granted
Publication of CN117116289B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to the field of intelligent voice management, and in particular discloses a medical intercom management system for a ward and a method thereof. Artificial intelligence technology based on a deep neural network model is used to intelligently encode and extract features from voice signals exchanged between medical staff, so that the voice signals are decoded more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate more effectively.

Description

Medical intercom management system for ward and method thereof
Technical Field
The application relates to the field of intelligent voice management, and more particularly to a medical intercom management system for a ward and a method thereof.
Background
A medical intercom management system for a ward supports communication and cooperation among medical staff within a ward of a medical institution. Such a system aims to improve communication efficiency among ward staff, strengthen work coordination, and improve the quality of patient care. Medical staff can make real-time voice calls or send text messages through the system; they may call a particular individual or group directly, or broadcast a message to the entire ward if desired. However, in the ward environment of a medical institution, various environmental sounds may exist, such as patients' breathing, equipment noise, and people talking, and these sounds can interfere with communication between healthcare workers.
Thus, an optimized medical intercom management scheme for the ward is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. Embodiments of the application provide a medical intercom management system for a ward and a method thereof that use artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from voice signals between medical staff, so as to decode the voice signals more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
According to one aspect of the present application, there is provided a medical intercom management system for a ward, comprising:
the signal acquisition module is used for acquiring a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the interphone;
the first time-frequency map conversion module is used for calculating a first time-domain enhancement map, a first SIFT-transform time-frequency map and a first S-transform time-frequency map of the first sound signal;
the first channel aggregation module is used for aggregating the first time-domain enhancement map, the first SIFT-transform time-frequency map and the first S-transform time-frequency map along the channel dimension to obtain a first multichannel time-frequency map;
the first time-frequency feature extraction module is used for passing the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix;
the second time-frequency map conversion module is used for calculating a second time-domain enhancement map, a second SIFT-transform time-frequency map and a second S-transform time-frequency map of the second sound signal;
the second channel aggregation module is used for aggregating the second time-domain enhancement map, the second SIFT-transform time-frequency map and the second S-transform time-frequency map along the channel dimension to obtain a second multichannel time-frequency map;
the second time-frequency feature extraction module is used for passing the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix;
the fusion module is used for fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix;
the optimization module is used for performing density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix;
and the result generation module is used for passing the optimized decoding feature matrix through a generator to obtain a decoded voice signal.
In the above-mentioned medical intercom management system for a ward, the first time-frequency feature extraction module is configured to:
each layer of the first convolutional neural network model serving as the feature extractor performs the following operations on its input data during the forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling along the channel dimension on the convolution feature map to obtain a pooled feature map;
performing nonlinear activation on the pooled feature map to obtain an activated feature map;
wherein the output of the last layer of the first convolutional neural network model serving as the feature extractor is the first sound feature matrix, and the input of the first layer of the first convolutional neural network model serving as the feature extractor is the first multichannel time-frequency map.
In the above-mentioned medical intercom management system for a ward, the second time-frequency feature extraction module is configured to:
each layer of the second convolutional neural network model serving as the feature extractor performs the following operations on its input data during the forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling along the channel dimension on the convolution feature map to obtain a pooled feature map;
performing nonlinear activation on the pooled feature map to obtain an activated feature map;
wherein the output of the last layer of the second convolutional neural network model serving as the feature extractor is the second sound feature matrix, and the input of the first layer of the second convolutional neural network model serving as the feature extractor is the second multichannel time-frequency map.
In the medical intercom management system for the ward, the fusion module is used for:
fusing the first sound feature matrix and the second sound feature matrix with the following fusion formula to obtain the decoding feature matrix:

$M_d = \lambda M_1 \oplus (1 - \lambda) M_2$

wherein $M_d$ is the decoding feature matrix, $M_1$ is the first sound feature matrix, $M_2$ is the second sound feature matrix, "$\oplus$" means that the elements at corresponding positions of the first sound feature matrix and the second sound feature matrix are added, and $\lambda$ is a weighting parameter for controlling the balance between the first sound feature matrix and the second sound feature matrix (the $(\lambda, 1-\lambda)$ convex split is one consistent reading of this single balance parameter).
In the above-mentioned medical care intercom management system for a ward, the optimization module includes:
the block segmentation unit is used for performing block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices;
the mean-pooling unit is used for performing mean pooling on the plurality of decoding sub-block feature matrices respectively to obtain a plurality of decoding sub-block global semantic feature vectors;
the per-position mean unit is used for calculating a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector;
the relative density unit is used for calculating the cross entropy between each decoding sub-block global semantic feature vector among the plurality of decoding sub-block global semantic feature vectors and the decoded global semantic pivot feature vector to obtain a local feature distribution relative density semantic feature vector composed of a plurality of cross entropy values;
the activation unit is used for inputting the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector;
the weighting unit is used for weighting each decoding sub-block feature matrix by the feature value of the corresponding position in the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices;
and the splicing unit is used for splicing the plurality of weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
According to another aspect of the present application, there is also provided a medical intercom management method for a ward, including:
acquiring a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the interphone;
calculating a first time-domain enhancement map, a first SIFT-transform time-frequency map and a first S-transform time-frequency map of the first sound signal;
aggregating the first time-domain enhancement map, the first SIFT-transform time-frequency map and the first S-transform time-frequency map along a channel dimension to obtain a first multichannel time-frequency map;
passing the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix;
calculating a second time-domain enhancement map, a second SIFT-transform time-frequency map and a second S-transform time-frequency map of the second sound signal;
aggregating the second time-domain enhancement map, the second SIFT-transform time-frequency map and the second S-transform time-frequency map along a channel dimension to obtain a second multichannel time-frequency map;
passing the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix;
fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix;
performing density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix;
and passing the optimized decoding feature matrix through a generator to obtain a decoded voice signal.
Compared with the prior art, the medical intercom management system for the ward and the method thereof provided by the application use artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from voice signals between medical staff, so that the voice signals are decoded more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following detailed description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application, are incorporated in and constitute a part of this specification, illustrate the application, and do not limit it. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 illustrates a block diagram of a medical intercom management system for a ward according to an embodiment of the present application.
Fig. 2 illustrates a system architecture diagram of a medical intercom management system for a ward according to an embodiment of the present application.
Fig. 3 illustrates a flow chart of a medical intercom management method for a ward according to an embodiment of the present application.
Fig. 4 illustrates a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some, rather than all, of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above in the background art, various environmental sounds may exist in the ward environment of a medical institution, such as patients' breathing, equipment noise, and people talking. These environmental sounds can interfere with communication between healthcare workers. Thus, an optimized medical intercom management scheme for the ward is desired.
To address these technical problems, an optimized medical intercom management scheme for the ward is proposed that uses artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from voice signals between medical staff, so that the voice signals are decoded more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
At present, deep learning and neural networks have been widely used in the fields of computer vision, natural language processing, speech signal processing, and the like. Moreover, deep learning and neural networks have shown performance approaching and even exceeding the human level in image classification, object detection, semantic segmentation, text translation, and other fields.
In recent years, the development of deep learning and neural networks has provided new solutions and schemes for medical intercom management in the ward.
Specifically, a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the interphone are first acquired. In practical use, speech is often corrupted by surrounding noise, which seriously degrades call quality and the listening experience. Some existing single-channel speech enhancement and noise-reduction algorithms can eliminate noise to a certain extent, but they find it difficult to balance noise reduction against speech quality: noise reduction comes at the cost of damaged speech quality, and the more noise is eliminated, the more seriously the speech quality is damaged. In this technical scheme, the interphone therefore adopts a sensor array; since cost, volume and power consumption increase with the number of sensors, a two-sensor system is a compromise choice. The two audio receivers enhance the ward medical intercom system's ability to perceive sound signals and improve the quality and accuracy of those signals, thereby improving communication and cooperation among medical staff.
Then, a time-domain enhancement map, a SIFT-transform time-frequency map, and an S-transform time-frequency map are calculated for each of the first sound signal and the second sound signal. The time-domain enhancement map is an enhanced representation of the sound signal in the time domain; computing it highlights temporal characteristics of the signal such as the fluctuation, intensity and variation of the sound. SIFT (Scale-Invariant Feature Transform) is a method of extracting image features at different scales; in the context of sound signals, the SIFT transform can be applied to a time-frequency representation in which the horizontal axis represents time, the vertical axis represents frequency, and the pixel values represent the energy or intensity of the sound. The S transform is a time-frequency analysis method that expresses a signal as a function of time and frequency; the S-transform time-frequency map helps analyze the spectral characteristics and temporal evolution of the sound signal. By calculating the time-domain enhancement map, the SIFT-transform time-frequency map and the S-transform time-frequency map, the characteristics of the sound signal can be described from different angles and representations.
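To make these three representations concrete, the following minimal Python sketch shows one way they could be computed and then stacked along a channel dimension. The patent does not give formulas for the time-domain enhancement map or specify how SIFT is applied to audio, so the Hilbert-envelope framing, the STFT magnitude used as the image a SIFT stage would run on, and the FFT-based discrete Stockwell (S) transform below are illustrative assumptions rather than the patented implementation.

```python
import numpy as np
from scipy.signal import hilbert, stft

def time_domain_enhancement_map(x, frame=256, hop=128):
    """Frame the Hilbert amplitude envelope into a 2-D image
    (one assumed realization of the 'time-domain enhancement map')."""
    env = np.abs(hilbert(x))
    n = 1 + (len(env) - frame) // hop
    return np.stack([env[i * hop : i * hop + frame] for i in range(n)], axis=1)

def stft_magnitude(x, fs, nperseg=256, noverlap=128):
    """STFT magnitude: a time-frequency image on which a SIFT
    keypoint/descriptor stage could then be run."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.abs(Z)

def stockwell_transform(x):
    """FFT-based discrete S transform (positive frequencies only)."""
    N = len(x)
    X = np.fft.fft(x)
    m = np.fft.fftfreq(N) * N          # symmetric integer frequency offsets
    S = np.zeros((N // 2, N), dtype=complex)
    S[0, :] = np.mean(x)               # zero-frequency row is the signal mean
    for k in range(1, N // 2):
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / k**2)   # frequency-domain Gaussian
        S[k, :] = np.fft.ifft(np.roll(X, -k) * gauss)
    return np.abs(S)

# Channel aggregation: resize the three maps to a common (H, W) in a real
# pipeline, then stack them along a new channel dimension:
#   multi_channel = np.stack([tdm, sift_map, s_map], axis=0)   # (3, H, W)
```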
Then, in order to comprehensively utilize the information carried by these different features, the time-domain enhancement map, the SIFT-transform time-frequency map and the S-transform time-frequency map are aggregated along the channel dimension to obtain the first multichannel time-frequency map and the second multichannel time-frequency map. Different feature maps reflect different aspects of the sound signal: the time-domain enhancement map reflects its temporal features, while the SIFT-transform and S-transform time-frequency maps reflect its frequency and time-varying features. Aggregating them leverages all of this information to provide a more comprehensive and richer description of the time-frequency characteristics. At the same time, if one feature is affected by noise or interference in some situations, the others may still perform well; aggregation therefore mitigates the interference affecting individual features and improves the stability and reliability of the overall feature.
The first multichannel time-frequency map is then passed through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix. A convolutional neural network (CNN) is a deep learning model widely used in image and audio processing; through multi-layer convolution and pooling operations, it can automatically learn and extract features from input data. Feeding the multichannel time-frequency map into a convolutional neural network model allows the network to automatically learn the spatial structure and temporal patterns of the time-frequency features: convolution layers capture features at different scales, while pooling layers reduce dimensionality and abstract the features to extract higher-level representations. Layer by layer, the network gradually extracts more abstract, semantically meaningful features.
The second multichannel time-frequency map is then passed through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix. The first and second multichannel time-frequency maps are fed into separate convolutional neural network models, so that each model can focus on learning and extracting the feature representation of its own sound signal.
Then, the first sound feature matrix and the second sound feature matrix are fused to obtain a decoding feature matrix. The two matrices capture the feature information of the two sound signals respectively; fusing them comprehensively utilizes their information and provides a richer, more diverse feature representation. Different sound signals may be affected by different noise, interference or variations, so fusing different sound feature matrices also enhances the stability and robustness of the features: if one feature is disturbed in some situations, the others may still perform well, improving the reliability of the overall feature.
Further, the decoding feature matrix is passed through a generator to obtain a decoded speech signal. In this technical scheme, the generator is a neural network model that receives the decoding feature matrix as input and, by learning and simulating the generation process of speech signals, converts the feature matrix into the corresponding decoded speech signal. The generator thus realizes a reverse mapping from the abstract feature space to a concrete sound waveform, restoring the details and characteristics of the original sound signal so that it becomes audible. During feature extraction and encoding, some information may be lost or compressed; the generator can attempt to recover this lost information, making the decoded speech signal closer to the original sound signal and improving the accuracy and integrity of the recovery.
In particular, note that the decoding feature matrix is extracted by different feature extractors. These extractors may use different convolutional neural network models or other feature extraction methods, which differ in how they process the input data and in their feature representation capabilities. Different feature extractors may therefore extract features with different distributions, so the feature distribution of the decoding feature matrix exhibits heterogeneity. Moreover, generating the decoding feature matrix involves fusing multi-channel features, and the features of different channels may have different distribution patterns and statistical properties. During fusion, the features of different channels may be non-uniform, i.e., the features of some channels occupy a larger proportion of the overall feature matrix while those of others occupy a smaller proportion. Such non-uniformity can lead to an imbalance in the feature distribution of the decoding feature matrix.
Due to this spatial heterogeneity and non-uniformity of the decoding feature matrix's feature distribution, the decoded speech signal produced from the generator's input may exhibit a class probability domain offset: the generator deviates in the probability distribution of decoded speech signals of different classes, which may degrade the speech it generates for certain classes. To solve this problem, density-domain probabilitization based on local feature distribution is applied to the decoding feature matrix to obtain an optimized decoding feature matrix, improving the structural rationality and robustness of its feature expression.
Specifically, performing density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix includes: performing block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices; performing mean pooling on the plurality of decoding sub-block feature matrices respectively to obtain a plurality of decoding sub-block global semantic feature vectors; calculating a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector; calculating the cross entropy between each decoding sub-block global semantic feature vector and the decoded global semantic pivot feature vector to obtain a local feature distribution relative density semantic feature vector composed of a plurality of cross entropy values; inputting the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector; weighting each decoding sub-block feature matrix by the feature value of the corresponding position in the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices; and splicing the weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
That is, the decoding feature matrix is first partitioned into spatial blocks to obtain a plurality of decoding sub-block feature matrices, and each is mean-pooled into a decoding sub-block global semantic feature vector. The global per-position mean vector of these vectors serves as the class center of the feature distribution of the sub-block feature matrices. The cross entropy between each decoding sub-block global semantic feature vector and the decoded global semantic pivot feature vector then measures the spatial consistency and offset of each sub-block's feature distribution relative to this global class center. The local feature distribution relative density semantic feature vector composed of these cross entropy values is probabilitized by a Softmax activation function, and the feature values at each position of the resulting probabilistic feature vector are used to weight the corresponding decoding sub-block feature matrices. This applies a feature distribution correction based on spatial distribution consistency to every local feature matrix of the decoding feature matrix, improving the structural rationality and robustness of its feature expression.
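A minimal PyTorch sketch of this correction, assuming a square decoding feature matrix split on a regular grid, is as follows. Since the patent does not define the exact cross-entropy normalization, softmax-normalizing both each sub-block vector and the pivot vector before taking the cross entropy is an assumption, as are the grid size and the row-wise mean pooling that turns each sub-block matrix into a vector.

```python
import torch
import torch.nn.functional as F

def density_domain_probabilitize(M, grid=4, eps=1e-8):
    """Density-domain probabilitization based on local feature distribution
    (sketch). M: decoding feature matrix (H, W), H and W divisible by grid."""
    H, W = M.shape
    bh, bw = H // grid, W // grid
    # 1. block segmentation -> (grid*grid, bh, bw) sub-block feature matrices
    blocks = (M.reshape(grid, bh, grid, bw).permute(0, 2, 1, 3)
                .reshape(grid * grid, bh, bw))
    # 2. mean pooling of each sub-block -> global semantic feature vectors
    vecs = blocks.mean(dim=2)                       # (grid*grid, bh)
    # 3. global per-position mean -> decoded global semantic pivot vector
    pivot = vecs.mean(dim=0)                        # (bh,)
    # 4. cross entropy of each sub-block vector against the pivot
    p = F.softmax(pivot, dim=0)
    q = F.softmax(vecs, dim=1)
    ce = -(p.unsqueeze(0) * torch.log(q + eps)).sum(dim=1)   # (grid*grid,)
    # 5. Softmax over the cross-entropy values -> relative-density probabilities
    w = F.softmax(ce, dim=0)
    # 6. weight each sub-block, then 7. splice the blocks back together
    weighted = blocks * w.view(-1, 1, 1)
    return (weighted.reshape(grid, grid, bh, bw).permute(0, 2, 1, 3)
                    .reshape(H, W))
```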
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary System
Fig. 1 illustrates a block diagram of a medical intercom management system for a ward according to an embodiment of the present application. As shown in fig. 1, the medical intercom management system 100 for a ward according to an embodiment of the present application includes: a signal acquisition module 110 configured to acquire a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom; a first time-frequency map conversion module 120 configured to calculate a first time-domain enhancement map, a first SIFT-transform time-frequency map, and a first S-transform time-frequency map of the first sound signal; a first channel aggregation module 130 configured to aggregate the first time-domain enhancement map, the first SIFT-transform time-frequency map, and the first S-transform time-frequency map along a channel dimension to obtain a first multichannel time-frequency map; a first time-frequency feature extraction module 140 configured to pass the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix; a second time-frequency map conversion module 150 configured to calculate a second time-domain enhancement map, a second SIFT-transform time-frequency map, and a second S-transform time-frequency map of the second sound signal; a second channel aggregation module 160 configured to aggregate the second time-domain enhancement map, the second SIFT-transform time-frequency map, and the second S-transform time-frequency map along a channel dimension to obtain a second multichannel time-frequency map; a second time-frequency feature extraction module 170 configured to pass the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix; a fusion module 180 configured to fuse the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix; an optimization module 190 configured to perform density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix; and a result generation module 200 configured to pass the optimized decoding feature matrix through a generator to obtain a decoded speech signal.
Fig. 2 illustrates a system architecture diagram of a medical intercom management system for a ward according to an embodiment of the present application. As shown in fig. 2, in the system architecture, first, a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom are acquired. Then, a first time-domain enhancement map, a first SIFT-transform time-frequency map, and a first S-transform time-frequency map of the first sound signal are calculated and aggregated along a channel dimension to obtain a first multichannel time-frequency map, which is passed through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix. Likewise, a second time-domain enhancement map, a second SIFT-transform time-frequency map, and a second S-transform time-frequency map of the second sound signal are calculated and aggregated along a channel dimension to obtain a second multichannel time-frequency map, which is passed through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix. The first and second sound feature matrices are then fused to obtain a decoding feature matrix, which undergoes density-domain probabilitization based on local feature distribution to obtain an optimized decoding feature matrix. Further, the optimized decoding feature matrix is passed through a generator to obtain a decoded speech signal.
In the above-mentioned medical intercom management system 100 for a ward, the signal acquisition module 110 is configured to acquire a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom. As described above in the background art, various environmental sounds may exist in the ward environment of a medical institution, such as patients' breathing, equipment noise, and people talking, and these sounds can interfere with communication between healthcare workers. Thus, an optimized medical intercom management scheme for the ward is desired.
To address these technical problems, an optimized medical intercom management scheme for the ward is proposed that uses artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from voice signals between medical staff, so that the voice signals are decoded more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
At present, deep learning and neural networks have been widely used in the fields of computer vision, natural language processing, speech signal processing, and the like. Moreover, deep learning and neural networks have shown performance approaching and even exceeding the human level in image classification, object detection, semantic segmentation, text translation, and other fields.
In recent years, the development of deep learning and neural networks has provided new solutions and schemes for medical intercom management in the ward.
Specifically, a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom are first acquired. In practical use, speech is often corrupted by surrounding noise, which seriously degrades call quality and the listening experience. Some existing single-channel speech enhancement and noise-reduction algorithms can eliminate noise to a certain extent, but they find it difficult to balance noise reduction against speech quality: noise reduction comes at the cost of damaged speech quality, and the more noise is eliminated, the more seriously the speech quality is damaged. In this technical scheme, the intercom therefore adopts a sensor array; since cost, volume and power consumption increase with the number of sensors, a two-sensor system is a compromise choice. The two audio receivers enhance the ward medical intercom system's ability to perceive sound signals and improve the quality and accuracy of those signals, thereby improving communication and cooperation among medical staff.
In the above-mentioned medical intercom management system 100 for a ward, the first time-frequency map conversion module 120 is configured to calculate a first time-domain enhancement map, a first SIFT-transform time-frequency map, and a first S-transform time-frequency map of the first sound signal. The time-domain enhancement map is an enhanced representation of the sound signal in the time domain; computing it highlights temporal characteristics of the signal such as the fluctuation, intensity and variation of the sound. SIFT (Scale-Invariant Feature Transform) is a method of extracting image features at different scales; in the context of sound signals, the SIFT transform can be applied to a time-frequency representation in which the horizontal axis represents time, the vertical axis represents frequency, and the pixel values represent the energy or intensity of the sound. The S transform is a time-frequency analysis method that expresses a signal as a function of time and frequency; the S-transform time-frequency map helps analyze the spectral characteristics and temporal evolution of the sound signal. By calculating the time-domain enhancement map, the SIFT-transform time-frequency map and the S-transform time-frequency map, the characteristics of the sound signal can be described from different angles and representations.
In the above-mentioned medical intercom management system 100 for a ward, the first channel aggregation module 130 is configured to aggregate the first time-domain enhancement map, the first SIFT-transform time-frequency map, and the first S-transform time-frequency map along a channel dimension to obtain a first multichannel time-frequency map. Different feature maps reflect different aspects of the sound signal: the time-domain enhancement map reflects its temporal features, while the SIFT-transform and S-transform time-frequency maps reflect its frequency and time-varying features. Aggregating them leverages all of this information to provide a more comprehensive and richer description of the time-frequency characteristics. At the same time, if one feature is affected by noise or interference in some situations, the others may still perform well; aggregation therefore mitigates the interference affecting individual features and improves the stability and reliability of the overall feature.
In the above-mentioned medical intercom management system 100 for a ward, the first time-frequency feature extraction module 140 is configured to pass the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix. A convolutional neural network (CNN) is a deep learning model widely used in image and audio processing; through multi-layer convolution and pooling operations, it can automatically learn and extract features from input data. Feeding the multichannel time-frequency map into a convolutional neural network model allows the network to automatically learn the spatial structure and temporal patterns of the time-frequency features: convolution layers capture features at different scales, while pooling layers reduce dimensionality and abstract the features to extract higher-level representations. Layer by layer, the network gradually extracts more abstract, semantically meaningful features.
Specifically, in the embodiment of the present application, the first time-frequency feature extraction module 140 is configured such that each layer of the first convolutional neural network model serving as the feature extractor performs the following operations on its input data during the forward pass: performing convolution processing on the input data to obtain a convolution feature map; performing pooling along the channel dimension on the convolution feature map to obtain a pooled feature map; and performing nonlinear activation on the pooled feature map to obtain an activated feature map; wherein the output of the last layer of the first convolutional neural network model serving as the feature extractor is the first sound feature matrix, and the input of the first layer is the first multichannel time-frequency map.
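A hedged PyTorch sketch of one such layer stack follows. The per-layer operations (convolution, pooling along the channel dimension, nonlinear activation) match the description above, while the number of layers, the channel widths, the kernel sizes, the factor-2 channel max-pool, and the final channel mean that collapses the output to a single sound feature matrix are illustrative assumptions not taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelPool(nn.Module):
    """Pooling along the channel dimension (factor-2 max), matching the
    per-layer description; the factor is an illustrative assumption."""
    def forward(self, x):                       # x: (B, C, H, W), C even
        b, c, h, w = x.shape
        return x.view(b, c // 2, 2, h, w).amax(dim=2)

class TFFeatureExtractor(nn.Module):
    """Feature extractor over a 3-channel multichannel time-frequency map.
    Depth, widths and kernel sizes are assumptions, not from the patent."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  ChannelPool(), nn.ReLU(),
            nn.Conv2d(16, 64, kernel_size=3, padding=1), ChannelPool(), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), ChannelPool(), nn.ReLU(),
        )
    def forward(self, x):                       # x: (B, 3, H, W)
        return self.layers(x).mean(dim=1)       # sound feature matrix (B, H, W)

# e.g. TFFeatureExtractor()(torch.randn(1, 3, 128, 128)).shape -> (1, 128, 128)
```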
In the above-mentioned medical intercom management system 100 for a ward, the second time-frequency map conversion module 150 is configured to calculate a second time-domain enhancement map, a second SIFT-transform time-frequency map, and a second S-transform time-frequency map of the second sound signal, extracting characteristics of the second sound signal from different angles and representations to facilitate subsequent analysis.
In the above-mentioned medical intercom management system 100 for a ward, the second channel aggregation module 160 is configured to aggregate the second time-domain enhancement map, the second SIFT-transform time-frequency map, and the second S-transform time-frequency map along a channel dimension to obtain a second multichannel time-frequency map. To obtain more comprehensive information, the processing mirrors that of the first channel aggregation module.
In the above-mentioned medical intercom management system 100 for a ward, the second time-frequency feature extraction module 170 is configured to pass the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix. The first and second multichannel time-frequency maps are fed into separate convolutional neural network models, so that each model can focus on learning and extracting the feature representation of its own sound signal.
Specifically, in the embodiment of the present application, the second time-frequency feature extraction module 170 is configured such that each layer of the second convolutional neural network model serving as the feature extractor performs the following operations on its input data during the forward pass: performing convolution processing on the input data to obtain a convolution feature map; performing pooling along the channel dimension on the convolution feature map to obtain a pooled feature map; and performing nonlinear activation on the pooled feature map to obtain an activated feature map; wherein the output of the last layer of the second convolutional neural network model serving as the feature extractor is the second sound feature matrix, and the input of the first layer is the second multichannel time-frequency map.
In the above-mentioned medical intercom management system 100 for a ward, the fusion module 180 is configured to fuse the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix. The two matrices capture the feature information of the two sound signals respectively; fusing them comprehensively utilizes their information and provides a richer, more diverse feature representation. Different sound signals may be affected by different noise, interference or variations, so fusing different sound feature matrices also enhances the stability and robustness of the features: if one feature is disturbed in some situations, the others may still perform well, improving the reliability of the overall feature.
Specifically, in the embodiment of the present application, the fusion module 180 is configured to fuse the first sound feature matrix and the second sound feature matrix with the following fusion formula to obtain the decoding feature matrix:

$M_d = \lambda M_1 \oplus (1 - \lambda) M_2$

wherein $M_d$ is the decoding feature matrix, $M_1$ is the first sound feature matrix, $M_2$ is the second sound feature matrix, "$\oplus$" means that the elements at corresponding positions of the first sound feature matrix and the second sound feature matrix are added, and $\lambda$ is a weighting parameter for controlling the balance between the first sound feature matrix and the second sound feature matrix (the $(\lambda, 1-\lambda)$ convex split is one consistent reading of this single balance parameter).
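In code, this fusion reduces to a single position-wise weighted sum; the sketch below assumes the convex (lam, 1-lam) reading of the balance parameter:

```python
import torch

def fuse(M1: torch.Tensor, M2: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Position-wise weighted fusion of the two sound feature matrices."""
    return lam * M1 + (1.0 - lam) * M2   # the (lam, 1-lam) split is assumed
```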
In the above-mentioned medical intercom management system 100 for a ward, the optimization module 190 is configured to perform density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix. Note that the decoding feature matrix is extracted by different feature extractors, which may use different convolutional neural network models or other feature extraction methods that differ in how they process the input data and in their feature representation capabilities. Different feature extractors may therefore extract features with different distributions, so the feature distribution of the decoding feature matrix exhibits heterogeneity. Moreover, generating the decoding feature matrix involves fusing multi-channel features, and the features of different channels may have different distribution patterns and statistical properties. During fusion, the features of different channels may be non-uniform, i.e., the features of some channels occupy a larger proportion of the overall feature matrix while those of others occupy a smaller proportion. Such non-uniformity can lead to an imbalance in the feature distribution of the decoding feature matrix.
Due to this spatial heterogeneity and non-uniformity of the decoding feature matrix's feature distribution, the decoded speech signal produced from the generator's input may exhibit a class probability domain offset: the generator deviates in the probability distribution of decoded speech signals of different classes, which may degrade the speech it generates for certain classes. To solve this problem, density-domain probabilitization based on local feature distribution is applied to the decoding feature matrix to obtain an optimized decoding feature matrix, improving the structural rationality and robustness of its feature expression.
Specifically, performing density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix includes: performing block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices; performing mean pooling on the plurality of decoding sub-block feature matrices respectively to obtain a plurality of decoding sub-block global semantic feature vectors; calculating a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector; calculating the cross entropy between each decoding sub-block global semantic feature vector and the decoded global semantic pivot feature vector to obtain a local feature distribution relative density semantic feature vector composed of a plurality of cross entropy values; inputting the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector; weighting each decoding sub-block feature matrix by the feature value of the corresponding position in the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices; and splicing the weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
That is, the decoding feature matrix is first partitioned into spatial blocks to obtain a plurality of decoding sub-block feature matrices, and each is mean-pooled into a decoding sub-block global semantic feature vector. The global per-position mean vector of these vectors serves as the class center of the feature distribution of the sub-block feature matrices. The cross entropy between each decoding sub-block global semantic feature vector and the decoded global semantic pivot feature vector then measures the spatial consistency and offset of each sub-block's feature distribution relative to this global class center. The local feature distribution relative density semantic feature vector composed of these cross entropy values is probabilitized by a Softmax activation function, and the feature values at each position of the resulting probabilistic feature vector are used to weight the corresponding decoding sub-block feature matrices. This applies a feature distribution correction based on spatial distribution consistency to every local feature matrix of the decoding feature matrix, improving the structural rationality and robustness of its feature expression.
Specifically, in the embodiment of the present application, the optimization module 190 includes: a block segmentation unit configured to perform block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices; a mean-pooling unit configured to perform mean pooling on the plurality of decoding sub-block feature matrices respectively to obtain a plurality of decoding sub-block global semantic feature vectors; a per-position mean unit configured to calculate a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector; a relative density unit configured to calculate the cross entropy between each decoding sub-block global semantic feature vector and the decoded global semantic pivot feature vector to obtain a local feature distribution relative density semantic feature vector composed of a plurality of cross entropy values; an activation unit configured to input the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector; a weighting unit configured to weight each decoding sub-block feature matrix by the feature value of the corresponding position in the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices; and a splicing unit configured to splice the plurality of weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
In the above-mentioned medical intercom management system 100 for a ward, the result generation module 200 is configured to pass the optimized decoding feature matrix through a generator to obtain a decoded speech signal. In this technical scheme, the generator is a neural network model that receives the decoding feature matrix as input and, by learning and simulating the generation process of speech signals, converts the feature matrix into the corresponding decoded speech signal, realizing a reverse mapping from the abstract feature space to a concrete sound waveform. In this way, the details and characteristics of the original sound signal can be restored so that it becomes audible. During feature extraction and encoding, some information may be lost or compressed; the generator can attempt to recover this lost information, making the decoded speech signal closer to the original sound signal and improving the accuracy and integrity of the recovery.
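Since the patent does not disclose the generator's architecture, the following is only a minimal sketch: a stack of 1-D transposed convolutions that upsamples the optimized decoding feature matrix, treated as feat_h channels over W time steps, into a waveform. Every layer choice and size here is an assumption.

```python
import torch
import torch.nn as nn

class WaveformGenerator(nn.Module):
    """Generator sketch: upsample the optimized decoding feature matrix
    (treated as feat_h channels over W time steps) to a 1-D waveform."""
    def __init__(self, feat_h=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(feat_h, 64, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 16, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=16, stride=8, padding=4),
            nn.Tanh(),                          # waveform constrained to [-1, 1]
        )
    def forward(self, M):                       # M: (B, feat_h, W)
        return self.net(M).squeeze(1)           # (B, T) decoded waveform

# Each stage multiplies the length by 8: (1, 128, 40) -> (1, 20480) samples.
```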
In summary, the medical intercom management system for a ward according to the embodiment of the present application has been elucidated. It uses artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from voice signals between medical staff, so as to decode the voice signals more accurately. A medical intercom management scheme for the ward is thereby constructed that acquires voice signals between medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
Exemplary method
Fig. 3 illustrates a flow chart of a medical intercom management method for a ward according to an embodiment of the present application. As shown in fig. 3, the medical intercom management method for a ward according to an embodiment of the present application includes the steps of: S110, acquiring a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the interphone; S120, calculating a first time-domain enhancement map, a first SIFT-transform time-frequency map and a first S-transform time-frequency map of the first sound signal; S130, aggregating the first time-domain enhancement map, the first SIFT-transform time-frequency map and the first S-transform time-frequency map along a channel dimension to obtain a first multichannel time-frequency map; S140, passing the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix; S150, calculating a second time-domain enhancement map, a second SIFT-transform time-frequency map and a second S-transform time-frequency map of the second sound signal; S160, aggregating the second time-domain enhancement map, the second SIFT-transform time-frequency map and the second S-transform time-frequency map along a channel dimension to obtain a second multichannel time-frequency map; S170, passing the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix; S180, fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix; S190, performing density-domain probabilitization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix; and S200, passing the optimized decoding feature matrix through a generator to obtain a decoded speech signal.
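For orientation only, the sketch below chains the illustrative helpers defined earlier (time_domain_enhancement_map, stft_magnitude, stockwell_transform, TFFeatureExtractor, fuse, density_domain_probabilitize, WaveformGenerator) in the order of steps S110 to S200. The resizing of the three maps to a common shape via skimage and all tensor shapes are assumptions, and the untrained modules serve purely to show the data flow.

```python
import numpy as np
import torch
from skimage.transform import resize   # any image-resize routine would do

def decode_intercom_speech(x1, x2, fs, size=128):
    """End-to-end data-flow sketch of steps S110 to S200, chaining the
    illustrative helpers defined earlier (untrained, for shape-checking only)."""
    def multi_channel_map(x):
        maps = [time_domain_enhancement_map(x),    # time-domain enhancement map
                stft_magnitude(x, fs),             # base image for a SIFT stage
                stockwell_transform(x)]            # S-transform time-frequency map
        maps = [resize(m, (size, size)) for m in maps]
        return torch.tensor(np.stack(maps, axis=0), dtype=torch.float32)

    tf1 = multi_channel_map(x1).unsqueeze(0)       # first multichannel map
    tf2 = multi_channel_map(x2).unsqueeze(0)       # second multichannel map
    M1 = TFFeatureExtractor()(tf1)[0]              # first sound feature matrix
    M2 = TFFeatureExtractor()(tf2)[0]              # second sound feature matrix
    Md = fuse(M1, M2, lam=0.5)                     # decoding feature matrix
    Mo = density_domain_probabilitize(Md, grid=4)  # optimized decoding matrix
    gen = WaveformGenerator(feat_h=Mo.shape[0])
    return gen(Mo.unsqueeze(0))                    # decoded speech waveform
```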
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above medical intercom management method for a ward have been described in detail in the above description of the medical intercom management system for a ward with reference to Figs. 1 and 2, and repeated descriptions thereof are therefore omitted.
As described above, the medical intercom management system 100 for a ward according to the embodiments of the present application can be implemented in various terminal devices, such as a server for ward medical intercom management. In one example, the medical intercom management system 100 may be integrated into the terminal device as a software module and/or a hardware module. For example, the system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the system 100 may equally be one of the many hardware modules of the terminal device.
Alternatively, in another example, the medical intercom management system 100 and the terminal device may be separate devices, in which case the system 100 may be connected to the terminal device through a wired and/or wireless network and transmit interactive information in an agreed data format.
In summary, the medical intercom management method for a ward according to the embodiments of the present application has been described. It uses artificial intelligence technology based on a deep neural network model to intelligently encode and extract features from the voice signals exchanged between medical staff, so that those signals can be decoded more accurately. A medical intercom management scheme for the ward is thereby constructed to acquire voice signals among medical staff, improving communication quality and user experience in the ward environment and enabling medical staff to communicate and cooperate better.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present application is described with reference to Fig. 4. Fig. 4 is a block diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 4, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a central processing unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the functions of the medical intercom management method for a ward of the various embodiments of the present application described above, and/or other desired functions. Various contents, such as the first sound signal of the first audio receiver and the second sound signal of the second audio receiver of the intercom, may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information to the outside, including the decoded voice signal and the like. The output device 14 may include, for example, a display, speakers, a printer, a communication network, and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 10 that are relevant to the present application are shown in Fig. 4; components such as buses and input/output interfaces are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the present application may also take the form of a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the medical intercom management method for a ward according to the various embodiments of the present application described in the "Exemplary method" section of this specification.
The computer program product may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also take the form of a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps of the medical intercom management method for a ward according to the various embodiments of the present application described in the "Exemplary method" section above.
The computer-readable storage medium may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in the present application are merely examples and not limitations; they are not to be considered essential to the various embodiments of the present application. Furthermore, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, and the application is not limited to implementation with those specific details.
The block diagrams of devices, apparatuses, and systems referred to in this application are only illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, such devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms that mean "including but not limited to" and may be used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly dictates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the apparatus, devices, and methods of the present application, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be considered equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A medical intercom management system for a ward, comprising:
the signal acquisition module is used for acquiring a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom;
the first time-frequency map conversion module is used for calculating a first time domain enhancement map, a first SIFT-transform time-frequency map and a first S-transform time-frequency map of the first sound signal;
the first channel aggregation module is used for aggregating the first time domain enhancement map, the first SIFT-transform time-frequency map and the first S-transform time-frequency map along the channel dimension to obtain a first multichannel time-frequency map;
the first time-frequency feature extraction module is used for passing the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix;
the second time-frequency map conversion module is used for calculating a second time domain enhancement map, a second SIFT-transform time-frequency map and a second S-transform time-frequency map of the second sound signal;
the second channel aggregation module is used for aggregating the second time domain enhancement map, the second SIFT-transform time-frequency map and the second S-transform time-frequency map along the channel dimension to obtain a second multichannel time-frequency map;
the second time-frequency feature extraction module is used for passing the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix;
the fusion module is used for fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix;
the optimizing module is used for performing probabilistic density-domain optimization on the decoding feature matrix based on local feature distribution to obtain an optimized decoding feature matrix;
the result generation module is used for passing the optimized decoding feature matrix through a generator to obtain a decoded voice signal;
wherein the optimizing module includes:
the block segmentation unit is used for performing block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices;
the mean pooling unit is used for performing mean pooling on each of the plurality of decoding sub-block feature matrices to obtain a plurality of decoding sub-block global semantic feature vectors;
the per-position average unit is used for calculating a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector;
the relative density unit is used for calculating, for each of the plurality of decoding sub-block global semantic feature vectors, the cross entropy between that vector and the decoded global semantic pivot feature vector, so as to obtain a local feature distribution relative density semantic feature vector composed of the plurality of cross entropy values;
the activating unit is used for inputting the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector;
the weighting unit is used for weighting each decoding sub-block feature matrix by the feature value at the corresponding position of the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices;
and the splicing unit is used for splicing the plurality of weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
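Outside the claim language, and purely as an illustration, the following is a minimal sketch of this optimization in Python, assuming PyTorch, a square decoding feature matrix whose side length divides evenly by the block size, and one reasonable reading of the claimed cross entropy (computed after softmax normalization of each vector); none of these choices are fixed by the claims.

```python
import torch
import torch.nn.functional as F

def optimize_decoding_matrix(m: torch.Tensor, block: int = 8) -> torch.Tensor:
    h, w = m.shape
    # Block segmentation: split the matrix into (h//block)*(w//block) sub-blocks.
    blocks = (m.unfold(0, block, block).unfold(1, block, block)
                .reshape(-1, block, block))
    # Mean pooling each sub-block along its rows gives one global semantic
    # feature vector per sub-block (an assumed realization of the pooling).
    vecs = blocks.mean(dim=1)                    # (N, block)
    pivot = vecs.mean(dim=0)                     # global per-position mean vector
    # Cross entropy of each sub-block vector against the pivot vector.
    p = F.softmax(vecs, dim=1)
    log_q = F.log_softmax(pivot, dim=0)
    ce = -(p * log_q).sum(dim=1)                 # (N,) relative-density values
    weights = F.softmax(ce, dim=0)               # probabilistic feature vector
    # Weight each sub-block matrix by its probability, then splice back together.
    weighted = blocks * weights.view(-1, 1, 1)
    n_h, n_w = h // block, w // block
    return (weighted.reshape(n_h, n_w, block, block)
                    .permute(0, 2, 1, 3).reshape(h, w))

optimized = optimize_decoding_matrix(torch.randn(32, 32))
print(optimized.shape)  # torch.Size([32, 32])
```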
2. The medical intercom management system for a ward of claim 1, wherein the first time-frequency feature extraction module is configured to:
pass input data through each layer of the first convolutional neural network model serving as the feature extractor, each layer performing the following operations on its input data in its forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling processing on the convolution feature map along the channel dimension to obtain a pooled feature map; and
performing nonlinear activation on the pooled feature map to obtain an activation feature map;
wherein the output of the last layer of the first convolutional neural network model serving as the feature extractor is the first sound feature matrix, and the input of the first layer of the first convolutional neural network model serving as the feature extractor is the first multichannel time-frequency map.
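For illustration only, a single layer of this kind might look as follows in PyTorch; realizing the channel-dimension pooling as an average over groups of adjacent channels is an assumption, as are the channel counts, since the claim does not fix either.

```python
import torch
import torch.nn as nn

class ConvChannelPoolLayer(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, group: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * group, kernel_size=3, padding=1)
        self.group = group
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)                               # convolution feature map
        b, c, h, w = y.shape
        # Channel-dimension pooling: average every `group` adjacent channels.
        y = y.view(b, c // self.group, self.group, h, w).mean(dim=2)
        return self.act(y)                             # activation feature map

layer = ConvChannelPoolLayer(3, 16)
print(layer(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```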
3. The medical intercom management system for a ward of claim 2, wherein the second time-frequency feature extraction module is configured to:
pass input data through each layer of the second convolutional neural network model serving as the feature extractor, each layer performing the following operations on its input data in its forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling processing on the convolution feature map along the channel dimension to obtain a pooled feature map; and
performing nonlinear activation on the pooled feature map to obtain an activation feature map;
wherein the output of the last layer of the second convolutional neural network model serving as the feature extractor is the second sound feature matrix, and the input of the first layer of the second convolutional neural network model serving as the feature extractor is the second multichannel time-frequency map.
4. The medical intercom management system for a ward of claim 3, wherein the fusion module is configured to:
fuse the first sound feature matrix and the second sound feature matrix with the following fusion formula to obtain the decoding feature matrix;
wherein the fusion formula is:

$F = \lambda \cdot M_1 \oplus (1 - \lambda) \cdot M_2$

wherein $F$ is the decoding feature matrix, $M_1$ is the first sound feature matrix, $M_2$ is the second sound feature matrix, "$\oplus$" means that elements at corresponding positions of the first sound feature matrix and the second sound feature matrix are added, and $\lambda$ is a weighting parameter for controlling the balance between the first sound feature matrix and the second sound feature matrix.
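As a small numerical check of this fusion, the following uses the $\lambda$-weighted position-wise addition reconstructed above, with $\lambda = 0.7$ chosen arbitrarily:

```python
import torch

m1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # first sound feature matrix
m2 = torch.tensor([[0.0, 2.0], [4.0, 6.0]])   # second sound feature matrix
lam = 0.7
f = lam * m1 + (1.0 - lam) * m2               # decoding feature matrix
print(f)  # tensor([[0.7000, 2.0000], [3.3000, 4.6000]])
```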
5. A medical intercom management method for a ward, comprising:
acquiring a first sound signal of a first audio receiver and a second sound signal of a second audio receiver of the intercom;
calculating a first time domain enhancement map, a first SIFT-transform time-frequency map and a first S-transform time-frequency map of the first sound signal;
aggregating the first time domain enhancement map, the first SIFT-transform time-frequency map and the first S-transform time-frequency map along a channel dimension to obtain a first multichannel time-frequency map;
passing the first multichannel time-frequency map through a first convolutional neural network model serving as a feature extractor to obtain a first sound feature matrix;
calculating a second time domain enhancement map, a second SIFT-transform time-frequency map and a second S-transform time-frequency map of the second sound signal;
aggregating the second time domain enhancement map, the second SIFT-transform time-frequency map and the second S-transform time-frequency map along a channel dimension to obtain a second multichannel time-frequency map;
passing the second multichannel time-frequency map through a second convolutional neural network model serving as a feature extractor to obtain a second sound feature matrix;
fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix;
performing probabilistic density-domain optimization based on local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix; and
passing the optimized decoding feature matrix through a generator to obtain a decoded voice signal;
the method for performing the probability of the density domain based on the local feature distribution on the decoding feature matrix to obtain an optimized decoding feature matrix comprises the following steps:
performing block segmentation on the decoding feature matrix to obtain a plurality of decoding sub-block feature matrices;
performing mean pooling on each of the plurality of decoding sub-block feature matrices to obtain a plurality of decoding sub-block global semantic feature vectors;
calculating a global per-position mean vector of the plurality of decoding sub-block global semantic feature vectors to obtain a decoded global semantic pivot feature vector;
calculating, for each of the plurality of decoding sub-block global semantic feature vectors, the cross entropy between that vector and the decoded global semantic pivot feature vector, so as to obtain a local feature distribution relative density semantic feature vector composed of the plurality of cross entropy values;
inputting the local feature distribution relative density semantic feature vector into a Softmax activation function to obtain a local feature distribution relative density probabilistic feature vector;
weighting each decoding sub-block feature matrix by the feature value at the corresponding position of the local feature distribution relative density probabilistic feature vector to obtain a plurality of weighted decoding sub-block feature matrices; and
splicing the plurality of weighted decoding sub-block feature matrices to obtain the optimized decoding feature matrix.
6. The method of claim 5, wherein passing the first multichannel time-frequency map through the first convolutional neural network model serving as a feature extractor to obtain the first sound feature matrix comprises:
passing input data through each layer of the first convolutional neural network model serving as the feature extractor, each layer performing the following operations on its input data in its forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling processing on the convolution feature map along the channel dimension to obtain a pooled feature map; and
performing nonlinear activation on the pooled feature map to obtain an activation feature map;
wherein the output of the last layer of the first convolutional neural network model serving as the feature extractor is the first sound feature matrix, and the input of the first layer of the first convolutional neural network model serving as the feature extractor is the first multichannel time-frequency map.
7. The method of claim 6, wherein passing the second multichannel time-frequency map through the second convolutional neural network model serving as a feature extractor to obtain the second sound feature matrix comprises:
passing input data through each layer of the second convolutional neural network model serving as the feature extractor, each layer performing the following operations on its input data in its forward pass:
performing convolution processing on the input data to obtain a convolution feature map;
performing pooling processing on the convolution feature map along the channel dimension to obtain a pooled feature map; and
performing nonlinear activation on the pooled feature map to obtain an activation feature map;
wherein the output of the last layer of the second convolutional neural network model serving as the feature extractor is the second sound feature matrix, and the input of the first layer of the second convolutional neural network model serving as the feature extractor is the second multichannel time-frequency map.
8. The medical intercom management method for a ward of claim 7, wherein fusing the first sound feature matrix and the second sound feature matrix to obtain a decoding feature matrix comprises:
fusing the first sound feature matrix and the second sound feature matrix with the following fusion formula to obtain the decoding feature matrix;
wherein the fusion formula is:

$F = \lambda \cdot M_1 \oplus (1 - \lambda) \cdot M_2$

wherein $F$ is the decoding feature matrix, $M_1$ is the first sound feature matrix, $M_2$ is the second sound feature matrix, "$\oplus$" means that elements at corresponding positions of the first sound feature matrix and the second sound feature matrix are added, and $\lambda$ is a weighting parameter for controlling the balance between the first sound feature matrix and the second sound feature matrix.