CN112826446A - Medical scanning voice enhancement method, device, system and storage medium - Google Patents

Medical scanning voice enhancement method, device, system and storage medium

Info

Publication number
CN112826446A
Authority
CN
China
Prior art keywords
microphone
determining
sound
space coordinate
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011622711.4A
Other languages
Chinese (zh)
Inventor
史宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai United Imaging Healthcare Co Ltd
Original Assignee
Shanghai United Imaging Healthcare Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai United Imaging Healthcare Co Ltd filed Critical Shanghai United Imaging Healthcare Co Ltd
Priority to CN202011622711.4A priority Critical patent/CN112826446A/en
Publication of CN112826446A publication Critical patent/CN112826446A/en
Pending legal-status Critical Current

Classifications

    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/48Other medical applications
    • A61B5/4803Speech analysis specially adapted for diagnostic purposes

Abstract

The invention relates to a medical scanning voice enhancement method, device, system and storage medium. The method comprises the following steps: acquiring a first image, the space coordinate corresponding to a microphone and the corresponding sound signal; inputting the first image into a positioning model and determining a predicted space coordinate, wherein the positioning model is obtained by training on a first image training set; determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the microphone; and determining a synthesized voice signal according to the sound reception distance and the sound signal corresponding to the microphone. The invention uses the image to locate the mouth of the examinee, then combines the space coordinate information to determine the distance from the mouth to each of the plurality of microphones as the corresponding sound reception distance. The sound reception distance indicates how useful the sound signal received by each microphone is, which enables efficient and accurate sound synthesis, suppresses the influence of background noise, and improves the efficiency and convenience of medical scanning.

Description

Medical scanning voice enhancement method, device, system and storage medium
Technical Field
The present invention relates to the field of medical scanning, and in particular, to a method, an apparatus, a system, and a storage medium for enhancing medical scanning speech.
Background
In medical imaging examinations, the patient and the technician need to communicate through a voice intercom to complete the scan, and efficient intercom communication can speed up scanning and improve both the scanning workflow and the scanning results. However, communication between the technician and the patient is often subject to acoustic interference, such as the operating sounds of the instrument, background noise, intercom echoes, and interference from several people speaking at once. Specifically, in magnetic resonance, the noise includes that of the cold head (cooler) and of the gradient coils during operation. This noise interferes with normal communication between the patient and the technician, which lowers scanning efficiency and can also degrade the resulting scan images. In addition, to improve pickup of the patient's speech, several microphones are often arranged in the instrument. During scanning, sound is emitted from the patient's mouth, but the patient bed is often moved, so the position of the sound source to be captured changes constantly, noise data keeps increasing, and accurately capturing voice information becomes more difficult.
In summary, how to efficiently collect the voice of the examinee in the medical image examination process is an urgent problem to be solved.
Disclosure of Invention
In view of the above, there is a need to provide a method, an apparatus, a system and a storage medium for enhancing medical scanning voice, so as to solve the problem of how to efficiently acquire the voice of the subject during the medical image examination in the prior art.
The invention provides a medical scanning voice enhancement method, which comprises the following steps:
acquiring a first image, a space coordinate corresponding to at least one microphone and a corresponding sound signal;
inputting the first image into a well-trained positioning model, and determining a predicted space coordinate of a first region of interest of a detected person, wherein the positioning model is obtained by training based on a first image training set;
determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone;
and determining a synthesized voice signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone.
Further, the determining a synthesized voice signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone specifically includes:
determining at least one corresponding sound beam forming wavelength according to the at least one sound reception distance;
and according to the at least one sound beam forming wavelength, combining and enhancing the sound signals corresponding to the at least one microphone, and determining the synthesized voice signal.
Further, the determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone specifically includes:
determining the corresponding at least one sound reception distance according to the coordinate difference between the predicted space coordinate and the space coordinate corresponding to the at least one microphone.
Further, the determining a synthesized voice signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone specifically includes:
determining at least one corresponding sound reception weight according to the at least one sound reception distance;
determining the synthesized voice signal according to the at least one sound reception weight.
Further, the training process of the positioning model comprises:
acquiring a first image training set containing labeling information, wherein the labeling information comprises actual space coordinates of a first region of interest of a detected person;
inputting the first image training set into a positioning model, and determining the corresponding prediction space coordinate;
finishing the training of the positioning model according to the error between the actual space coordinate and the predicted space coordinate, and storing the positioning model;
wherein the first training set of images includes a plurality of the first images, the first images being medical images including information of a first region of interest of a subject.
Further, the determining the corresponding at least one sound reception weight according to the at least one sound reception distance includes: determining at least one corresponding sound reception weight according to the square of the at least one sound reception distance.
Further, the synthesized speech signal is determined by the following formula:
$$K = k_1 w_1 + k_2 w_2 + \cdots + k_n w_n$$
$$w_n = s_n^2$$
where $K$ is the synthesized speech signal, $k_n$ is the sound signal corresponding to the nth microphone, n is an integer, $w_n$ is the sound reception weight corresponding to the nth microphone, and $s_n$ is the sound reception distance corresponding to the nth microphone, so that $s_n^2$ is the square of that distance.
The invention also provides a medical scanning voice enhancement device, comprising:
the acquisition unit is used for acquiring the first image, the space coordinate corresponding to at least one microphone and the corresponding sound signal;
the processing unit is used for inputting the first image into a well-trained positioning model and determining the predicted space coordinates of a first region of interest of a detected person, wherein the positioning model is obtained by training based on a first image training set; the processing unit is also used for determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone;
and the synthesis unit is used for determining a synthesized voice signal according to the at least one sound receiving distance and the sound signal corresponding to the at least one microphone.
The invention also provides a medical scanning voice enhancement system, which comprises an image scanning device, at least one microphone and the medical scanning voice enhancement device, wherein the image scanning device is used for acquiring a first image, and the at least one microphone is used for acquiring at least one path of sound signals.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a medical scanning speech enhancement method as described above.
Compared with the prior art, the invention has the beneficial effects that: in the medical scanning voice enhancement method, firstly, a first image, a space coordinate corresponding to a microphone and a sound signal corresponding to the microphone are obtained, so that multiple information of the image, the voice and the space coordinate is effectively combined, and the accuracy of subsequent voice synthesis is ensured; furthermore, the first image is input into the positioning model to determine the prediction space coordinate, so that the complexity of identifying the first region of interest is reduced, and the accuracy and the speed of identifying the first region of interest are improved; then, according to the predicted space coordinate and the space coordinate corresponding to the microphone, determining a corresponding sound receiving distance, and effectively reflecting the position relation between the first region of interest of the examinee and the microphone; finally, voice synthesis is carried out by utilizing the sound signals of the plurality of microphones and the corresponding reception distances, the synthesized voice signals are efficiently determined, the effective degree of the sound signals is judged according to the position relation of the microphones and the first region of interest of the examinee, and the accuracy of the synthesized voice signals is ensured. 
In summary, the invention uses the image to determine the predicted spatial coordinates of the first region of interest of the examinee, then combines the spatial coordinate information to determine the distances from the first region of interest to the plurality of microphones as the corresponding sound reception distances. These distances indicate how useful the sound signal received by each microphone is, which enables efficient and accurate sound synthesis, suppresses the influence of background noise, allows the voice of the examinee to be identified accurately, reduces communication barriers between the technician and the examinee during scanning, and improves the efficiency and convenience of medical scanning.
Drawings
Fig. 1 is a schematic flow chart of a medical scanning speech enhancement method according to an embodiment of the present invention;
Fig. 2 is a first schematic flow chart of determining a synthesized speech signal according to an embodiment of the present invention;
Fig. 3 is a second schematic flow chart of determining a synthesized speech signal according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart of a positioning model training method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a positioning model according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a medical scanning speech enhancement apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a medical scanning speech enhancement system according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
An embodiment of the present invention provides a medical scanning speech enhancement method, and referring to fig. 1, fig. 1 is a schematic flow chart of the medical scanning speech enhancement method provided by the embodiment of the present invention, where the provided medical scanning speech enhancement method includes steps S101 to S104, where:
in step S101, a first image, spatial coordinates corresponding to at least one microphone, and a corresponding sound signal are acquired;
in step S102, inputting a first image into a well-trained positioning model, and determining a predicted spatial coordinate of a first region of interest of a subject, wherein the positioning model is trained based on a first image training set;
in step S103, determining at least one sound reception distance according to the predicted spatial coordinates and the spatial coordinates corresponding to the at least one microphone;
in step S104, a synthesized speech signal is determined according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone.
Therefore, firstly, a first image, a space coordinate corresponding to a microphone and a sound signal corresponding to the microphone are obtained, so that various information of the image, the voice and the space coordinate is effectively combined, and the accuracy of subsequent voice synthesis is ensured; furthermore, the first image is input into the positioning model, the first region of interest of the examinee is generally a mouth, and therefore the prediction space coordinates of the mouth are determined, the complexity of mouth recognition is reduced, and meanwhile the accuracy and the rapidness of mouth recognition are improved; then, according to the predicted space coordinate and the space coordinate corresponding to the microphone, determining a corresponding sound receiving distance, and effectively reflecting the position relation between the mouth of the detected person and the microphone; finally, voice synthesis is carried out by utilizing the sound signals of the plurality of microphones and the corresponding reception distances, the synthesized voice signals are determined efficiently, the effective degree of the sound signals is judged according to the position relation of the microphones and the mouths of the examinees, the multi-path microphone signals are combined, interference signals in a non-target direction are inhibited, the sound signals in a target direction are enhanced, and the accuracy of the synthesized voice signals is ensured. It should be noted that the positioning model is determined by training through a deep learning method.
It should be noted that the first image in the embodiment of the present invention may be a medical scanning image including information of the mouth of the subject, such as a T1 image or a T2 image in magnetic resonance. T1, also called spin-lattice relaxation, is the exponential recovery of the magnetization component parallel to the external magnetic field B0, during which the nuclear spin system in a high-energy state transfers energy to the lattice or solvent. T2, also called spin-spin relaxation, is the decay of the magnetization component perpendicular to B0. The mouth of the examinee can therefore be located from several kinds of medical scanning images, promptly and quickly during the scanning process.
Alternatively, the first image in the embodiment of the present invention may be a camera-captured image including information of the mouth of the subject, where the camera-captured image includes a color image and/or a depth image. The mouth of the examinee can therefore be located from several kinds of camera-captured images, promptly and quickly during the scanning process.
In the embodiment of the present invention, before obtaining the sound signal of at least one microphone, a preprocessing step is performed on the initial sound signal, where the preprocessing step includes, but is not limited to, a sound denoising process and an echo cancellation process. The sound denoising processing can be performed by a deep learning method. In a specific embodiment of the present invention, initial sound signals of a plurality of microphones are obtained, and based on an RNN audio noise reduction algorithm, a GRU/LSTM model is used to perform a noise reduction process on the initial sound signals, so as to determine a sound signal of a corresponding at least one microphone.
In an embodiment of the present invention, the step S103 specifically includes:
and determining at least one corresponding sound receiving distance according to the coordinate difference between the predicted space coordinate and the space coordinate corresponding to the at least one microphone.
Therefore, the sound reception distance corresponding to each microphone, that is, the distance between that microphone and the mouth of the examinee, is determined from the predicted space coordinates and the space coordinates of the microphone. Generally speaking, the closer a microphone is to the mouth of the examinee, the more accurate the received voice signal and the lower the noise. The sound reception distance thus reflects the positional relationship between the mouth of the examinee and each microphone, indicates how accurate the voice signal received by each microphone is, and supports the subsequent voice signal synthesis. In a specific embodiment of the present invention, the coordinate difference along each axis between the predicted spatial coordinates and the spatial coordinates corresponding to at least one microphone is calculated first; the sound reception distance between the mouth of the subject and the microphone is then determined from the coordinate differences along the respective axes.
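The distance step can be illustrated with a minimal sketch. It assumes the predicted mouth coordinate and the microphone coordinates share one Cartesian frame and that the distance is Euclidean; the source only says the distance is derived from per-axis coordinate differences, so the norm used here, the function name `reception_distances`, and the sample coordinates are all hypothetical.

```python
import numpy as np

def reception_distances(mouth_xyz, mic_xyz):
    """Distance from the predicted mouth position to each microphone.

    Assumption: Euclidean distance in a shared Cartesian frame.
    """
    mouth = np.asarray(mouth_xyz, dtype=float)   # shape (3,)
    mics = np.asarray(mic_xyz, dtype=float)      # shape (n_mics, 3)
    diff = mics - mouth                          # per-axis coordinate differences
    return np.sqrt(np.sum(diff ** 2, axis=1))    # shape (n_mics,)

# Hypothetical coordinates, in arbitrary but consistent units.
mouth = [0.0, 0.0, 0.0]
mics = [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
distances = reception_distances(mouth, mics)
```

With these sample values the two distances come out as 5 and 2, so the second microphone is the closer one.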
In an embodiment of the present invention, referring to fig. 2, fig. 2 is a schematic flowchart illustrating a process of determining a synthesized speech signal according to an embodiment of the present invention, where the step S104 includes steps S1041 to S1042, where:
in step S1041, determining at least one corresponding sound reception weight according to at least one sound reception distance;
in step S1042, a sound signal corresponding to at least one microphone is weighted and summed according to at least one sound collection weight to determine a synthesized speech signal.
Therefore, the sound receiving distance is used as a reference for synthesis, the corresponding sound receiving weight is determined, the sound signals are synthesized by the sound receiving weight, the multi-path microphone signals are combined, the interference signals in the non-target direction are suppressed, the sound signals in the target direction are enhanced, and the synthesized voice signals are more accurate.
In an embodiment of the present invention, the step S1041 specifically includes: and determining at least one corresponding sound reception weight according to the square of the at least one sound reception distance. Therefore, the corresponding sound receiving weight is determined according to the square of the sound receiving distance, so that the accuracy of sound signals received by different microphones is fed back by the distance, and the subsequent effective voice signal synthesis is facilitated.
In an embodiment of the present invention, the synthesized speech signal is determined by the following formula:
$$K = k_1 w_1 + k_2 w_2 + \cdots + k_n w_n$$
$$w_n = s_n^2$$
where $K$ is the synthesized speech signal, $k_n$ is the sound signal corresponding to the nth microphone, n is an integer, $w_n$ is the sound reception weight corresponding to the nth microphone, and $s_n$ is the sound reception distance corresponding to the nth microphone, so that $s_n^2$ is the square of that distance.
Therefore, effective sound signal synthesis is carried out through the formula, firstly, the sound receiving weight corresponding to each microphone is calculated according to the sound receiving distance of each microphone, and then the sound signals of each microphone and the corresponding sound receiving weight are weighted and added, so that the final synthesized sound signal is determined, the effective degree of the sound signal is judged according to the position relation between the microphone and the mouth of the examinee, and the accuracy of the synthesized sound signal is ensured.
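The weighted sum can be sketched as follows. The default weight follows the formula exactly as stated (the weight equals the squared sound reception distance); the `weight_fn` parameter is a hypothetical addition so that a different distance-to-weight mapping can be substituted without changing the summation. The function name and sample values are illustrative only.

```python
import numpy as np

def synthesize(signals, distances, weight_fn=lambda s: s ** 2):
    """Weighted sum of microphone signals, K = sum_n k_n * w_n.

    signals:   array of shape (n_mics, n_samples), one row per microphone
    distances: array of shape (n_mics,), sound reception distances s_n
    weight_fn: maps distances to weights; the default is w_n = s_n^2,
               as written in the formula (hypothetical hook otherwise)
    """
    k = np.asarray(signals, dtype=float)
    s = np.asarray(distances, dtype=float)
    w = weight_fn(s)                 # one weight per microphone
    return w @ k                     # shape (n_samples,)

# Two microphones, two samples each (made-up numbers).
signals = [[1.0, 2.0], [3.0, 4.0]]
distances = [2.0, 1.0]
K = synthesize(signals, distances)
```

Here the weights are 4 and 1, so the first output sample is 4*1 + 1*3 and the second is 4*2 + 1*4.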
It should be noted that, in a specific example of the present invention, the microphones are preferably omnidirectional microphones, and the distance between the spatial coordinates of the mouth and each microphone is calculated as the effective sound collection distance. A directional microphone has different sensitivity to sounds arriving from different angles, so the spatial distance alone often cannot reflect the actual sound collection effect. With omnidirectional microphones the calculation is relatively simple, and the spatial distance between a microphone and the mouth tracks the sound collection effect: the closer the distance, the better the sound collection effect.
In an embodiment of the present invention, referring to fig. 3, fig. 3 is a schematic flowchart illustrating a second process of determining a synthesized speech signal according to an embodiment of the present invention, where the step S104 includes steps S1043 to S1044, where:
in step S1043, determining at least one corresponding sound beam forming wavelength according to at least one sound reception distance;
in step S1044, the sound signals corresponding to the at least one microphone are enhanced according to the at least one sound beam forming wavelength, and a synthesized speech signal is determined.
Therefore, voice data is enhanced through a beam forming method, sound signals of multiple paths of microphones are combined, optimal weights are determined by using a constraint formula, interference signals in non-target directions are suppressed, and sound signals in target directions are enhanced. And weighting, summing and filtering output signals of each array element, and finally outputting a voice signal in a desired direction, namely forming a beam.
In this embodiment of the present invention, step S1044 specifically includes: determining an initial weight matrix based on at least one sound beam forming wavelength; optimizing the initial weight matrix according to the constraint conditions to determine an optimal weight matrix; and according to the optimal weight matrix, combining and enhancing the sound signals corresponding to at least one microphone, and determining a synthesized voice signal. Therefore, the initial weight matrix is continuously optimized by utilizing the constraint condition to ensure that the weight corresponding to the sound signal received by each microphone is optimal, so as to form an effective voice signal.
In a specific embodiment of the present invention, the algorithms specifically related to step S1044 include a Minimum Variance Distortionless Response (MVDR) beam forming algorithm and a Linear Constrained Minimum Variance (LCMV) beam forming algorithm. It is understood that the algorithm for determining the optimal weight matrix according to the present invention includes, but is not limited to, the MVDR algorithm and the LCMV algorithm described above, as long as the effective solution of the optimal weight matrix can be performed.
The MVDR algorithm comprises the following specific steps: determining the sound beam forming wavelengths between the plurality of microphones and the mouth of the examinee according to the sound reception distances; determining an initial weight matrix according to the sound beam forming wavelengths corresponding to the plurality of microphones; determining a first constraint formula according to the sound beam forming wavelengths and the initial weight matrix; constructing a criterion function under the first constraint formula by the Lagrange multiplier method, solving the criterion function, and determining an optimal weight matrix, wherein the optimal weight matrix comprises the weight of the sound signal corresponding to each microphone; and combining the sound signals corresponding to the microphones by using the optimal weight matrix to obtain an effective and accurate synthesized voice signal.
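A minimal narrowband sketch of the MVDR solution is given below. The source does not specify how the "sound beam forming wavelength" is derived from the distance, so this sketch makes its own assumptions: a near-field steering vector built from the sound reception distances, a single frequency bin, a speed of sound of 343 m/s, and a small diagonal loading term for numerical stability. The names `steering_vector`, `mvdr_weights`, and `diag_load` are hypothetical. The closed form used, w = R^-1 a / (a^H R^-1 a), is the standard result of minimizing w^H R w subject to w^H a = 1 with a Lagrange multiplier.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def steering_vector(distances, freq_hz):
    """Assumed near-field model: phase delay and 1/distance attenuation."""
    s = np.asarray(distances, dtype=float)
    delay = s / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delay) / s

def mvdr_weights(R, a, diag_load=1e-6):
    """Solve min w^H R w subject to w^H a = 1."""
    n = R.shape[0]
    # Diagonal loading (an added assumption) keeps R well conditioned.
    R_reg = R + diag_load * np.trace(R).real / n * np.eye(n)
    Ri_a = np.linalg.solve(R_reg, a)       # R^-1 a without an explicit inverse
    return Ri_a / (a.conj() @ Ri_a)        # normalise by a^H R^-1 a

# Synthetic example: three microphones, one frequency bin, random noise frames.
rng = np.random.default_rng(0)
distances = np.array([0.30, 0.45, 0.60])   # metres, made-up values
a = steering_vector(distances, freq_hz=1000.0)
noise = rng.standard_normal((3, 500)) + 1j * rng.standard_normal((3, 500))
R = noise @ noise.conj().T / noise.shape[1]  # sample covariance matrix
w = mvdr_weights(R, a)
response = w.conj() @ a                      # distortionless constraint
```

The quantity `response` checks the distortionless constraint: the beamformer passes the target direction with unit gain while minimizing output power from everything else. For a matrix `X` of frames in the same frequency bin, the enhanced output would be `w.conj() @ X`.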
The LCMV algorithm comprises the following specific steps: determining sound beam forming wavelengths of a plurality of microphones and the mouth of the examinee according to the sound receiving distance; determining an initial weight matrix according to sound beam forming wavelengths corresponding to a plurality of microphones; optimizing the initial weight matrix by a Lagrange multiplier method to minimize the output energy of the sound signals of all the microphones and determine an optimal weight matrix; and synthesizing the sound signals corresponding to the microphones by using the optimal weight matrix so as to synthesize effective and accurate synthesized voice signals.
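For reference, the standard LCMV problem and its Lagrange-multiplier solution can be written as below. The symbols are assumptions introduced here and do not appear in the source: $\mathbf{R}$ is the covariance matrix of the microphone signals, $\mathbf{C}$ is a constraint matrix whose columns are steering vectors, $\mathbf{f}$ is the vector of desired responses, and $\boldsymbol{\lambda}$ is the multiplier vector.

```latex
\begin{aligned}
&\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
  \quad \text{subject to} \quad \mathbf{C}^{H}\mathbf{w} = \mathbf{f}, \\[4pt]
&\mathcal{L}(\mathbf{w}, \boldsymbol{\lambda})
  = \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
  + \boldsymbol{\lambda}^{H}\bigl(\mathbf{C}^{H}\mathbf{w} - \mathbf{f}\bigr)
  + \bigl(\mathbf{C}^{H}\mathbf{w} - \mathbf{f}\bigr)^{H}\boldsymbol{\lambda}, \\[4pt]
&\mathbf{w}_{\mathrm{opt}}
  = \mathbf{R}^{-1}\mathbf{C}\,
    \bigl(\mathbf{C}^{H}\mathbf{R}^{-1}\mathbf{C}\bigr)^{-1}\mathbf{f}.
\end{aligned}
```

With a single constraint, $\mathbf{C} = \mathbf{a}$ and $\mathbf{f} = 1$, this reduces to the MVDR weights $\mathbf{R}^{-1}\mathbf{a} / (\mathbf{a}^{H}\mathbf{R}^{-1}\mathbf{a})$.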
In an embodiment of the present invention, the medical scanning speech enhancement method further includes: determining the optimal weight corresponding to the sound signal received by each microphone according to the optimal weight matrix; determining a combination weight corresponding to the sound signal received by each microphone according to the optimal weight and the sound reception weight; and according to the combined weight, the sound signals corresponding to each microphone are weighted and summed to determine a synthesized speech signal. Therefore, by combining the two methods for determining the weight (one is the sound reception weight determined according to the square of the distance, and the other is the optimal weight determined according to the sound beam forming wavelength), the combined weight is formed by combining the advantages of the two algorithms, the multi-path microphone signals are further combined, the interference signals in the non-target direction are suppressed, and the sound signals in the target direction are enhanced. It can be understood that, no matter the reception weight, the optimal weight or the combination weight, the method is suitable for combining the sound signals of the multiple microphones, and in the practical application process, the selection is performed according to the practical situation, as long as the effective weighting and combining processing can be performed.
An embodiment of the present invention provides a positioning model training method, and with reference to fig. 4, fig. 4 is a schematic flow chart of the positioning model training method provided in the embodiment of the present invention, where the provided positioning model training method includes steps S201 to S203, where:
in step S201, a first image training set including annotation information is obtained, where the annotation information includes actual spatial coordinates of a first region of interest of a subject;
in step S202, a first image training set is input into a positioning model, and a corresponding prediction space coordinate is determined;
in step S203, completing training of the positioning model according to an error between the actual spatial coordinate and the predicted spatial coordinate, and storing the positioning model;
the first image training set comprises a plurality of first images, and the first images are medical scanning images or camera shooting images comprising first region-of-interest information of a detected person.
Therefore, firstly, a medical scanning image or a camera shooting image including first interested area information (generally speaking mouth image information) is obtained to form a corresponding first image training set, and the mouth information of the examinee of each medical scanning image or camera shooting image (namely the first image) is labeled to clarify the actual space coordinates; then, inputting the first image training set into a positioning model, and determining the predicted space coordinates of the mouth part in a deep learning mode; and finally, training the positioning model by utilizing the error between the actual space coordinate and the prediction space coordinate, determining the corresponding model parameter, and storing the model parameter, so that the positioning model is directly utilized to quickly position the acquired first image in the subsequent scanning process, the prediction space coordinate corresponding to the mouth of the detected person is efficiently determined, and the synthetic voice signal is conveniently and quickly generated.
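The training procedure (predict coordinates, measure the error against the labelled coordinates, update the parameters, store the model) can be sketched in a few lines. To keep the example self-contained, a linear regressor on flattened synthetic images stands in for the deep network, the error is taken to be mean squared error, and the parameters are saved to an in-memory buffer; all of these are assumptions for illustration, not the method described in the source.

```python
import io
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the first image training set with labelled coordinates.
n_images, n_pixels = 64, 16 * 16
images = rng.standard_normal((n_images, n_pixels))    # flattened images
true_map = 0.1 * rng.standard_normal((n_pixels, 3))
actual_xyz = images @ true_map                        # labelled (x, y, z)

# Linear stand-in for the positioning model (assumption, not a CNN).
W = np.zeros((n_pixels, 3))
b = np.zeros(3)

def mse(pred, target):
    """Mean squared error between predicted and actual coordinates."""
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(images @ W + b, actual_xyz)
learning_rate = 0.01
for _ in range(200):
    pred_xyz = images @ W + b                 # step S202: predict coordinates
    err = pred_xyz - actual_xyz               # step S203: coordinate error
    W -= learning_rate * 2.0 * images.T @ err / n_images
    b -= learning_rate * 2.0 * err.mean(axis=0)
final_loss = mse(images @ W + b, actual_xyz)

# Store the trained parameters (in memory here; a file path also works).
buffer = io.BytesIO()
np.savez(buffer, W=W, b=b)
```

In practice the linear map would be replaced by the convolutional positioning model, with the same loop structure of prediction, error computation, and parameter update.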
In an embodiment of the present invention, referring to fig. 5, fig. 5 is a schematic structural diagram of a positioning model according to an embodiment of the present invention, where the positioning model includes an input layer, a first convolution layer, a first normalization layer, a first pooling layer, a second convolution layer, a second normalization layer, a second pooling layer, a third convolution layer, a third normalization layer, a third pooling layer, a fully-connected layer, a loss layer, and an output layer, which are sequentially connected. Therefore, a plurality of first images in the first image training set are input into the deep learning neural network with complete training, the recognized prediction space coordinates are output, the three-layer CNN neural network is utilized to realize the part recognition function, the network model is low in complexity and high in operation speed, the mouth of the examinee can be quickly positioned, the subsequent quick synthesis of related voice signals is facilitated, and the real-time performance of communication between the examinee and a technician is guaranteed.
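The layer sequence described above can be sketched in PyTorch, which the description lists as an alternative framework. Only the ordering (three convolution, normalization, and pooling stages followed by a fully-connected layer and a loss) comes from the text. The channel widths, kernel sizes, 64x64 single-channel input, ReLU activations, batch normalization as the normalization layer, and mean squared error as the loss are all assumptions, and the class name `PositioningNet` is hypothetical.

```python
import torch
from torch import nn

class PositioningNet(nn.Module):
    """Three conv/norm/pool stages followed by a fully-connected layer."""

    def __init__(self, in_channels=1, image_size=64):
        super().__init__()

        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # convolution layer
                nn.BatchNorm2d(c_out),                             # normalization layer
                nn.ReLU(),                                         # assumed activation
                nn.MaxPool2d(2),                                   # pooling layer
            )

        self.features = nn.Sequential(
            stage(in_channels, 16), stage(16, 32), stage(32, 64)
        )
        side = image_size // 8            # three 2x poolings shrink each side by 8
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * side * side, 3),   # fully-connected layer -> (x, y, z)
        )

    def forward(self, x):
        return self.head(self.features(x))

model = PositioningNet()
images = torch.randn(4, 1, 64, 64)        # batch of synthetic first images
target_xyz = torch.randn(4, 3)            # synthetic labelled coordinates
pred_xyz = model(images)                  # output layer: predicted coordinates
loss = nn.functional.mse_loss(pred_xyz, target_xyz)   # loss layer (assumed MSE)
```

Each image yields one three-component coordinate, and the scalar loss can be back-propagated with `loss.backward()` inside a training loop.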
It should be noted that the positioning model provided by the embodiment of the present invention is not limited to a three-layer CNN; optional networks also include ResNet18, VGG16, and ResNet50. In addition, in the embodiment of the present invention, the deep learning neural network can be implemented using TensorFlow; alternative frameworks include, but are not limited to, Caffe and PyTorch.
An embodiment of the present invention provides a medical scanning speech enhancement apparatus. Fig. 6 is a schematic structural diagram of the medical scanning speech enhancement apparatus provided in the embodiment of the present invention. The medical scanning speech enhancement apparatus 600 includes:
an acquiring unit 601, configured to acquire a first image, spatial coordinates corresponding to at least one microphone, and a corresponding sound signal;
a processing unit 602, configured to input the first image into a well-trained positioning model and determine predicted spatial coordinates of a first region of interest of a subject, where the positioning model is trained based on a first image training set; the processing unit 602 is further configured to determine at least one sound reception distance according to the predicted spatial coordinates and the spatial coordinates corresponding to the at least one microphone;
a synthesizing unit 603, configured to determine a synthesized speech signal according to the at least one sound receiving distance and the sound signal corresponding to the at least one microphone.
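The division of work among the acquiring, processing, and synthesizing units could be organised roughly as below. This is a minimal sketch under stated assumptions, not the apparatus itself: the class and field names are hypothetical, the positioning model is replaced by a stub `locate` callable, the distance is taken to be Euclidean, and the weight function is injected so that any weighting rule can be plugged in (the usage lines use the squared distance).

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]

@dataclass
class Acquisition:
    """What the acquiring unit hands over (field names are hypothetical)."""
    image: object                            # the first image, format unspecified
    mic_positions: Sequence[Point]           # spatial coordinates of each microphone
    mic_signals: Sequence[Sequence[float]]   # one sample sequence per microphone

class ProcessingUnit:
    def __init__(self, locate: Callable[[object], Point]):
        self.locate = locate  # stands in for the trained positioning model

    def reception_distances(self, acq: Acquisition) -> List[float]:
        mouth = self.locate(acq.image)  # predicted spatial coordinates
        # Euclidean distance from the mouth to each microphone (assumption).
        return [math.dist(mouth, mic) for mic in acq.mic_positions]

class SynthesizingUnit:
    def __init__(self, weight: Callable[[float], float]):
        self.weight = weight  # maps a reception distance to a weight

    def synthesize(self, distances, signals) -> List[float]:
        weights = [self.weight(d) for d in distances]
        n_samples = len(signals[0])
        # Weighted sum across microphones, sample by sample.
        return [sum(w * s[t] for w, s in zip(weights, signals))
                for t in range(n_samples)]

acq = Acquisition(image=None,
                  mic_positions=[(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
                  mic_signals=[[1.0, 2.0], [0.5, 0.25]])
processing = ProcessingUnit(locate=lambda image: (0.0, 0.0, 0.0))  # stub locator
synthesizing = SynthesizingUnit(weight=lambda d: d ** 2)
distances = processing.reception_distances(acq)
speech = synthesizing.synthesize(distances, acq.mic_signals)
```

Injecting `locate` and `weight` keeps the three units independent, so the stub locator can later be swapped for a trained network without touching the synthesis code.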
An embodiment of the present invention provides a medical scanning voice enhancement device comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the medical scanning voice enhancement method described above.
Fig. 7 is a schematic structural diagram of a medical scanning speech enhancement system according to an embodiment of the present invention. Referring to fig. 7, the medical scanning speech enhancement system includes an image scanning device 001, at least one microphone (a first microphone 003, a second microphone 004, through an nth microphone 005), and the medical scanning speech enhancement device 002 described above, where the image scanning device 001 is used for acquiring a first image, the at least one microphone is used for acquiring at least one channel of sound signals, and the medical scanning speech enhancement device 002 is used for implementing the medical scanning speech enhancement method described above.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the medical scanning speech enhancement method as described above.
The invention discloses a medical scanning voice enhancement method, a device, a system and a storage medium. In the medical scanning voice enhancement method, a first image, the space coordinates corresponding to the microphones, and the sound signals corresponding to the microphones are first acquired, so that image, voice, and space coordinate information are combined and the accuracy of the subsequently synthesized voice is ensured. The first image is then input into the positioning model to determine the predicted space coordinates; this reduces the complexity of mouth recognition and improves its accuracy and speed, and because the positioning model locates the mouth directly from the first image, the space coordinates of the examinee's mouth are determined efficiently, which facilitates rapid generation of the synthesized voice signal. Next, the corresponding sound reception distance is determined according to the predicted space coordinates and the space coordinates corresponding to each microphone, which reflects the positional relationship between the examinee's mouth and the microphone. Finally, voice synthesis is performed using the sound signals of the plurality of microphones and the corresponding sound reception distances, so that the synthesized voice signal is determined efficiently: the multi-channel microphone signals are combined, interference signals from non-target directions are suppressed, sound signals from the target direction are enhanced, and the effectiveness of each sound signal is judged from the positional relationship between the microphone and the examinee's mouth, which ensures the accuracy of the synthesized voice signal.
According to the technical scheme of the invention, the image is used to locate the predicted space coordinates of the examinee's mouth, and the distances from the examinee's mouth to the plurality of microphones are then determined in combination with the space coordinate information, giving the corresponding sound reception distances. The sound reception distances indicate how effective the sound signal received by each microphone is, which enables efficient and accurate sound synthesis, avoids the influence of background noise, and allows the voice of the examinee to be recognized accurately. This reduces the communication barrier between the technician and the examinee during scanning and improves the efficiency and convenience of medical scanning.
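The chain from coordinates to synthesized signal can be written out as a short worked example. The weighting and summation follow the formula given in the claims; the Euclidean form of the distance is an assumption (the claims only say the distance is determined from the coordinate difference), and the symbols p and m_n for the mouth and microphone positions, as well as the coordinate values, are introduced here purely for illustration.

```latex
% Distance from the predicted mouth position p = (x_p, y_p, z_p)
% to microphone n at m_n = (x_n, y_n, z_n); Euclidean form is an assumption.
s_n = \sqrt{(x_p - x_n)^2 + (y_p - y_n)^2 + (z_p - z_n)^2},
\qquad w_n = s_n^2,
\qquad K = k_1 w_1 + k_2 w_2 + \cdots + k_n w_n .

% Worked example with hypothetical coordinates (metres):
% p = (0, 0, 0), m_1 = (0.3, 0, 0.4), m_2 = (0, 1, 0)
s_1 = \sqrt{0.09 + 0 + 0.16} = 0.5, \quad w_1 = 0.25; \qquad
s_2 = \sqrt{0 + 1 + 0} = 1, \quad w_2 = 1; \qquad
K = 0.25\,k_1 + k_2 .
```

Under this rule the signal of the more distant microphone is multiplied by the larger weight, and the weights are not normalized, so the overall level of K depends on the geometry.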
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A medical scanning speech enhancement method, comprising:
acquiring a first image, a space coordinate corresponding to at least one microphone and a corresponding sound signal;
inputting the first image into a well-trained positioning model, and determining a predicted space coordinate of a first region of interest of a detected person, wherein the positioning model is obtained by training based on a first image training set;
determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone;
and determining a synthesized voice signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone.
2. The method according to claim 1, wherein the determining a synthesized speech signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone comprises:
determining at least one corresponding sound beamforming wavelength according to the at least one sound reception distance;
and combining and enhancing the sound signals corresponding to the at least one microphone according to the at least one sound beamforming wavelength to determine the synthesized voice signal.
3. The method according to claim 1, wherein the determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone comprises:
determining the corresponding at least one sound reception distance according to the coordinate difference between the predicted space coordinate and the space coordinate corresponding to the at least one microphone.
4. The method according to claim 1, wherein the determining a synthesized speech signal according to the at least one sound reception distance and the sound signal corresponding to the at least one microphone comprises:
determining at least one corresponding sound reception weight according to the at least one sound reception distance;
determining the synthesized speech signal according to the at least one sound reception weight.
5. The medical scanning speech enhancement method of claim 1, wherein the training process of the localization model comprises:
acquiring a first image training set containing labeling information, wherein the labeling information comprises actual space coordinates of a first region of interest of a detected person;
inputting the first image training set into a positioning model, and determining the corresponding prediction space coordinate;
finishing the training of the positioning model according to the error between the actual space coordinate and the predicted space coordinate, and storing the positioning model;
wherein the first training set of images includes a plurality of the first images, the first images being medical images including information of a first region of interest of a subject.
6. The medical scanning speech enhancement method of claim 4, wherein said determining at least one corresponding sound reception weight according to the at least one sound reception distance comprises: determining the at least one corresponding sound reception weight according to the square of the at least one sound reception distance.
7. The medical scanning speech enhancement method according to claim 5, wherein the synthesized speech signal is determined by the following formula:
$$K = k_1 w_1 + k_2 w_2 + \cdots + k_n w_n$$
$$w_n = s_n^2$$
where $K$ is the synthesized speech signal, $k_n$ is the sound signal of the $n$-th microphone, $w_n$ is the sound reception weight corresponding to the $n$-th microphone, $s_n^2$ is the square of the sound reception distance corresponding to the $n$-th microphone, and $n$ is an integer.
8. A medical scanning speech enhancement apparatus, comprising:
the acquisition unit is used for acquiring the first image, the space coordinate corresponding to at least one microphone and the corresponding sound signal;
the processing unit is used for inputting the first image into a well-trained positioning model and determining the predicted space coordinates of a first region of interest of a detected person, wherein the positioning model is obtained by training based on a first image training set; the processing unit is also used for determining at least one sound reception distance according to the predicted space coordinate and the space coordinate corresponding to the at least one microphone;
and the synthesis unit is used for determining a synthesized voice signal according to the at least one sound receiving distance and the sound signal corresponding to the at least one microphone.
9. A medical scanning speech enhancement system comprising an image scanning device for acquiring a first image, at least one microphone for acquiring at least one path of sound signals, and the medical scanning speech enhancement apparatus of claim 8.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a medical scanning speech enhancement method according to any one of claims 1 to 7.
CN202011622711.4A 2020-12-30 2020-12-30 Medical scanning voice enhancement method, device, system and storage medium Pending CN112826446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011622711.4A CN112826446A (en) 2020-12-30 2020-12-30 Medical scanning voice enhancement method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011622711.4A CN112826446A (en) 2020-12-30 2020-12-30 Medical scanning voice enhancement method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN112826446A true CN112826446A (en) 2021-05-25

Family

ID=75924227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011622711.4A Pending CN112826446A (en) 2020-12-30 2020-12-30 Medical scanning voice enhancement method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN112826446A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140144410A (en) * 2013-06-11 2014-12-19 삼성전자주식회사 Beamforming method and apparatus for sound signal
WO2016183791A1 (en) * 2015-05-19 2016-11-24 华为技术有限公司 Voice signal processing method and device
US20170161551A1 (en) * 2015-05-29 2017-06-08 Tencent Technology (Shenzhen) Company Limited Face key point positioning method and terminal
CN109192213A (en) * 2018-08-21 2019-01-11 平安科技(深圳)有限公司 The real-time transfer method of court's trial voice, device, computer equipment and storage medium
US20190045298A1 (en) * 2018-01-12 2019-02-07 Intel Corporation Apparatus and methods for bone conduction context detection
US20190102603A1 (en) * 2017-09-29 2019-04-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for determining image quality
CN109683135A (en) * 2018-12-28 2019-04-26 科大讯飞股份有限公司 A kind of sound localization method and device, target capturing system
WO2020015752A1 (en) * 2018-07-20 2020-01-23 华为技术有限公司 Object attribute identification method, apparatus and system, and computing device
KR20200085041A (en) * 2019-01-04 2020-07-14 순천향대학교 산학협력단 Language rehabilitation based vocal voice evaluation apparatus and method thereof
CN111429451A (en) * 2020-04-15 2020-07-17 深圳市嘉骏实业有限公司 Medical ultrasonic image segmentation method and device

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140144410A (en) * 2013-06-11 2014-12-19 삼성전자주식회사 Beamforming method and apparatus for sound signal
WO2016183791A1 (en) * 2015-05-19 2016-11-24 华为技术有限公司 Voice signal processing method and device
CN107534725A (en) * 2015-05-19 2018-01-02 华为技术有限公司 A kind of audio signal processing method and device
US20170161551A1 (en) * 2015-05-29 2017-06-08 Tencent Technology (Shenzhen) Company Limited Face key point positioning method and terminal
US20190102603A1 (en) * 2017-09-29 2019-04-04 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for determining image quality
US20190045298A1 (en) * 2018-01-12 2019-02-07 Intel Corporation Apparatus and methods for bone conduction context detection
WO2020015752A1 (en) * 2018-07-20 2020-01-23 华为技术有限公司 Object attribute identification method, apparatus and system, and computing device
CN109192213A (en) * 2018-08-21 2019-01-11 平安科技(深圳)有限公司 The real-time transfer method of court's trial voice, device, computer equipment and storage medium
CN109683135A (en) * 2018-12-28 2019-04-26 科大讯飞股份有限公司 A kind of sound localization method and device, target capturing system
KR20200085041A (en) * 2019-01-04 2020-07-14 순천향대학교 산학협력단 Language rehabilitation based vocal voice evaluation apparatus and method thereof
CN111429451A (en) * 2020-04-15 2020-07-17 深圳市嘉骏实业有限公司 Medical ultrasonic image segmentation method and device

Similar Documents

Publication Publication Date Title
CN104106267B (en) Signal enhancing beam forming in augmented reality environment
US8229134B2 (en) Audio camera using microphone arrays for real time capture of audio images and method for jointly processing the audio images with video images
CN102447697B (en) Method and system of semi-private communication in open environments
US8988970B2 (en) Method and system for dereverberation of signals propagating in reverberative environments
CN104582582B (en) Ultrasonic image-forming system memory architecture
US6157403A (en) Apparatus for detecting position of object capable of simultaneously detecting plural objects and detection method therefor
EP3807673A1 (en) Method and apparatus for ultrasound imaging with improved beamforming
Kujawski et al. A deep learning method for grid-free localization and quantification of sound sources
JP6467736B2 (en) Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program
WO2006137732A1 (en) System and method for extracting acoustic signals from signals emitted by a plurality of sources
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN106659473A (en) Ultrasound imaging apparatus
RU2550145C2 (en) Ultrasonic imaging device with adaptive beam former and method for ultrasonic imaging with adaptive beam formation
Nair et al. A fully convolutional neural network for beamforming ultrasound images
CN110444220B (en) Multi-mode remote voice perception method and device
CN110515034B (en) Acoustic signal azimuth angle measurement system and method
CN113543721A (en) Method and system for acquiring composite 3D ultrasound images
Marković et al. Extraction of acoustic sources through the processing of sound field maps in the ray space
Shivappa et al. Role of head pose estimation in speech acquisition from distant microphones
CN116309921A (en) Delay summation acoustic imaging parallel acceleration method based on CUDA technology
CN112826446A (en) Medical scanning voice enhancement method, device, system and storage medium
CN101305924B (en) Computed volume sonography
CN112859000A (en) Sound source positioning method and device
Zhu et al. Speaker localization based on audio-visual bimodal fusion
CN114038452A (en) Voice separation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination