WO2020172828A1 - Sound source separation method, apparatus, and device - Google Patents

Sound source separation method, apparatus, and device

Info

Publication number
WO2020172828A1
WO2020172828A1 PCT/CN2019/076371 CN2019076371W WO2020172828A1 WO 2020172828 A1 WO2020172828 A1 WO 2020172828A1 CN 2019076371 W CN2019076371 W CN 2019076371W WO 2020172828 A1 WO2020172828 A1 WO 2020172828A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
time
sound source
audio
image frame
Prior art date
Application number
PCT/CN2019/076371
Other languages
English (en)
French (fr)
Inventor
尚光双
孙凤宇
陈亮
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/076371
Priority to CN201980006671.XA (CN111868823A)
Publication of WO2020172828A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • This application relates to the field of audio processing, and in particular to a method, device and equipment for separating sound sources.
  • With the traditional speech enhancement technology used by hearing aids, only the mixed sound received by the hearing aid can be enhanced as a whole.
  • However, this method cannot suppress the interference produced by the environmental noise (the sound of non-target sound sources) in the mixed sound.
  • Such interference makes it harder for the hearing impaired to hear the sound of the target sound source clearly; in other words, the environmental noise in the mixed sound limits the speech intelligibility delivered by the hearing aid.
  • Although some improved solutions can relatively suppress the environmental noise and enhance the sound of the target sound source, for the hearing impaired the interference of the environmental noise with the sound of the target sound source still cannot be ignored. Therefore, how to distinguish the sound of the target sound source from the mixed sound requires further study.
  • This application provides a sound source separation method, device, and equipment for distinguishing the sound of the target sound source from the mixed sound.
  • In a first aspect, an embodiment of the present application provides a sound source separation method, including: acquiring a first audio signal, and acquiring at least one image frame corresponding to the first audio signal, the at least one image frame including image information of the target sound source; acquiring, according to the first audio signal and the at least one image frame, the time-frequency distribution information of the target sound source in the first audio signal; and further, acquiring, according to the obtained time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal.
  • When the target sound source emits sound, the image information of the target sound source satisfies certain characteristics, and changes in the sound intensity and the sound frequency bring about changes in the image information. For example, when a person speaks, the image information of the face satisfies certain characteristics; when the person changes pitch or loudness, the image information of the face also changes to a certain extent. Therefore, in the embodiments of the present application, the image information of the target sound source during the duration of the first audio signal is used to obtain the time-frequency distribution information of the target sound source in the first audio signal, which helps to improve the accuracy of the time-frequency distribution information and, in turn, helps to obtain the second audio signal belonging to the target sound source from the first audio signal more accurately.
  • In a possible implementation, obtaining the time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame includes: obtaining a first audio feature of the first audio signal; acquiring a first image frame from the at least one image frame, and identifying a feature region in the first image frame; further, acquiring a first image feature according to the feature region; and using a neural network to process the feature region, the first image feature, and the first audio feature to obtain the time-frequency distribution information.
  • the above method provides a possible implementation manner for obtaining time-frequency distribution information. Since there is a certain correlation between the image information of the target sound source and the audio signal generated by the target sound source, the neural network can be trained to enable the neural network to simulate the correlation. Furthermore, a neural network can be used to process the feature region, the first image feature, and the first audio feature to obtain time-frequency distribution information. In addition, in the process of acquiring the time-frequency distribution information, image features of multiple dimensions including the feature area and the first image feature are used, which is also conducive to improving the accuracy of the time-frequency distribution information.
  • In a possible implementation, using the neural network to process the feature region, the first image feature, and the first audio feature to obtain the time-frequency distribution information includes: using the neural network to process the feature region to obtain a second image feature; using the neural network to process the first audio feature to obtain a second audio feature; further, performing data splicing on the first image feature, the second image feature, and the second audio feature to obtain a spliced feature; using the neural network to process the spliced feature to obtain a fused feature; and using the neural network to process the fused feature to obtain the time-frequency distribution information.
  • the first image frame is any image frame in at least one image frame, or the first image frame is a central image frame in at least one image frame.
  • the center image frame is the image frame corresponding to the intermediate time point within the duration of the first audio signal. It can be understood that the central image frame is more representative than other image frames. Therefore, the time-frequency distribution information acquired based on the central image frame is beneficial to improve the accuracy of the time-frequency distribution information.
  • In a possible implementation, acquiring the first image feature according to the feature region includes: processing the feature region using an active appearance model (AAM) to obtain the first image feature.
  • obtaining the first audio feature of the first audio signal includes: performing time-frequency conversion processing on the first audio signal to obtain the first audio feature.
  • In a possible implementation, the time-frequency distribution information includes a probability value corresponding to each time-frequency unit in the first audio signal, and the probability value is used to indicate the probability that the audio signal generated by the target sound source falls into the time-frequency unit corresponding to that probability value.
  • Based on this, obtaining the second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information includes: obtaining a first audio intensity value of each time-frequency unit in the first audio signal; obtaining a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtaining the second audio signal according to the second audio intensity value of each time-frequency unit.
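  • Expressed as a formula (the symbols below are introduced here only for convenience and do not appear in the application): for the time-frequency unit at time $t$ and frequency $f$,

$$ I_2(t, f) = p(t, f) \cdot I_1(t, f) $$

where $I_1(t, f)$ is the first audio intensity value of the unit, $p(t, f)$ is the probability value corresponding to the unit, and $I_2(t, f)$ is the resulting second audio intensity value from which the second audio signal is reconstructed.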
  • the method further includes: using a speech recognition model to process the second audio signal to obtain the language text carried in the second audio signal information.
  • In a second aspect, an embodiment of the present application provides a sound source separation apparatus, including: an audio acquisition module, configured to acquire a first audio signal; an image acquisition module, configured to acquire at least one image frame corresponding to the first audio signal, the at least one image frame including the image information of the target sound source; and a joint processing module, configured to obtain the time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame, and to obtain the second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information.
  • In a possible implementation, the joint processing module is specifically configured to: obtain the first audio feature of the first audio signal; obtain the first image frame from the at least one image frame; identify the feature region in the first image frame; acquire the first image feature according to the feature region; and use the neural network to process the feature region, the first image feature, and the first audio feature to obtain the time-frequency distribution information.
  • the first image frame is any image frame in at least one image frame, or the first image frame is a central image frame in at least one image frame.
  • In a possible implementation, the joint processing module is specifically configured to process the feature region using an active appearance model (AAM) to obtain the first image feature.
  • the joint processing module is specifically configured to: perform time-frequency conversion processing on the first audio signal to obtain the first audio feature.
  • In a possible implementation, the time-frequency distribution information includes a probability value corresponding to each time-frequency unit in the first audio signal; the probability value is used to indicate the probability that the audio signal generated by the target sound source exists in that time-frequency unit.
  • The joint processing module is specifically configured to: obtain the first audio intensity value of each time-frequency unit in the first audio signal; obtain the second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtain the second audio signal according to the second audio intensity value of each time-frequency unit.
  • a voice recognition module is further included; a voice recognition module is used to process the second audio signal using the voice recognition model to obtain the language text information carried in the second audio signal.
  • In a third aspect, an embodiment of the present application provides a sound source separation apparatus, including a processor and a memory, where the memory is used to store program instructions, and the processor is used to run the program instructions to cause the sound source separation apparatus to perform the method of any one of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a sound source separation device, including the sound source separation apparatus provided in the third aspect, and an audio collector and/or a video collector, where the audio collector is used to collect the first audio signal, and the video collector is used to collect a first video signal carrying the at least one image frame.
  • a speaker is further included; the speaker is used to convert the second audio signal into external sound.
  • a display is further included; the display is used to display text information recognized from the second audio signal.
  • In a possible implementation, a transceiver is further included; the transceiver is used to receive the first audio signal, and/or receive the first video signal, and/or send the second audio signal, and/or send the text information recognized from the second audio signal.
  • the embodiments of the present application also provide a computer-readable storage medium, which stores instructions in the computer-readable storage medium, which when run on a computer, causes the computer to execute the methods of the above aspects.
  • the embodiments of the present application also provide a computer program product including instructions, which when run on a computer, cause the computer to execute the methods of the foregoing aspects.
  • Figure 1 is a schematic diagram of a sound source separation device provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of a possible sound source separation method provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a first audio signal provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of an image frame provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of time-frequency distribution information provided by an embodiment of this application.
  • FIG. 6 is a spectrogram corresponding to a first audio signal provided by an embodiment of this application.
  • FIG. 7 is a schematic flowchart of a method for obtaining time-frequency distribution information according to an embodiment of this application.
  • FIG. 8 is a schematic diagram of a neural network structure provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of a sound source separation device provided by an embodiment of the application.
  • The embodiments of the present application provide a sound source separation method, which is suitable for a sound source separation device. The device can be a chip, circuit board, or chipset in audio processing equipment such as hearing aids and voice recorders and can run the necessary software; the device can also be an independent audio processing device.
  • In the embodiments of the present application, the mixed audio signal obtained from the mixed sound and the image frames corresponding to the mixed audio signal are jointly processed to separate the audio signal belonging to the target sound source from the mixed audio signal; the sound of the target sound source can then be distinguished from the mixed sound according to the audio signal belonging to the target sound source, so the interference of environmental noise with the sound of the target sound source can be suppressed.
  • Fig. 1 exemplarily shows a sound source separation device to which an embodiment of the present application is applicable.
  • As shown in FIG. 1, the device 100 includes a sound source separation device 101 and, in a possible implementation manner, may also include an audio collector 102 and a video collector 103.
  • the audio collector 102 may be a microphone, which can convert the collected mixed sound into a mixed audio signal and store it.
  • the video collector 103 may be a camera, which can capture the image information of the target sound source, and store the collected image information in the form of a video signal.
  • the sound source separation device 101 includes a processor 1011 and a memory 1012.
  • the sound source separation device 101 may further include a bus 1013.
  • the processor 1011 and the memory 1012 may be connected to each other through a bus 1013; the bus 1013 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus 1013 can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 1, but it does not mean that there is only one bus or one type of bus.
  • the processor 1011 may include a CPU, a microprocessor, or may further include an ASIC, or one or more integrated circuits for controlling the execution of the program of the present application.
  • The memory 1012 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and so on), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 1012 may exist independently and is connected to the processor 1011 through a bus 1013.
  • the memory 1012 may also be integrated with the processor 1011.
  • the memory 1012 is used to store computer-executable instructions for executing the technical solutions provided in the embodiments of the present application, and the processor 1011 controls the execution.
  • the processor 1011 is configured to execute computer-executable instructions stored in the memory 1012, so as to implement the sound source separation method provided in the embodiment of the present application based on the mixed audio signal stored by the audio collector 102 and the video signal stored by the video collector 103.
  • the device 100 may also include other functional devices.
  • the device 100 can be a hearing aid, and the device 100 can also include a speaker 104.
  • the speaker 104 can convert the audio signal belonging to the target sound source acquired by the sound source separation device 101 into external sound and play it to the hearing impaired. It is beneficial to shield the interference of environmental noise to the sound of the target sound source and improve the language understanding of the hearing aid.
  • the device 100 may further include a display 105, which may be used to display the language text information carried in the audio signal belonging to the target sound source, which is beneficial to further improve the language understanding of the hearing aid.
  • In a possible implementation, the device 100 may further include a transceiver 106, which can support Wi-Fi, Bluetooth, and other transmission modes.
  • The transceiver 106 can send the audio signal belonging to the target sound source and/or the language text information carried in the audio signal belonging to the target sound source, for example, send the language text information carried in the audio signal belonging to the target sound source to a terminal device such as a mobile phone or tablet computer, so that the user can read the language text information on the display interface of the terminal device.
  • the transceiver 106 may also receive a mixed audio signal sent by other devices, and/or an image frame corresponding to the mixed audio signal.
  • For example, the transceiver 106 may receive a mixed audio signal collected by a terminal device such as a mobile phone or tablet computer and the image frames corresponding to the mixed audio signal, and the sound source separation device 101 may separate the audio signal belonging to the target sound source from the mixed audio signal using the sound source separation method provided in the embodiments of the application.
  • FIG. 2 is a schematic flow chart of a possible sound source separation method provided by an embodiment of the application. The method can be applied to the sound source separation device 101 shown in FIG. 1 and, as shown in FIG. 2, mainly includes the following steps:
  • the audio collector 102 collects the mixed sound, converts the mixed sound into a mixed audio signal in the form of a digital signal, and stores the mixed audio signal.
  • For example, if there are sound sources A, B, and C, and sound source A emits sound 1, sound source B emits sound 2, and sound source C emits sound 3, the mixed sound collected by the audio collector 102 includes sound 1, sound 2, and sound 3.
  • the audio collector 102 converts the mixed sound into a mixed audio signal in the form of a digital signal, and then stores the obtained mixed audio signal.
  • the sound source separation device 101 may obtain all or part of the mixed audio signal in the mixed audio signal stored in the audio collector 102 at a certain time interval T, and the obtained mixed audio signal is the first audio signal.
  • For example, FIG. 3 shows n first audio signals, namely signals S1 to Sn, that the sound source separation device 101 obtains in sequence from the mixed audio signal stored by the audio collector 102. Each first audio signal has the same duration, and the duration may be the same as the time interval T at which the sound source separation device 101 obtains the first audio signals, that is, the duration of each first audio signal is T.
  • When the n first audio signals in FIG. 3 belong to the same continuous mixed audio signal, there may be partial time-domain overlap between adjacent first audio signals, as shown by S1, S2, and S3 in FIG. 3.
  • The sound source separation method provided in the embodiments of the present application is applicable to any of the first audio signals; for ease of description, the embodiments of the present application take one first audio signal as an example.
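  • A minimal sketch of this segmentation, assuming the stored mixed audio signal is available as a NumPy array sampled at fs; the duration T and the overlap ratio are illustrative values, since the application does not fix them numerically:

```python
import numpy as np

def segment_mixed_audio(mixed_audio, fs, T=1.0, overlap=0.25):
    """Split the stored mixed audio signal into first audio signals of duration T
    seconds; adjacent segments may partially overlap in the time domain, as with
    S1, S2, and S3 in FIG. 3. T and overlap are illustrative values."""
    seg_len = int(T * fs)                         # samples per first audio signal
    hop = max(1, int(seg_len * (1.0 - overlap)))  # step between segment start points
    segments = []
    for start in range(0, len(mixed_audio) - seg_len + 1, hop):
        segments.append(mixed_audio[start:start + seg_len])
    return segments                               # [S1, S2, ..., Sn]
```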
  • the sound source separation device 101 acquires at least one image frame corresponding to the first audio signal.
  • the video collector 103 collects the image information of the target sound source, converts the collected image information of the target sound source into a video signal in the form of a data signal, and stores the video signal.
  • As in the example of S201, sound source A is the target sound source, so the video collector 103 collects the image information of sound source A, converts the image information of sound source A into a video signal in the form of a digital signal, and stores the video signal.
  • The sound source separation device 101 may also obtain the video signal stored by the video collector 103 at the time interval T, and parse the obtained video signal to extract the at least one image frame carried by the video signal, that is, the at least one image frame corresponding to the first audio signal.
  • As shown in FIG. 4, a corresponding image frame includes the image information of the target sound source, a person.
  • the target sound source may also be other objects such as musical instruments, machinery or animals.
  • the sound source separation device 101 acquires time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and at least one image frame.
  • the time-frequency distribution information of the target sound source in the first audio signal may indicate the time-frequency distribution of the audio signal corresponding to the sound generated by the target sound source in the first audio signal. Therefore, by executing S204 based on the obtained time-frequency distribution information, the second audio signal belonging to the target sound source can be obtained from the first audio signal, thereby achieving sound source separation.
  • FIG. 5 is a schematic diagram of time-frequency distribution information provided by an embodiment of this application. It should be understood that FIG. 5 shows the time-frequency distribution information in the form of a spectrogram only for ease of explanation; in the actual calculation process, the time-frequency distribution information may be a series of numerical values. As shown in FIG. 5, the length (horizontal axis) of the time-frequency distribution information is the time axis, and the width (vertical axis) is the frequency axis; one small square in FIG. 5 represents one time-frequency unit. In a possible implementation, each time-frequency unit also corresponds to a probability value. As shown in FIG. 5, the probability value corresponding to time-frequency unit a is 0.8, which means that the probability that the audio signal generated by the target sound source exists in time-frequency unit a is 0.8.
  • the first audio signal can be represented by a spectrogram, as shown in FIG. 6.
  • each time-frequency unit corresponds to a first audio intensity value, which represents the audio intensity of the first audio signal on the time-frequency unit.
  • the first audio intensity value corresponding to the time-frequency unit a is 100, which means that the audio intensity of the first audio signal on the time-frequency unit a is 100.
  • Based on the time-frequency distribution information shown in FIG. 5 and the first audio signal shown in FIG. 6, the sound source separation device 101 can obtain the second audio intensity value of each time-frequency unit from the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit.
  • the product 80 between the first audio intensity value 100 of the time-frequency unit a and the corresponding probability value 0.8 can be used as the second audio intensity value of the time-frequency unit a, and other time-frequency units are the same.
  • the sound source separation device 101 obtains the second audio intensity value of each time-frequency unit, and then obtains the second audio signal belonging to the target sound source.
  • Generally, the sound source separation device 101 can obtain the second audio signal through an inverse time-frequency transform according to the second audio intensity values of the multiple time-frequency units, and the audio intensity value of each time-frequency unit in the obtained second audio signal can be the above-mentioned second audio intensity value.
  • Taking time-frequency unit a in FIG. 5 and FIG. 6 as an example, the second audio intensity value of time-frequency unit a is 80; after the inverse time-frequency transform is applied over all the time-frequency units, the obtained second audio signal has an audio intensity value of 80 on time-frequency unit a.
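  • A minimal sketch of this masking and reconstruction step, assuming the time-frequency distribution information is available as a matrix of per-unit probability values with the same shape as the STFT of the first audio signal; the STFT window parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_target(first_audio, prob_mask, fs, nperseg=512):
    """Multiply the first audio intensity value of every time-frequency unit by the
    probability value of that unit (e.g. 100 * 0.8 = 80 for unit a), then apply an
    inverse time-frequency transform to obtain the second audio signal."""
    _, _, Z = stft(first_audio, fs=fs, nperseg=nperseg)   # complex spectrogram
    magnitude = np.abs(Z)                                 # first audio intensity values
    phase = np.exp(1j * np.angle(Z))                      # reuse the original phase
    second_intensity = magnitude * prob_mask              # second audio intensity values
    _, second_audio = istft(second_intensity * phase, fs=fs, nperseg=nperseg)
    return second_audio
```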
  • Generally, when the target sound source emits sound, the image information of the target sound source satisfies certain characteristics, and changes in the utterance intensity and vocal frequency will bring about changes in the image information. For example, when a person speaks, the image information of the face will satisfy certain characteristics; when the person changes the pitch or the sound level, the image information of the face will also change to a certain extent. Therefore, in the embodiments of the present application, the image information of the target sound source during the duration of the first audio signal is used to obtain the time-frequency distribution information of the target sound source in the first audio signal, which helps to improve the accuracy of the time-frequency distribution information and, in turn, helps to obtain the second audio signal belonging to the target sound source from the first audio signal more accurately.
  • FIG. 7 is a schematic flowchart of a method for obtaining time-frequency distribution information according to an embodiment of the application. As shown in FIG. 7, it mainly includes the following steps:
  • the sound source separation device 101 acquires the first audio feature of the first audio signal.
  • For example, the first audio signal can be processed by a time-frequency transform to obtain the first audio feature: a Fourier transform (FT) can be used to process the first audio signal to obtain the first audio feature, or a short-time Fourier transform (STFT) can be used to process the first audio signal to obtain the first audio feature, and so on.
  • The STFT is a commonly used time-frequency analysis method, which converts the time-domain first audio signal into the first audio feature through a fixed conversion formula.
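  • A minimal sketch of computing the first audio feature with an STFT, using SciPy; the window length, overlap, and the log compression of the magnitude are illustrative assumptions rather than values stated in the application:

```python
import numpy as np
from scipy.signal import stft

def first_audio_feature(first_audio, fs, nperseg=512, noverlap=384):
    """Compute the first audio feature with a short-time Fourier transform: each
    column is one time frame and each row one frequency bin, giving one value per
    time-frequency unit. The log compression of the magnitude is an extra,
    assumed normalization step, not something stated in the application."""
    _, _, Z = stft(first_audio, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log1p(np.abs(Z))
```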
  • the sound source separation device 101 obtains a first image frame, and obtains a characteristic region based on the first image frame.
  • the process may include: the sound source separation device 101 obtains a first image frame from at least one image frame.
  • In the embodiments of the present application, if there is only one image frame corresponding to the first audio signal, that image frame is the first image frame.
  • the first image frame may be any image frame among the multiple image frames, or may be the center image frame among the multiple image frames.
  • the central image frame can be understood as the image frame located at the intermediate time point among the multiple image frames.
  • As in the above example, the at least one image frame includes the image information of sound source A within the duration T; when there are multiple image frames, the first image frame may be the image frame corresponding to the intermediate time point of the duration T.
  • That first image frame includes the image information of sound source A at the intermediate time point of the duration T. It can be understood that the center image frame is more representative, so the time-frequency distribution information obtained based on the center image frame will be more accurate.
  • In a possible implementation, by properly setting the duration T at which the sound source separation device 101 obtains the first audio signal, the first audio signal can be made to correspond to exactly one image frame, and that image frame is the center image frame corresponding to the intermediate time point within the duration T.
  • In this case, the processing of the video signal can be simplified.
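  • A minimal sketch of this frame selection, assuming the image frames covering the duration T are available as a Python list in time order:

```python
def pick_first_image_frame(image_frames):
    """Select the first image frame from the image frames covering duration T.
    Any frame may be used, but the center image frame (the frame at the middle
    time point of T) is chosen here because it is more representative."""
    if not image_frames:
        raise ValueError("no image frame corresponds to the first audio signal")
    return image_frames[len(image_frames) // 2]  # a single frame is its own center
```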
  • the process includes: the sound source separation device 101 further recognizes the characteristic region in the first image frame based on the obtained first image frame.
  • the selection of the characteristic area is related to the type of the target sound source, and is usually an area where a certain amount of image information changes when the target sound source emits sound.
  • the target sound source is a person, so the first image frame includes image information of the person. Since the vocalization of a person is mainly related to the image information of the human face, the characteristic area in the first image frame, that is, the area of the human face in the first image frame, can be identified through image processing algorithms such as face recognition. Or, other image recognition algorithms can be used to identify the area corresponding to the target sound source as the target area.
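  • A minimal sketch of identifying the feature region, using a stock OpenCV face detector purely as an illustrative choice; the application only requires some face-recognition style image processing algorithm and does not name a specific detector:

```python
import cv2

def find_feature_region(first_image_frame):
    """Identify the feature region (the face area) in the first image frame with a
    stock Haar-cascade face detector; the crop of the detected area is returned."""
    gray = cv2.cvtColor(first_image_frame, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                              # target sound source not visible
    x, y, w, h = faces[0]                        # take the first detected face
    return first_image_frame[y:y + h, x:x + w]
```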
  • the sound source separation device 101 acquires the first image feature according to the feature area.
  • the feature area can be processed by an active appearance model (AAM) obtained by pre-training to obtain the first image feature.
  • AAM is a feature point extraction method widely used in the field of pattern recognition. It not only considers local feature information, but also comprehensively considers global shape and texture information. It builds a face model through statistical analysis of face shape features and texture features. It can also be considered that AAM uses several key points to describe a face, and the final first image feature includes some coordinate information of these key points.
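  • A minimal sketch of turning such key points into the first image feature; `fit_aam_keypoints` is a hypothetical callable standing in for a pre-trained AAM fit (not an API named in the application) that is assumed to return the key-point coordinates for the face region:

```python
import numpy as np

def first_image_feature(feature_region, fit_aam_keypoints):
    """Flatten the AAM key-point coordinates of the feature region into the first
    image feature. `fit_aam_keypoints` is a hypothetical stand-in for a
    pre-trained AAM fit; it is assumed to return an (N, 2) array of (x, y)
    key-point coordinates."""
    keypoints = fit_aam_keypoints(feature_region)            # shape (N, 2)
    height, width = feature_region.shape[:2]
    normalized = keypoints / np.array([width, height], dtype=np.float32)
    return normalized.reshape(-1)                            # 2N-dimensional feature
```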
  • the sound source separation device 101 uses a neural network to process the characteristic region, the first image characteristic, and the first audio characteristic to obtain time-frequency distribution information.
  • the neural network is pre-trained.
  • the neural network is trained using known sample audio, the characteristic area of the sample sound source and the image characteristics of the sample sound source, and the time-frequency distribution information of the sample sound source in the sample audio.
  • the image feature of the sample sound source may be the image feature obtained after the image frame corresponding to the sample audio is processed by the AAM algorithm.
  • After multiple rounds of training, once some variables in the neural network, such as the weight values, are determined, the neural network has the function of analyzing or calculating the time-frequency distribution information.
  • The neural network processes the output information obtained from S701 to S703 to obtain the time-frequency distribution information.
  • the neural network can be a convolutional neural network (Convolutional Neural Networks, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a long/short term memory network (Long/Short Term Memory, LSTM), etc. It can be a combination of multiple network types, and the specific network type and structure can be adjusted according to actual effects.
  • FIG. 8 is a schematic diagram of a neural network structure provided by an embodiment of the application.
  • the neural network has a two-tower structure, which mainly includes an image stream tower, an audio stream tower, a fully connected layer, and a decoding layer.
  • the process of the sound source separation device 101 acquiring time-frequency distribution information mainly includes the following steps:
  • Step 1 Use the image stream tower in the neural network to process the feature area to obtain the second image feature.
  • the processing may include operations such as convolution, pooling, direct connection of residuals, and batch normalization.
  • In a possible implementation manner, the image stream tower can use the network shown in Table 1 below:

Table 1
Layer  Convolution units (filters)  Convolution kernel
1      128                          5×5
2      128                          5×5
3      256                          3×3
4      256                          3×3
5      512                          3×3
6      512                          3×3

  • As shown in Table 1, the image stream tower includes 6 convolutional layers. The layers can have the same convolution kernel or different convolution kernels; that is, the sound source separation device 101 can use the image stream tower to perform 6 layers of convolution processing on the data of the feature region, and the number of convolution units and the kernel size of each layer are as shown in the table. For example, layer 1 has 128 convolution units, and the convolution kernel size used is 5×5; the other layers are similar and are not repeated here.
  • As shown in Table 1, there is one 2× downsampling between layer 2 and layer 3 of the image stream tower, which doubles the number of convolution units; similarly, there is also one 2× downsampling between layer 4 and layer 5. In addition, each convolutional layer can also be followed by a batch normalization (BN) layer and a leaky rectified linear unit (Leaky ReLU), that is, after the sound source separation device 101 completes a convolution operation with each convolutional layer, it can also use the BN layer to perform batch normalization on the data obtained by the convolution, and use the Leaky ReLU unit to rectify the data obtained by the batch normalization. In addition, a certain amount of random deactivation (dropout) is applied between adjacent convolutional layers to prevent overfitting.
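  • A minimal PyTorch sketch of an image stream tower following Table 1; the input channel count, dropout rate, padding, and realizing the 2× downsampling as stride-2 convolutions at layers 3 and 5 are illustrative assumptions, not details fixed by the application:

```python
import torch.nn as nn

class ImageStreamTower(nn.Module):
    """Image stream tower following Table 1: six convolutional layers with
    128/128/256/256/512/512 filters and 5x5/5x5/3x3/3x3/3x3/3x3 kernels, each
    followed by batch normalization (BN) and a Leaky ReLU unit, with dropout
    between layers. The input channel count, dropout rate, padding, and the
    stride-2 placement of the 2x downsampling are assumptions."""

    def __init__(self, in_channels=3, dropout=0.2):
        super().__init__()
        spec = [  # (filters, kernel, stride); stride 2 marks the 2x downsampling
            (128, 5, 1), (128, 5, 1),
            (256, 3, 2), (256, 3, 1),
            (512, 3, 2), (512, 3, 1),
        ]
        layers, channels = [], in_channels
        for filters, kernel, stride in spec:
            layers += [
                nn.Conv2d(channels, filters, kernel, stride=stride, padding=kernel // 2),
                nn.BatchNorm2d(filters),
                nn.LeakyReLU(0.1),
                nn.Dropout2d(dropout),
            ]
            channels = filters
        self.tower = nn.Sequential(*layers)

    def forward(self, feature_region):
        # feature_region: (batch, in_channels, height, width) crop of the face area
        return self.tower(feature_region)            # second image feature map
```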
  • Step 2 The sound source separation device 101 uses the audio stream tower in the neural network to process the first audio feature to obtain the second audio feature.
  • operations such as convolution, pooling, residual direct connection, batch standardization, etc. may also be included.
  • the audio stream tower may use the network shown in Table 2 below:
Table 2
Layer  Convolution units (filters)  Convolution kernel
1      64                           2×2
2      64                           1×1
3      128                          2×2
4      128                          2×1
5      128                          2×1
  • As shown in Table 2, the audio stream tower includes 5 convolutional layers. The layers can have the same convolution kernel or different convolution kernels; that is, the sound source separation device 101 can use the audio stream tower to perform 5 layers of convolution processing on the data of the first audio feature. For example, layer 1 has 64 convolution units, and the convolution kernel size used is 2×2; the other layers are similar and are not repeated here.
  • each convolutional layer of the audio stream tower in the embodiment of this application may also include a BN layer and a leaky ReLU layer, which will not be repeated in this embodiment of the application.
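  • A minimal PyTorch sketch of an audio stream tower following Table 2; treating the first audio feature as a single-channel (frequency × time) map and using unit strides are illustrative assumptions:

```python
import torch.nn as nn

class AudioStreamTower(nn.Module):
    """Audio stream tower following Table 2: five convolutional layers with
    64/64/128/128/128 filters and 2x2/1x1/2x2/2x1/2x1 kernels, each followed by a
    BN layer and a Leaky ReLU unit. The single input channel and unit strides
    are assumptions."""

    def __init__(self, in_channels=1):
        super().__init__()
        spec = [(64, (2, 2)), (64, (1, 1)), (128, (2, 2)), (128, (2, 1)), (128, (2, 1))]
        layers, channels = [], in_channels
        for filters, kernel in spec:
            layers += [
                nn.Conv2d(channels, filters, kernel),
                nn.BatchNorm2d(filters),
                nn.LeakyReLU(0.1),
            ]
            channels = filters
        self.tower = nn.Sequential(*layers)

    def forward(self, first_audio_feature):
        # first_audio_feature: (batch, 1, frequency_bins, time_frames)
        return self.tower(first_audio_feature)       # second audio feature map
```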
  • Step 3: The sound source separation device 101 performs feature splicing on the first image feature of the first image frame, the second image feature, and the second audio feature to obtain the spliced feature.
  • For example, the sound source separation device 101 can concatenate the data of the above three features end to end to complete the feature splicing, and use the concatenated data as the spliced feature.
  • Step 4 The sound source separation device 101 uses the fully connected layer in the neural network to process the splicing feature to obtain the fusion feature.
  • Step 5 The sound source separation device 101 uses the decoding layer in the neural network to process the fusion features to obtain time-frequency distribution information.
  • the decoding layer is the mirror network of the audio stream tower, which is equivalent to the inverse operation of the audio stream tower.
  • the sound source separation device 101 splices the second audio feature, the second image feature, and the first image feature together to obtain a 5328-dimensional splicing feature.
  • the sound source separation device 101 uses a three-layer fully connected layer to process the splicing feature, and can obtain a fusion feature including 3200 data values. After that, the sound source separation device 101 uses the decoding layer to process the fusion features to obtain time-frequency distribution information.
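  • A minimal PyTorch sketch of steps 3 to 5 (splicing, fully connected fusion, and decoding); the hidden layer widths, the reshaping of the 3200 fusion values into a (frequency, time) grid, and the sigmoid output are illustrative assumptions, since the application only states the 5328-dimensional spliced feature, the three fully connected layers producing 3200 values, and a decoding layer that mirrors the audio stream tower:

```python
import torch
import torch.nn as nn

class FusionAndDecode(nn.Module):
    """Splice the first image feature, second image feature, and second audio
    feature end to end, fuse the spliced feature with three fully connected layers
    into 3200 values, and decode the result into one probability value per
    time-frequency unit. Hidden widths, the (frequency, time) reshape, and the
    sigmoid output are illustrative assumptions."""

    def __init__(self, spliced_dim=5328, fused_dim=3200, freq_bins=64, time_frames=50):
        super().__init__()
        assert fused_dim == freq_bins * time_frames
        self.fuse = nn.Sequential(                  # three fully connected layers
            nn.Linear(spliced_dim, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, 3584), nn.LeakyReLU(0.1),
            nn.Linear(3584, fused_dim), nn.LeakyReLU(0.1),
        )
        self.freq_bins, self.time_frames = freq_bins, time_frames
        self.decode = nn.Sequential(                # stand-in for the mirrored audio tower
            nn.ConvTranspose2d(1, 64, (2, 1)), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 1, (2, 1)), nn.Sigmoid(),
        )

    def forward(self, first_image_feat, second_image_feat, second_audio_feat):
        spliced = torch.cat(                        # end-to-end feature splicing
            [first_image_feat.flatten(1), second_image_feat.flatten(1),
             second_audio_feat.flatten(1)], dim=1)
        fused = self.fuse(spliced)                  # 3200-value fusion feature
        grid = fused.view(-1, 1, self.freq_bins, self.time_frames)
        return self.decode(grid)                    # probability per time-frequency unit
```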
  • It should be understood that FIG. 8 is only one feasible specific example, and there are many variants of the neural network shown in FIG. 8. For example, the number of network layers or the number of convolution kernels of the image stream tower and the audio stream tower can be changed; the number of layers, the number of nodes, or the connection mode of the fully connected layer can be changed; or other network modules can be added, for example, a fully connected layer can be connected after the first image feature, and after the first image feature is processed by that fully connected layer, the processed result is spliced with the second image feature and the second audio feature. These variants are not listed one by one in the embodiments of the present application.
  • In the above process, the sound source separation device 101 correlates and fuses, through a carefully designed neural network, the first audio feature of the first audio signal, the first image feature of the first image frame (obtained by the AAM), and the second image feature (obtained by the image stream tower). Under the guidance of the features of each level of the image frame, the sound source separation device 101 selectively retains the parts of the first audio signal that are strongly related to the first image feature and the second image feature, and discards the irrelevant parts.
  • In other words, this solution not only uses audio and image frames at the same time, but also uses multiple image features of different levels in the image frame, and fuses these image features of different levels in designated steps, which improves the accuracy of the time-frequency distribution information obtained by the sound source separation device 101.
  • the sound source separation device 101 may further process the second audio signal after acquiring the second audio signal of the target sound source.
  • the sound source separation device 101 may use a voice recognition model to process the second audio signal to obtain the language text information carried in the second audio signal. This process may also be called semantic recognition.
  • the speech recognition model is obtained by training based on known multiple third audio signals and language text information corresponding to these third audio signals. After the sound source separation device 101 obtains the language text information carried in the second audio signal, it can also send the language text information through the transceiver 106, and can also control the display 105 to display the language text information and so on.
  • a certain number of third audio signals are obtained according to the process shown in FIG. 2. That is to say, when training the speech recognition model, first obtain the mixed audio including the third audio signal and the language text information corresponding to the third audio signal, and obtain the third audio signal from the mixed audio according to the process shown in FIG. 2 . After that, the speech recognition model is trained according to the third audio signal and the language text information of the third audio signal obtained through the process shown in FIG. 2. In other words, part of the training data of the speech recognition model comes from the method mentioned before in the embodiment of this application.
  • the speech recognition model trained by the above method can be more adapted to the aforementioned sound source separation method, thereby improving the accuracy of the recognition result of the second audio signal.
  • In addition, the sound source separation device may also perform targeted processing on the second audio signal according to the specific application scenario. For example, when the sound source separation device is used in a hearing aid, it can also adapt different frequency bands of the second audio signal to the hearing loss of the hearing-impaired user.
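  • A minimal sketch of such frequency-band adaptation; the band edges and gains below form a hypothetical example profile (a real hearing aid would derive them from the user's audiogram), and the STFT-based implementation is only one possible way to apply per-band gains:

```python
import numpy as np
from scipy.signal import stft, istft

def adapt_to_hearing_profile(second_audio, fs, band_gain_db, nperseg=512):
    """Apply a per-frequency-band gain to the second audio signal. `band_gain_db`
    maps an upper band edge in Hz to a gain in dB and is a hypothetical example
    profile, not a format defined by the application."""
    freqs, _, Z = stft(second_audio, fs=fs, nperseg=nperseg)
    gains = np.ones_like(freqs)
    lower_edge = 0.0
    for upper_edge, gain_db in sorted(band_gain_db.items()):
        in_band = (freqs >= lower_edge) & (freqs < upper_edge)
        gains[in_band] = 10.0 ** (gain_db / 20.0)
        lower_edge = upper_edge
    _, adapted = istft(Z * gains[:, None], fs=fs, nperseg=nperseg)
    return adapted

# Hypothetical usage: boost higher bands more strongly than lower ones.
# adapted = adapt_to_hearing_profile(second_audio, fs=16000,
#                                    band_gain_db={500: 0.0, 2000: 6.0, 8000: 12.0})
```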
  • the sound source separation device may include corresponding hardware structures and/or software modules for performing various functions.
  • the embodiments of the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • FIG. 9 shows a possible exemplary block diagram of a sound source separation device in an embodiment of the present application.
  • the device 900 or at least one module thereof may exist in the form of software, or in the form of hardware, or in the form of software and hardware.
  • the software can run on various types of processors, including but not limited to a central processing unit (CPU), a microprocessor, a microcontroller, a digital signal processor (DSP), or a neural processor.
  • the hardware can be a semiconductor chip, a chipset or a circuit board in the sound source separation device, and can selectively execute software to work. For example, it can include a CPU, DSP, and application specific integrated circuits (ASIC).
  • It can also be a field programmable gate array (FPGA) or other programmable logic device, transistor logic device, hardware component, or any combination thereof.
  • the apparatus 900 may implement or execute various exemplary logical blocks described in conjunction with the disclosure of the method embodiments of the present application.
  • the processor may also be a combination for realizing computing functions, for example, including a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
  • the apparatus 900 may include: an audio acquisition module 901, an image acquisition module 902, and a joint processing module 903.
  • the audio acquisition module 901 is configured to acquire a first audio signal
  • the image acquisition module 902 is configured to acquire at least one image frame corresponding to the first audio signal
  • the at least one image frame includes the image information of the target sound source
  • the joint processing module 903 is specifically configured to: obtain the first audio feature of the first audio signal; obtain the first image frame from at least one image frame; identify the feature area in the first image frame; Acquire the first image feature according to the feature area; use the neural network to process the feature area, the first image feature, and the first audio feature to obtain time-frequency distribution information.
  • the first image frame is any image frame in at least one image frame, or the first image frame is a central image frame in at least one image frame.
  • In a possible implementation, the joint processing module 903 is specifically configured to process the feature region using an active appearance model (AAM) to obtain the first image feature.
  • the joint processing module 903 is specifically configured to: perform time-frequency transformation processing on the first audio signal to obtain the first audio feature.
  • the time-frequency distribution information includes the probability value corresponding to each time-frequency unit in the first audio signal; the probability value is used to indicate the probability that the audio signal generated by the target sound source falls into the time-frequency unit;
  • The joint processing module 903 is specifically configured to: obtain the first audio intensity value of each time-frequency unit in the first audio signal; obtain the second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtain the second audio signal according to the second audio intensity value of each time-frequency unit.
  • a voice recognition module 904 is further included; a voice recognition module 904 is configured to process the second audio signal using a voice recognition model to obtain the language text information carried in the second audio signal.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program code, which is not specifically limited in the embodiments of the present application.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented by software, they can be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
  • The various illustrative hardware logic units and circuits described in the embodiments of this application can be implemented by general-purpose processors, digital signal processors, application specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or any combination of the above designed to implement or perform the described functions.
  • the general-purpose processor may be a microprocessor, and optionally, the general-purpose processor may also be any traditional processor, controller, microcontroller, or state machine.
  • the processor can also be implemented by a combination of computing devices, for example, a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
  • the steps of the method or algorithm described in the embodiments of the present application can be directly embedded in hardware, a software unit executed by a processor, or a combination of the two.
  • the software unit can be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM or any other storage medium in the field.
  • the storage medium may be connected to the processor, so that the processor can read information from the storage medium, and can store and write information to the storage medium.
  • the storage medium may also be integrated into the processor.
  • the processor and the storage medium can be arranged in an ASIC, and the ASIC can be arranged in a terminal device.
  • the processor and the storage medium may also be arranged in different components in the terminal device.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Studio Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A sound source separation method, apparatus, and device. The method includes: acquiring a first audio signal (S201), and acquiring at least one image frame corresponding to the first audio signal (S202), the at least one image frame including image information of a target sound source; acquiring, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal (S203); and further, acquiring, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal (S204). The method can relatively accurately obtain the second audio signal belonging to the target sound source from the first audio signal.

Description

Sound source separation method, apparatus, and device
Technical Field
This application relates to the field of audio processing, and in particular to a sound source separation method, apparatus, and device.
Background
In a noisy environment, the ambient sound is usually a mixed sound formed by mixing the sounds emitted by multiple sound sources. How to process such mixed sound is one of the important research directions in the field of audio processing, and it helps to improve the performance of audio processing devices such as hearing aids, voice recorders, and loudspeakers.
Taking a hearing aid as an example, with the traditional speech enhancement technology commonly used in hearing aids, only the mixed sound received by the hearing aid can be enhanced as a whole. However, this method cannot suppress the interference produced by the environmental noise (the sound of non-target sound sources) in the mixed sound, which makes it harder for the hearing impaired to hear the sound of the target sound source clearly; in other words, the environmental noise in the mixed sound limits the speech intelligibility delivered by the hearing aid. Although some improved solutions can relatively suppress the environmental noise and enhance the sound of the target sound source, for the hearing impaired the interference of the environmental noise with the sound of the target sound source still cannot be ignored. Therefore, how to distinguish the sound of the target sound source from the mixed sound requires further study.
Summary
This application provides a sound source separation method, apparatus, and device for distinguishing the sound of a target sound source from mixed sound.
In a first aspect, an embodiment of this application provides a sound source separation method, including: acquiring a first audio signal, and acquiring at least one image frame corresponding to the first audio signal, the at least one image frame including image information of a target sound source; acquiring, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal; and further, acquiring, according to the obtained time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal.
When the target sound source emits sound, the image information of the target sound source satisfies certain characteristics, and changes in the sound intensity and sound frequency both bring about changes in the image information. For example, when a person speaks, the image information of the face satisfies certain characteristics; when the person changes pitch or loudness, the image information of the face also changes to some extent. Therefore, the embodiments of this application use the image information of the target sound source during the duration of the first audio signal to obtain the time-frequency distribution information of the target sound source in the first audio signal, which helps to improve the accuracy of the time-frequency distribution information, and in turn helps to obtain the second audio signal belonging to the target sound source from the first audio signal more accurately.
In a possible implementation, acquiring the time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame includes: acquiring a first audio feature of the first audio signal; acquiring a first image frame from the at least one image frame, and identifying a feature region in the first image frame; further, acquiring a first image feature according to the feature region; and processing the feature region, the first image feature, and the first audio feature using a neural network to obtain the time-frequency distribution information.
The above method provides a possible implementation of obtaining the time-frequency distribution information. Since there is a certain correlation between the image information of the target sound source and the audio signal generated by the target sound source, a neural network can be trained so that it models this correlation. The neural network can then be used to process the feature region, the first image feature, and the first audio feature to obtain the time-frequency distribution information. In addition, image features of multiple dimensions, including the feature region and the first image feature, are used in obtaining the time-frequency distribution information, which also helps to improve its accuracy.
In a possible implementation, processing the feature region, the first image feature, and the first audio feature using the neural network to obtain the time-frequency distribution information includes: processing the feature region using the neural network to obtain a second image feature; processing the first audio feature using the neural network to obtain a second audio feature; further, performing data splicing on the first image feature, the second image feature, and the second audio feature to obtain a spliced feature; processing the spliced feature using the neural network to obtain a fused feature; and processing the fused feature using the neural network to obtain the time-frequency distribution information.
In a possible implementation, the first image frame is any image frame in the at least one image frame, or the first image frame is the center image frame in the at least one image frame. In the above method, the center image frame is the image frame corresponding to the middle time point within the duration of the first audio signal. It can be understood that the center image frame is more representative than other image frames, so obtaining the time-frequency distribution information based on the center image frame helps to improve its accuracy.
In a possible implementation, acquiring the first image feature according to the feature region includes: processing the feature region using an active appearance model (AAM) to obtain the first image feature.
In a possible implementation, acquiring the first audio feature of the first audio signal includes: performing time-frequency transform processing on the first audio signal to obtain the first audio feature.
In a possible implementation, the time-frequency distribution information includes a probability value corresponding to each time-frequency unit in the first audio signal, and the probability value is used to indicate the probability that the audio signal generated by the target sound source falls into the time-frequency unit corresponding to that probability value. Based on this, acquiring the second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information includes: acquiring a first audio intensity value of each time-frequency unit in the first audio signal; obtaining a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtaining the second audio signal according to the second audio intensity value of each time-frequency unit.
In a possible implementation, after the second audio signal belonging to the target sound source is obtained from the first audio signal, the method further includes: processing the second audio signal using a speech recognition model to obtain the language text information carried in the second audio signal.
In a second aspect, an embodiment of this application provides a sound source separation apparatus, including: an audio acquisition module, configured to acquire a first audio signal; an image acquisition module, configured to acquire at least one image frame corresponding to the first audio signal, the at least one image frame including image information of a target sound source; and a joint processing module, configured to acquire time-frequency distribution information of the target sound source in the first audio signal according to the first audio signal and the at least one image frame, and to acquire a second audio signal belonging to the target sound source from the first audio signal according to the time-frequency distribution information.
In a possible implementation, the joint processing module is specifically configured to: acquire a first audio feature of the first audio signal; acquire a first image frame from the at least one image frame; identify a feature region in the first image frame; acquire a first image feature according to the feature region; and process the feature region, the first image feature, and the first audio feature using a neural network to obtain the time-frequency distribution information.
In a possible implementation, the first image frame is any image frame in the at least one image frame, or the first image frame is the center image frame in the at least one image frame.
In a possible implementation, the joint processing module is specifically configured to process the feature region using an active appearance model (AAM) to obtain the first image feature.
In a possible implementation, the joint processing module is specifically configured to perform time-frequency transform processing on the first audio signal to obtain the first audio feature.
In a possible implementation, the time-frequency distribution information includes a probability value corresponding to each time-frequency unit in the first audio signal; the probability value is used to indicate the probability that the audio signal generated by the target sound source exists in that time-frequency unit. The joint processing module is specifically configured to: acquire a first audio intensity value of each time-frequency unit in the first audio signal; obtain a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtain the second audio signal according to the second audio intensity value of each time-frequency unit.
In a possible implementation, the apparatus further includes a speech recognition module, configured to process the second audio signal using a speech recognition model to obtain the language text information carried in the second audio signal.
In a third aspect, an embodiment of this application provides a sound source separation apparatus, including a processor and a memory, where the memory is configured to store program instructions, and the processor is configured to run the program instructions to cause the sound source separation apparatus to perform the method of any one of the first aspect.
In a fourth aspect, an embodiment of this application provides a sound source separation device, including the sound source separation apparatus provided in the third aspect, and an audio collector and/or a video collector, where the audio collector is configured to collect the first audio signal, and the video collector is configured to collect a first video signal carrying the at least one image frame.
In a possible implementation, the device further includes a speaker, configured to convert the second audio signal into externally played sound.
In a possible implementation, the device further includes a display, configured to display the text information recognized from the second audio signal.
In a possible implementation, the device further includes a transceiver, configured to receive the first audio signal, and/or receive the first video signal, and/or send the second audio signal, and/or send the text information recognized from the second audio signal.
In a fifth aspect, an embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
In a sixth aspect, an embodiment of this application further provides a computer program product including instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
These and other aspects of this application will be more clearly and easily understood from the description of the following embodiments.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a sound source separation device according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a possible sound source separation method according to an embodiment of this application;
FIG. 3 is a schematic diagram of a first audio signal according to an embodiment of this application;
FIG. 4 is a schematic diagram of an image frame according to an embodiment of this application;
FIG. 5 is a schematic diagram of time-frequency distribution information according to an embodiment of this application;
FIG. 6 is a spectrogram corresponding to a first audio signal according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a method for obtaining time-frequency distribution information according to an embodiment of this application;
FIG. 8 is a schematic diagram of a neural network structure according to an embodiment of this application;
FIG. 9 is a schematic diagram of a sound source separation apparatus according to an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings. The specific operation methods in the method embodiments can also be applied to the apparatus embodiments or system embodiments. It should be noted that in the description of this application, "at least one" means one or more, and "multiple" means two or more; in view of this, "multiple" in the embodiments of this application can also be understood as "at least two". "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists. In addition, unless otherwise specified, the character "/" generally indicates an "or" relationship between the associated objects. It should also be understood that in the description of this application, terms such as "first" and "second" are used only to distinguish the description and should not be understood as indicating or implying relative importance or order.
To suppress the interference of environmental noise with the target sound source, the embodiments of this application provide a sound source separation method. The method is applicable to a sound source separation apparatus, which can be a chip, circuit board, or chipset in audio processing equipment such as a hearing aid or a voice recorder and can run the necessary software; the apparatus can also be an independent audio processing device. In the embodiments of this application, the mixed audio signal obtained from the mixed sound and the image frames corresponding to the mixed audio signal are jointly processed, so that the audio signal belonging to the target sound source is separated from the mixed audio signal; the sound of the target sound source can then be distinguished from the mixed sound according to the audio signal belonging to the target sound source, so the interference of environmental noise with the sound of the target sound source can be suppressed.
FIG. 1 exemplarily shows a sound source separation device to which the embodiments of this application are applicable. As shown in FIG. 1, the device 100 includes a sound source separation apparatus 101 and, in a possible implementation, may further include an audio collector 102 and a video collector 103. The audio collector 102 may be a microphone, which can convert the collected mixed sound into a mixed audio signal and store it. The video collector 103 may be a camera, which can capture the image information of the target sound source and store the collected image information in the form of a video signal.
The sound source separation apparatus 101 includes a processor 1011 and a memory 1012. Optionally, the sound source separation apparatus 101 may further include a bus 1013. The processor 1011 and the memory 1012 may be connected to each other through the bus 1013. The bus 1013 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 1, but this does not mean that there is only one bus or one type of bus.
The processor 1011 may include a CPU or a microprocessor, and may further include an ASIC or one or more integrated circuits for controlling the execution of the programs of this application. The memory 1012 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and so on), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto. The memory 1012 may exist independently and be connected to the processor 1011 through the bus 1013, or the memory 1012 may be integrated with the processor 1011. The memory 1012 is configured to store computer-executable instructions for executing the technical solutions provided in the embodiments of this application, and the execution is controlled by the processor 1011. The processor 1011 is configured to execute the computer-executable instructions stored in the memory 1012, so as to implement, based on the mixed audio signal stored by the audio collector 102 and the video signal stored by the video collector 103, the sound source separation method provided in the embodiments of this application.
In addition, depending on the specific functions of the device 100, the device 100 may further include other functional components. For example, the device 100 may be a hearing aid, and in this case the device 100 may further include a speaker 104. The speaker 104 can convert the audio signal belonging to the target sound source obtained by the sound source separation apparatus 101 into externally played sound and play it to the hearing-impaired user, which helps shield the interference of environmental noise with the sound of the target sound source and improve the speech intelligibility provided by the hearing aid.
In a possible implementation, the device 100 may further include a display 105. The display 105 may be used to display the language text information carried in the audio signal belonging to the target sound source, which helps further improve the speech intelligibility provided by the hearing aid.
In a possible implementation, the device 100 may further include a transceiver 106, which can support Wi-Fi, Bluetooth, and other transmission modes. The transceiver 106 can send the audio signal belonging to the target sound source and/or the language text information carried in that audio signal, for example, send the language text information carried in the audio signal belonging to the target sound source to a terminal device such as a mobile phone or tablet computer, so that the user can read the language text information on the display interface of the terminal device.
In addition, the transceiver 106 may also receive a mixed audio signal sent by another device and/or the image frames corresponding to the mixed audio signal. For example, the transceiver 106 may receive a mixed audio signal collected by a terminal device such as a mobile phone or tablet computer and the image frames corresponding to the mixed audio signal, and the sound source separation apparatus 101 separates the audio signal belonging to the target sound source from the mixed audio signal according to the sound source separation method provided in the embodiments of this application.
Next, based on the sound source separation device shown in FIG. 1, the sound source separation method provided in the embodiments of this application is further described with a specific example. FIG. 2 is a schematic flowchart of a possible sound source separation method according to an embodiment of this application. The method can be applied to the sound source separation apparatus 101 shown in FIG. 1 and, as shown in FIG. 2, mainly includes the following steps:
S201: Acquire a first audio signal. Specifically, the audio collector 102 collects the mixed sound, converts the mixed sound into a mixed audio signal in the form of a digital signal, and stores the mixed audio signal. For example, if there are sound sources A, B, and C, and sound source A emits sound 1, sound source B emits sound 2, and sound source C emits sound 3, the mixed sound collected by the audio collector 102 includes sound 1, sound 2, and sound 3. The audio collector 102 converts the mixed sound into a mixed audio signal in the form of a digital signal and then stores the obtained mixed audio signal.
The sound source separation apparatus 101 may obtain, at a certain time interval T, all or part of the mixed audio signal stored by the audio collector 102, and the obtained mixed audio signal is the first audio signal. For example, FIG. 3 shows n first audio signals, namely signals S1 to Sn, that the sound source separation apparatus 101 obtains in sequence from the mixed audio signal stored by the audio collector 102. Each first audio signal has the same duration, and the duration may be the same as the time interval T at which the sound source separation apparatus 101 obtains the first audio signals; in other words, the duration of each first audio signal is T. In addition, when the n first audio signals in FIG. 3 belong to the same continuous mixed audio signal, there may be partial time-domain overlap between adjacent first audio signals, as shown by S1, S2, and S3 in FIG. 3. The sound source separation method provided in the embodiments of this application is applicable to any of the first audio signals; for ease of description, the embodiments of this application take one first audio signal as an example.
S202: The sound source separation apparatus 101 acquires at least one image frame corresponding to the first audio signal. Specifically, the video collector 103 collects the image information of the target sound source, converts the collected image information of the target sound source into a video signal in the form of a digital signal, and stores the video signal. In the example of S201, sound source A is the target sound source, so the video collector 103 collects the image information of sound source A, converts the image information of sound source A into a video signal in the form of a digital signal, and stores the video signal.
In the embodiments of this application, the sound source separation apparatus 101 may also obtain, at the time interval T, the video signal stored by the video collector 103 and parse out the at least one image frame carried by the video signal, that is, the at least one image frame corresponding to the first audio signal. As shown in FIG. 4, a corresponding image frame includes the image information of the target sound source, a person. Alternatively, the target sound source may also be another object such as a musical instrument, a machine, or an animal.
S203: The sound source separation apparatus 101 acquires, according to the first audio signal and the at least one image frame, the time-frequency distribution information of the target sound source in the first audio signal. The time-frequency distribution information of the target sound source in the first audio signal can indicate the time-frequency distribution, in the first audio signal, of the audio signal corresponding to the sound generated by the target sound source. Therefore, by performing S204 based on the obtained time-frequency distribution information, the second audio signal belonging to the target sound source can be obtained from the first audio signal, thereby achieving sound source separation.
图5为本申请实施例提供的一种时频分布信息示意图。应理解,图5仅为便于说明而将时频分布信息以语谱图的形式表示,在实际计算过程中时频分布信息可以为一系列数值。如图5所示,时频分布信息的长(横轴)为时间轴,宽(纵轴)为频率轴,图5中1个小方格代表1个时频单元。在一种可能的实现方式中,每个时频单元还对应有概率值,如图5中,时频单元a对应的概率值为0.8,则表示目标声源所产生的音频信号在时频单元a中存在的概率是0.8。
In addition, the first audio signal may be represented by a spectrogram, as shown in FIG. 6. The difference from the time-frequency distribution information is that in FIG. 6 each time-frequency unit corresponds to a first audio intensity value, which represents the audio intensity of the first audio signal in that time-frequency unit. For example, in FIG. 6 the first audio intensity value corresponding to time-frequency unit a is 100, which means that the audio intensity of the first audio signal in time-frequency unit a is 100.
The sound source separation apparatus 101 may obtain the second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit in the first audio signal shown in FIG. 6 and the probability value corresponding to each time-frequency unit in the time-frequency distribution information shown in FIG. 5. For example, the product 80 of the first audio intensity value 100 of time-frequency unit a and the corresponding probability value 0.8 may be taken as the second audio intensity value of time-frequency unit a, and the other time-frequency units are handled in the same way.
After obtaining the second audio intensity value of each time-frequency unit, the sound source separation apparatus 101 then obtains the second audio signal belonging to the target sound source. Typically, the sound source separation apparatus 101 may obtain the second audio signal through an inverse time-frequency transform based on the second audio intensity values of the multiple time-frequency units, and the audio intensity value of each time-frequency unit in the obtained second audio signal may be the above second audio intensity value. Taking time-frequency unit a in FIG. 5 and FIG. 6 as an example, the second audio intensity value of time-frequency unit a is 80, so the second audio signal obtained after all time-frequency units have been inversely transformed has an audio intensity value of 80 in time-frequency unit a.
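As an illustration only (not the implementation prescribed by this application), the sketch below shows how per-unit probability values could be applied as a soft mask to the STFT of the first audio signal and the second audio signal recovered by an inverse transform; the sampling rate and STFT parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_time_frequency_mask(first_audio, mask, fs=16000, nperseg=512):
    """Weight each time-frequency unit of the first audio signal by its
    probability value and reconstruct the second audio signal."""
    f, t, spec = stft(first_audio, fs=fs, nperseg=nperseg)       # complex spectrogram
    assert mask.shape == spec.shape                               # one probability per unit
    masked_spec = spec * mask                                     # e.g. 100 * 0.8 -> 80
    _, second_audio = istft(masked_spec, fs=fs, nperseg=nperseg)  # inverse time-frequency transform
    return second_audio
```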
Typically, when the target sound source produces sound, the image information of the target sound source satisfies certain characteristics, and changes in the sound intensity and the sound frequency both bring about changes in the image information. For example, when a person speaks, the image information of the face satisfies certain characteristics, and when the person changes the pitch or the loudness, the image information of the face also changes to a certain extent. Therefore, the embodiments of this application use the image information of the target sound source during the duration of the first audio signal to obtain the time-frequency distribution information of the target sound source in the first audio signal, which helps improve the accuracy of the time-frequency distribution information and thus helps obtain the second audio signal belonging to the target sound source from the first audio signal more accurately.
Next, an embodiment of this application further provides a possible method for obtaining the time-frequency distribution information, corresponding to the foregoing step S203. FIG. 7 is a schematic flowchart of a method for obtaining time-frequency distribution information provided by an embodiment of this application. As shown in FIG. 7, the method mainly includes the following steps:
S701: The sound source separation apparatus 101 obtains a first audio feature of the first audio signal. For example, the first audio feature may be obtained by performing a time-frequency transform on the first audio signal; for instance, a Fourier transform (FT) may be used to process the first audio signal to obtain the first audio feature, or a short-time Fourier transform (STFT) may be used to process the first audio signal to obtain the first audio feature, and so on. The STFT is a commonly used time-frequency analysis method that converts the first audio signal in the time domain into the first audio feature through a fixed transform formula.
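A minimal sketch (not part of the original disclosure) of extracting a first audio feature with an STFT; the 16 kHz sampling rate, window parameters and log-magnitude representation are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def first_audio_feature(first_audio, fs=16000, nperseg=512, noverlap=384):
    """Time-frequency transform of the first audio signal (S701)."""
    _, _, spec = stft(first_audio, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log1p(np.abs(spec))  # log-magnitude spectrogram as the first audio feature
```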
S702: The sound source separation apparatus 101 acquires a first image frame and obtains a feature region based on the first image frame. This process may include: the sound source separation apparatus 101 acquires the first image frame from the at least one image frame. In this embodiment of the application, if there is only one image frame corresponding to the first audio signal, that image frame is the first image frame. If there are multiple image frames corresponding to the first audio signal, the first image frame may be any one of the multiple image frames, or may be the central image frame among the multiple image frames. The central image frame may be understood as the image frame located at the middle time point among the multiple image frames. In the above example, the at least one image frame includes the image information of sound source A within the duration T; when there are multiple image frames, the first image frame may be the image frame corresponding to the middle time point of the duration T, and this first image frame includes the image information of sound source A at the middle time point of the duration T. It can be understood that the central image frame is more representative, so the time-frequency distribution information obtained based on the central image frame is more accurate.
In a possible implementation, by appropriately setting the duration T at which the sound source separation apparatus 101 acquires the first audio signal, the first audio signal can correspond to exactly one image frame, and that image frame is the central image frame corresponding to the middle time point within the duration T. In this case, the processing of the video signal can be simplified.
Further, the process includes: based on the obtained first image frame, the sound source separation apparatus 101 further identifies a feature region in the first image frame. In this embodiment of the application, the choice of the feature region is related to the type of the target sound source, and it is usually a region whose image information changes to some extent when the target sound source produces sound. In FIG. 4 the target sound source is a person, so the first image frame includes the image information of the person. Since the sound production of a person is mainly related to the image information of the face, the feature region in the first image frame, that is, the face region in the first image frame, can be identified through an image processing algorithm such as face recognition. Alternatively, the region corresponding to the target sound source may be identified as the target region through other image recognition algorithms.
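As one possible and purely illustrative way to locate the face region, the sketch below uses an off-the-shelf OpenCV Haar-cascade face detector; this application does not prescribe a specific detector, so the choice of detector and cascade file is an assumption.

```python
import cv2

def face_feature_region(first_image_frame):
    """Identify the feature region (face) in the first image frame (S702)."""
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(first_image_frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                       # take the first detected face
    return first_image_frame[y:y + h, x:x + w]  # cropped feature region
```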
S703: The sound source separation apparatus 101 obtains a first image feature according to the feature region. For example, the feature region may be processed with a pre-trained active appearance model (AAM) to obtain the first image feature. The AAM is a feature point extraction method widely used in the field of pattern recognition. It not only considers local feature information but also takes global shape and texture information into account, and builds a face model through statistical analysis of the shape features and texture features of the face. It can also be considered that the AAM describes a face with a number of key points, and the final first image feature includes coordinate information of these key points.
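A sketch of turning the feature region into a first image feature made up of key-point coordinates. The `fit_aam` callable stands in for a pre-trained active appearance model and is hypothetical; any landmark model that returns, say, 40 (x, y) key points (80 values) would fill the same role.

```python
import numpy as np

def first_image_feature(feature_region, fit_aam):
    """Flatten key-point coordinates into the first image feature (S703).

    `fit_aam` is a hypothetical callable standing in for a pre-trained AAM
    fitting routine; it is assumed to return key points of shape (num_points, 2).
    """
    key_points = fit_aam(feature_region)
    return np.asarray(key_points, dtype=np.float32).reshape(-1)  # e.g. 40 points -> 80 values
```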
S704: The sound source separation apparatus 101 processes the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information. It can be understood that the neural network is trained in advance. For example, the neural network is trained using known sample audio, the feature regions of sample sound sources, the image features of the sample sound sources, and the time-frequency distribution information of the sample sound sources in the sample audio. The image features of a sample sound source may be the image features obtained by processing the image frames corresponding to the sample audio with the AAM algorithm. After multiple rounds of training, once some of the variables in the neural network, such as the weight values, have been determined, the neural network has the capability to analyze or compute the time-frequency distribution information; the neural network obtains the time-frequency distribution information by processing the output information obtained in S701 to S703. Specifically, the neural network may be a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) network, or the like, or a combination of multiple network types; the specific network type and structure can both be adjusted according to the actual effect.
FIG. 8 is a schematic diagram of a neural network structure provided by an embodiment of this application. As shown in FIG. 8, the neural network has a dual-tower structure, mainly including an image stream tower, an audio stream tower, fully connected layers and a decoding layer. Based on this, the process in which the sound source separation apparatus 101 obtains the time-frequency distribution information mainly includes the following steps:
Step 1: Process the feature region with the image stream tower of the neural network to obtain a second image feature. The processing may include operations such as convolution, pooling, residual (skip) connections and batch normalization.
In a possible implementation, the image stream tower may use the network shown in Table 1 below:
Table 1
Layer    Convolution units (filters)    Kernel
1        128                            5×5
2        128                            5×5
3        256                            3×3
4        256                            3×3
5        512                            3×3
6        512                            3×3
As shown in Table 1, the image stream tower includes 6 convolutional layers. Adjacent layers may have the same kernel or different kernels; that is, the sound source separation apparatus 101 may perform 6 layers of convolution processing on the data of the feature region with the image stream tower, with the number of convolution units and the kernel size of each layer as shown in the table. For example, layer 1 has 128 convolution units, and when the sound source separation apparatus 101 performs convolution processing on the data of the feature region with layer 1, the kernel size used is 5×5; the other layers are similar and are not described again.
As shown in Table 1, there is one 2× downsampling between layer 2 and layer 3 of the image stream tower, which doubles the number of convolution units; similarly, there is another 2× downsampling between layer 4 and layer 5. In addition, each convolutional layer may be followed by a batch normalization (BN) layer and a leaky rectified linear unit (Leaky ReLU); that is, after completing one convolution with each convolutional layer, the sound source separation apparatus 101 may further perform batch normalization on the convolved data with the BN layer and rectify the batch-normalized data with the Leaky ReLU unit. In addition, a certain amount of dropout is applied between adjacent convolutional layers to prevent overfitting.
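A PyTorch-style sketch of an image stream tower following Table 1 (convolution, BN and Leaky ReLU per layer, with dropout between layers); the strides used for the two 2× downsamplings, the dropout rate, the negative slope and the input channel count are assumptions, since the description above leaves them open.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride=1, dropout=0.2):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
        nn.Dropout2d(dropout),
    )

image_stream_tower = nn.Sequential(      # Table 1
    conv_block(3,   128, 5),             # layer 1
    conv_block(128, 128, 5),             # layer 2
    conv_block(128, 256, 3, stride=2),   # layer 3, after the first 2x downsampling
    conv_block(256, 256, 3),             # layer 4
    conv_block(256, 512, 3, stride=2),   # layer 5, after the second 2x downsampling
    conv_block(512, 512, 3),             # layer 6
)
```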
Step 2: The sound source separation apparatus 101 processes the first audio feature with the audio stream tower of the neural network to obtain a second audio feature. This processing may also include operations such as convolution, pooling, residual connections and batch normalization. In this embodiment of the application, the audio stream tower may use the network shown in Table 2 below:
Table 2
Layer    Convolution units (filters)    Kernel
1        64                             2×2
2        64                             1×1
3        128                            2×2
4        128                            2×1
5        128                            2×1
As shown in Table 2, the audio stream tower includes 5 convolutional layers. Adjacent layers may have the same kernel or different kernels; that is, the sound source separation apparatus 101 may perform 5 layers of convolution processing on the data of the first audio feature with the audio stream tower. For example, layer 1 has 64 convolution units, and when the sound source separation apparatus 101 performs convolution processing on the data of the first audio feature with layer 1, the kernel size used is 2×2; the other layers are similar and are not described again.
Similar to the image stream tower, in this embodiment of the application each convolutional layer of the audio stream tower may also be followed by a BN layer and a Leaky ReLU layer, which is not described again here.
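A matching sketch of an audio stream tower following Table 2; treating the STFT magnitude as a single-channel 2-D input and using a convolution + BN + Leaky ReLU block per layer are assumptions consistent with the description above, not the prescribed implementation.

```python
import torch.nn as nn

def audio_conv(in_ch, out_ch, kernel):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

audio_stream_tower = nn.Sequential(    # Table 2
    audio_conv(1,   64,  (2, 2)),      # layer 1
    audio_conv(64,  64,  (1, 1)),      # layer 2
    audio_conv(64,  128, (2, 2)),      # layer 3
    audio_conv(128, 128, (2, 1)),      # layer 4
    audio_conv(128, 128, (2, 1)),      # layer 5
)
```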
Step 3: The sound source separation apparatus 101 performs feature concatenation on the first image feature of the first image frame, the second image feature and the second audio feature to obtain a concatenated feature. For example, the sound source separation apparatus 101 may join the data of the above three features end to end to complete the feature concatenation, and use the joined data as the concatenated feature.
Step 4: The sound source separation apparatus 101 processes the concatenated feature with the fully connected layers of the neural network to obtain a fused feature.
Step 5: The sound source separation apparatus 101 processes the fused feature with the decoding layer of the neural network to obtain the time-frequency distribution information. The decoding layer is a mirror network of the audio stream tower, equivalent to the inverse operation of the audio stream tower.
Assume that the second audio feature obtained by the sound source separation apparatus 101 with the audio stream tower has 2048 data values, the second image feature obtained with the image stream tower has 3200 data values, and the first image feature has 80 data values in total. The sound source separation apparatus 101 then concatenates the second audio feature, the second image feature and the first image feature to obtain a 5328-dimensional concatenated feature. The sound source separation apparatus 101 processes the concatenated feature with three fully connected layers to obtain a fused feature containing 3200 data values. After that, the sound source separation apparatus 101 processes the fused feature with the decoding layer to obtain the time-frequency distribution information.
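An illustrative sketch of the fusion path in the numeric example above (a 5328-dimensional concatenation, three fully connected layers down to 3200 values, then a decoder producing per-unit probabilities). The hidden widths of the first two fully connected layers are assumptions, and the decoder is simplified here to a single layer with a sigmoid rather than the mirror of the audio stream tower described above.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, audio_dim=2048, image_dim=3200, keypoint_dim=80, fused_dim=3200):
        super().__init__()
        concat_dim = audio_dim + image_dim + keypoint_dim        # 5328 in the example
        self.fc = nn.Sequential(                                 # three fully connected layers
            nn.Linear(concat_dim, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, fused_dim),
        )
        # Simplified decoder: maps the fused feature to one probability per
        # time-frequency unit (flattened), standing in for the mirror network.
        self.decoder = nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, second_audio_feat, second_image_feat, first_image_feat):
        concat = torch.cat([second_audio_feat, second_image_feat, first_image_feat], dim=-1)
        fused = self.fc(concat)
        return self.decoder(fused)   # time-frequency distribution information (per-unit probabilities)
```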
It should be pointed out that FIG. 8 is only one feasible specific example, and many variations exist on the basis of the neural network shown in FIG. 8. For example, the number of network layers and the number of convolution kernels of the image stream tower and the audio stream tower may be changed; for another example, the number of layers, the number of nodes and the connection manner of the fully connected layers may be changed, or other network modules may be added, for example, a fully connected layer may be appended after the first image feature, the first image feature is processed by that fully connected layer, and the processed result is then concatenated with the second image feature and the second audio feature. These variations are not enumerated one by one in the embodiments of this application.
With the above method, the sound source separation apparatus 101 associates and fuses the first audio feature of the first audio signal, the first image feature of the first image frame (obtained through the AAM) and the second image feature (obtained through the image stream tower) with a carefully designed neural network. Guided by the features of the image frame at different levels, the sound source separation apparatus 101 selectively retains the parts of the first audio signal that are strongly correlated with the first image feature and the second image feature, and discards the irrelevant parts. Compared with existing methods, this solution not only uses audio and image frames at the same time, but also uses image features of the image frames at multiple different levels and fuses these image features of different levels at specified steps, which improves the accuracy of the time-frequency distribution information obtained by the sound source separation apparatus 101.
In this embodiment of the application, after obtaining the second audio signal of the target sound source, the sound source separation apparatus 101 may further process the second audio signal. In a possible implementation, the sound source separation apparatus 101 may process the second audio signal with a speech recognition model to obtain the language text information carried in the second audio signal; this process may also be called semantic recognition. The speech recognition model is trained on multiple known third audio signals and the language text information corresponding to each of these third audio signals. After obtaining the language text information carried in the second audio signal, the sound source separation apparatus 101 may send the language text information through the transceiver 106, or control the display 105 to display the language text information, and so on.
In a possible implementation, among the multiple third audio signals used for training the speech recognition model, a certain number of third audio signals are obtained according to the process shown in FIG. 2. That is, when training the speech recognition model, mixed audio containing the third audio signal and the language text information corresponding to the third audio signal are first acquired, and the third audio signal is obtained from the mixed audio according to the process shown in FIG. 2. After that, the speech recognition model is trained on the third audio signals obtained through the process shown in FIG. 2 and the language text information of those third audio signals. In other words, part of the training data of the speech recognition model comes from the method described earlier in the embodiments of this application. A speech recognition model trained with the above method can better match the foregoing sound source separation method, thereby improving the accuracy of the recognition result for the second audio signal.
In another possible implementation, the sound source separation apparatus may further perform targeted processing on the second audio signal according to the specific application scenario. For example, when the sound source separation apparatus is applied in a hearing aid, the sound source separation apparatus may further adaptively adjust different frequency bands of the second audio signal according to the hearing impairment of the hearing-impaired person, and so on.
It can be understood that, in order to implement the above functions, the sound source separation apparatus may include hardware structures and/or software modules corresponding to the execution of each function. Those skilled in the art should easily realize that, in combination with the modules and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
For example, FIG. 9 shows a possible exemplary block diagram of the sound source separation apparatus in the embodiments of this application. The apparatus 900, or at least one module therein, may exist in the form of software, in the form of hardware, or in a combination of software and hardware. The software may run on various types of processors, including but not limited to a central processing unit (CPU), a microprocessor, a microcontroller, a digital signal processor (DSP) or a neural processor. The hardware may be a semiconductor chip, chipset or circuit board in the sound source separation device, and may selectively execute software to work; for example, it may include a CPU, a DSP, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The apparatus 900 can implement or execute the various exemplary logical blocks described in connection with the disclosure of the method embodiments of this application. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on.
As shown in FIG. 9, the apparatus 900 may include an audio acquisition module 901, an image acquisition module 902 and a joint processing module 903. Specifically, in one embodiment, the audio acquisition module 901 is configured to acquire a first audio signal; the image acquisition module 902 is configured to acquire at least one image frame corresponding to the first audio signal, where the at least one image frame includes image information of a target sound source; the joint processing module 903 is configured to obtain, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal, and to obtain, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal.
In a possible implementation, the joint processing module 903 is specifically configured to: obtain a first audio feature of the first audio signal; acquire a first image frame from the at least one image frame; identify a feature region in the first image frame; obtain a first image feature according to the feature region; and process the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information. In a possible implementation, the first image frame is any one of the at least one image frame, or the first image frame is the central image frame of the at least one image frame. In a possible implementation, the joint processing module 903 is specifically configured to: process the feature region with an active appearance model (AAM) to obtain the first image feature. In a possible implementation, the joint processing module 903 is specifically configured to: perform time-frequency transform processing on the first audio signal to obtain the first audio feature. In a possible implementation, the time-frequency distribution information includes a probability value corresponding to each time-frequency unit in the first audio signal, where the probability value is used to indicate the probability that the audio signal produced by the target sound source falls into the time-frequency unit; the joint processing module 903 is specifically configured to: obtain a first audio intensity value of each time-frequency unit in the first audio signal; obtain a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to that time-frequency unit; and obtain the second audio signal according to the second audio intensity value of each time-frequency unit. In a possible implementation, the apparatus further includes a speech recognition module 904, where the speech recognition module 904 is configured to process the second audio signal with a speech recognition model to obtain language text information carried in the second audio signal.
Optionally, the computer-executable instructions in the embodiments of this application may also be called application program code, which is not specifically limited in the embodiments of this application. The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When software is used for implementation, the implementation may be in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired (for example, coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (for example, infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), and so on.
The various illustrative hardware logic units and circuits described in the embodiments of this application may be implemented or operated by a general-purpose processor, a digital signal processor, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to implement or operate the described functions. The general-purpose processor may be a microprocessor; optionally, the general-purpose processor may also be any conventional processor, controller, microcontroller or state machine. The processor may also be implemented by a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
The steps of the methods or algorithms described in the embodiments of this application may be directly embedded in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in a RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, register, hard disk, removable disk, CD-ROM or any other form of storage medium known in the art. Exemplarily, the storage medium may be connected to the processor so that the processor can read information from the storage medium and write information to the storage medium. Optionally, the storage medium may also be integrated into the processor. The processor and the storage medium may be arranged in an ASIC, and the ASIC may be arranged in a terminal device. Optionally, the processor and the storage medium may also be arranged in different components of the terminal device.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although this application has been described in combination with specific features and embodiments thereof, it is evident that various modifications and combinations can be made without departing from the spirit and scope of this application. Accordingly, this specification and the accompanying drawings are merely exemplary descriptions of this application as defined by the appended claims, and are deemed to have covered any and all modifications, variations, combinations or equivalents within the scope of this application. Obviously, those skilled in the art can make various changes and variations to this application without departing from the spirit and scope of this application. Thus, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include these changes and variations.

Claims (19)

  1. A sound source separation method, characterized by comprising:
    acquiring a first audio signal;
    acquiring at least one image frame corresponding to the first audio signal, wherein the at least one image frame comprises image information of a target sound source;
    obtaining, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal;
    obtaining, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal.
  2. The method according to claim 1, characterized in that the obtaining, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal comprises:
    obtaining a first audio feature of the first audio signal;
    acquiring a first image frame from the at least one image frame;
    identifying a feature region in the first image frame;
    obtaining a first image feature according to the feature region;
    processing the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information.
  3. The method according to claim 2, characterized in that the first image frame is any one of the at least one image frame, or the first image frame is a central image frame of the at least one image frame.
  4. The method according to claim 2 or 3, characterized in that the obtaining a first image feature according to the feature region comprises:
    processing the feature region with an active appearance model (AAM) to obtain the first image feature.
  5. The method according to any one of claims 2 to 4, characterized in that the obtaining a first audio feature of the first audio signal comprises:
    performing time-frequency transform processing on the first audio signal to obtain the first audio feature.
  6. The method according to any one of claims 1 to 5, characterized in that the time-frequency distribution information comprises a probability value corresponding to each time-frequency unit in the first audio signal, wherein the probability value is used to indicate a probability that an audio signal produced by the target sound source is present in the time-frequency unit;
    the obtaining, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal comprises:
    obtaining a first audio intensity value of each time-frequency unit in the first audio signal;
    obtaining a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to the time-frequency unit;
    obtaining the second audio signal according to the second audio intensity value of each time-frequency unit.
  7. The method according to any one of claims 1 to 6, characterized in that after the obtaining a second audio signal belonging to the target sound source from the first audio signal, the method further comprises:
    processing the second audio signal with a speech recognition model to obtain language text information carried in the second audio signal.
  8. A sound source separation apparatus, characterized by comprising:
    an audio acquisition module, configured to acquire a first audio signal;
    an image acquisition module, configured to acquire at least one image frame corresponding to the first audio signal, wherein the at least one image frame comprises image information of a target sound source;
    a joint processing module, configured to obtain, according to the first audio signal and the at least one image frame, time-frequency distribution information of the target sound source in the first audio signal, and to obtain, according to the time-frequency distribution information, a second audio signal belonging to the target sound source from the first audio signal.
  9. The apparatus according to claim 8, characterized in that the joint processing module is specifically configured to: obtain a first audio feature of the first audio signal; acquire a first image frame from the at least one image frame; identify a feature region in the first image frame; obtain a first image feature according to the feature region; and process the feature region, the first image feature and the first audio feature with a neural network to obtain the time-frequency distribution information.
  10. The apparatus according to claim 9, characterized in that the first image frame is any one of the at least one image frame, or the first image frame is a central image frame of the at least one image frame.
  11. The apparatus according to claim 9 or 10, characterized in that the joint processing module is specifically configured to: process the feature region with an active appearance model (AAM) to obtain the first image feature.
  12. The apparatus according to any one of claims 9 to 11, characterized in that the joint processing module is specifically configured to: perform time-frequency transform processing on the first audio signal to obtain the first audio feature.
  13. The apparatus according to any one of claims 8 to 12, characterized in that the time-frequency distribution information comprises a probability value corresponding to each time-frequency unit in the first audio signal, wherein the probability value is used to indicate a probability that an audio signal produced by the target sound source is present in the time-frequency unit;
    the joint processing module is specifically configured to: obtain a first audio intensity value of each time-frequency unit in the first audio signal; obtain a second audio intensity value of each time-frequency unit according to the first audio intensity value of each time-frequency unit and the probability value corresponding to the time-frequency unit; and obtain the second audio signal according to the second audio intensity value of each time-frequency unit.
  14. The apparatus according to any one of claims 8 to 13, characterized by further comprising a speech recognition module;
    wherein the speech recognition module is configured to process the second audio signal with a speech recognition model to obtain language text information carried in the second audio signal.
  15. A sound source separation apparatus, characterized by comprising a processor and a memory;
    wherein the memory is configured to store program instructions;
    and the processor is configured to run the program instructions to cause the apparatus to perform the method according to any one of claims 1 to 7.
  16. A sound source separation device, characterized by comprising the sound source separation apparatus according to claim 15, and an audio collector and/or a video collector;
    wherein the audio collector is configured to collect the first audio signal;
    and the video collector is configured to collect a first video signal carrying the at least one image frame.
  17. The device according to claim 16, characterized by further comprising a loudspeaker;
    wherein the loudspeaker is configured to convert the second audio signal into externally played sound.
  18. The device according to claim 16 or 17, characterized by further comprising a display;
    wherein the display is configured to display text information recognized from the second audio signal.
  19. The device according to any one of claims 16 to 18, characterized by further comprising a transceiver;
    wherein the transceiver is configured to receive the first audio signal, and/or receive the first video signal, and/or send the second audio signal, and/or send the text information recognized from the second audio signal.
PCT/CN2019/076371 2019-02-27 2019-02-27 一种声源分离方法、装置及设备 WO2020172828A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/076371 WO2020172828A1 (zh) 2019-02-27 2019-02-27 一种声源分离方法、装置及设备
CN201980006671.XA CN111868823A (zh) 2019-02-27 2019-02-27 一种声源分离方法、装置及设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/076371 WO2020172828A1 (zh) 2019-02-27 2019-02-27 一种声源分离方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2020172828A1 true WO2020172828A1 (zh) 2020-09-03

Family

ID=72238795

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076371 WO2020172828A1 (zh) 2019-02-27 2019-02-27 一种声源分离方法、装置及设备

Country Status (2)

Country Link
CN (1) CN111868823A (zh)
WO (1) WO2020172828A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343809A (zh) * 2022-11-18 2023-06-27 上海玄戒技术有限公司 视频语音增强的方法及装置、电子设备和存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393643B (zh) * 2021-06-10 2023-07-21 上海安亭地平线智能交通技术有限公司 异常行为预警方法、装置、车载终端以及介质
CN113889140A (zh) * 2021-09-24 2022-01-04 北京有竹居网络技术有限公司 音频信号播放方法、装置和电子设备
CN115174959B (zh) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 视频3d音效设置方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1930575A (zh) * 2004-03-30 2007-03-14 英特尔公司 分离和评估音频和视频源数据的技术
CN101656070A (zh) * 2008-08-22 2010-02-24 展讯通信(上海)有限公司 一种语音检测方法
CN104795065A (zh) * 2015-04-30 2015-07-22 北京车音网科技有限公司 一种提高语音识别率的方法和电子设备
CN105389097A (zh) * 2014-09-03 2016-03-09 中兴通讯股份有限公司 一种人机交互装置及方法
CN107221324A (zh) * 2017-08-02 2017-09-29 上海木爷机器人技术有限公司 语音处理方法及装置
CN107800860A (zh) * 2016-09-07 2018-03-13 中兴通讯股份有限公司 语音处理方法、装置及终端设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096935B (zh) * 2014-05-06 2019-08-09 阿里巴巴集团控股有限公司 一种语音输入方法、装置和系统
JP6464449B2 (ja) * 2014-08-29 2019-02-06 本田技研工業株式会社 音源分離装置、及び音源分離方法
US10109277B2 (en) * 2015-04-27 2018-10-23 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
JP6686977B2 (ja) * 2017-06-23 2020-04-22 カシオ計算機株式会社 音源分離情報検出装置、ロボット、音源分離情報検出方法及びプログラム
CN108877787A (zh) * 2018-06-29 2018-11-23 北京智能管家科技有限公司 语音识别方法、装置、服务器及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1930575A (zh) * 2004-03-30 2007-03-14 英特尔公司 分离和评估音频和视频源数据的技术
CN101656070A (zh) * 2008-08-22 2010-02-24 展讯通信(上海)有限公司 一种语音检测方法
CN105389097A (zh) * 2014-09-03 2016-03-09 中兴通讯股份有限公司 一种人机交互装置及方法
CN104795065A (zh) * 2015-04-30 2015-07-22 北京车音网科技有限公司 一种提高语音识别率的方法和电子设备
CN107800860A (zh) * 2016-09-07 2018-03-13 中兴通讯股份有限公司 语音处理方法、装置及终端设备
CN107221324A (zh) * 2017-08-02 2017-09-29 上海木爷机器人技术有限公司 语音处理方法及装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343809A (zh) * 2022-11-18 2023-06-27 上海玄戒技术有限公司 视频语音增强的方法及装置、电子设备和存储介质
CN116343809B (zh) * 2022-11-18 2024-04-02 上海玄戒技术有限公司 视频语音增强的方法及装置、电子设备和存储介质

Also Published As

Publication number Publication date
CN111868823A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2020172828A1 (zh) 一种声源分离方法、装置及设备
US11823679B2 (en) Method and system of audio false keyphrase rejection using speaker recognition
WO2020006935A1 (zh) 动物声纹特征提取方法、装置及计算机可读存储介质
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
WO2019210557A1 (zh) 语音质检方法、装置、计算机设备及存储介质
CN102388416B (zh) 信号处理装置及信号处理方法
CN107799126A (zh) 基于有监督机器学习的语音端点检测方法及装置
CN101023469A (zh) 数字滤波方法和装置
Gong et al. Detecting replay attacks using multi-channel audio: A neural network-based method
CN108320732A (zh) 生成目标说话人语音识别计算模型的方法和装置
US20170287489A1 (en) Synthetic oversampling to enhance speaker identification or verification
CN110837758B (zh) 一种关键词输入方法、装置及电子设备
US20240177726A1 (en) Speech enhancement
CN112382302A (zh) 婴儿哭声识别方法及终端设备
CN113921026A (zh) 语音增强方法和装置
CN112908336A (zh) 一种用于语音处理装置的角色分离方法及其语音处理装置
CN112185342A (zh) 语音转换与模型训练方法、装置和系统及存储介质
JP4864783B2 (ja) パタンマッチング装置、パタンマッチングプログラム、およびパタンマッチング方法
CN117643075A (zh) 用于言语增强的数据扩充
CN115050350A (zh) 标注检查方法及相关装置、电子设备、存储介质
CN113299309A (zh) 语音翻译方法及装置、计算机可读介质和电子设备
CN115206347A (zh) 肠鸣音的识别方法、装置、存储介质及计算机设备
CN113724694B (zh) 语音转换模型训练方法、装置、电子设备及存储介质
CN113257284B (zh) 语音活动检测模型训练、语音活动检测方法及相关装置
CN117688344B (zh) 一种基于大模型的多模态细粒度倾向分析方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19916836

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19916836

Country of ref document: EP

Kind code of ref document: A1