WO2023020620A1 - Audio-based processing method and apparatus - Google Patents

Audio-based processing method and apparatus Download PDF

Info

Publication number
WO2023020620A1
WO2023020620A1 (PCT/CN2022/113733)
Authority
WO
WIPO (PCT)
Prior art keywords
target
speaker
source signal
sound source
audio
Prior art date
Application number
PCT/CN2022/113733
Other languages
English (en)
French (fr)
Inventor
程光伟 (Cheng Guangwei)
Original Assignee
深圳地平线机器人科技有限公司 (Shenzhen Horizon Robotics Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳地平线机器人科技有限公司 (Shenzhen Horizon Robotics Technology Co., Ltd.)
Priority to US18/260,120 (published as US20240304201A1)
Publication of WO2023020620A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G10L21/0272 Voice signal separating

Definitions

  • the present disclosure relates to the field of vehicle technology and the field of audio processing technology, in particular to an audio-based processing method and device.
  • noise can be reduced to a certain extent through signal acquisition and noise reduction.
  • the existing noise reduction method suppresses wind noise and tire noise.
  • when several people chat in the vehicle, the signal played by the speaker is a mixed signal of multiple voices, and the listener will hear his or her own voice from the speaker, resulting in a poor user experience.
  • embodiments of the present disclosure provide an audio-based processing method and device.
  • an audio-based processing method including:
  • an audio-based processing device including:
  • the sound source signal extraction module is used to extract the target sound source signal from the mixed audio signal collected by the microphone array;
  • a sound source signal identification module configured to identify text content corresponding to the target sound source signal from the target sound source signal
  • a target speaker determination module configured to determine the target speaker based on the text content
  • a control module configured to control the target speaker to play the voice corresponding to the target sound source signal
  • an echo cancellation module configured to perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in that sound zone, and the voice playback volume of the target speaker.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the audio-based processing method described in the above-mentioned first aspect.
  • an electronic device includes:
  • the processor is configured to read the executable instructions from the memory, and execute the instructions to implement the audio-based processing method described in the first aspect above.
  • the target sound source signal is extracted from the mixed audio signal collected by the microphone array, the text content corresponding to the target sound source signal is then recognized from it, the speaker to be used is determined according to the text content, and the target speaker is controlled to play the voice corresponding to the target sound source signal, so that the people in the vehicle can communicate smoothly while it drives at high speed.
  • echo cancellation is performed on the speaker in the sound zone to which the target sound source signal belongs, so that the talker does not hear his or her own voice from the speaker of that sound zone, which improves the user experience.
  • Fig. 1 is a schematic flowchart of an audio-based processing method according to an embodiment of the present disclosure.
  • Fig. 2 is a structural block diagram of an audio-based processing device according to an embodiment of the present disclosure.
  • Fig. 3 is a structural block diagram of the echo cancellation module 250 in an embodiment of the present disclosure.
  • Fig. 4 is a structural block diagram of the sound source signal extraction module 210 in an embodiment of the present disclosure.
  • Fig. 5 is a structural block diagram of the target speaker determination module 230 in an embodiment of the present disclosure.
  • Fig. 6 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
  • plural may refer to two or more than two, and “at least one” may refer to one, two or more than two.
  • Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments and/or configurations suitable for use with electronic devices such as terminal devices, computer systems and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing technology environments including any of the foregoing.
  • Fig. 1 is a schematic flowchart of an audio-based processing method according to an embodiment of the present disclosure. This embodiment can be applied to electronic equipment, as shown in Figure 1, including the following steps:
  • a microphone array is provided in the vehicle, and the sound source signal of passengers at each seat can be collected through the microphone array.
  • a microphone and a loudspeaker are respectively arranged for each seat.
  • the microphone array includes five microphones, which are respectively arranged in the main driver's seat, the co-pilot's seat, the rear left passenger seat, the rear middle passenger seat and the rear right passenger seat.
  • Each microphone belongs to a fixed sound zone, for example, the microphone of the main driver’s seat belongs to the sound zone of the main driver’s seat, the microphone of the passenger’s seat belongs to the sound zone of the passenger’s seat, and so on.
  • the mixed audio signal can be processed to extract a target sound source signal, such as a voice signal of a passenger, from the mixed audio signal.
  • in the mixed audio signal collected by the microphone array, if only one person speaks during a certain period (for example, within 5 seconds), that period contains only noise and that person's sound source signal; if several people speak, the mixed audio signal for that period contains noise and the sound source signals of multiple people, the target sound source signal must be extracted from them, and each remaining sound source signal other than the target is put through the same subsequent steps (i.e., steps S2 to S5).
  • an audio recognition technology is used to identify the target sound source signal to obtain text content corresponding to the target sound source signal.
  • words identifying the chat partner can be found by text-processing the text content; the speaker corresponding to the chat partner can then be determined and used as the target speaker.
  • S5 Based on the position of the target speaker, the position of the speaker in the sound zone to which the target sound source signal belongs, and the volume of the voice, perform echo cancellation on the speakers in the sound zone to which the target sound source signal belongs.
  • the vehicle audio system pre-stores the position information of each speaker, models the actual sound emission at each position in the vehicle from prior measurements, and computes an optimal cancellation function for each position.
  • based on the speaker position of the sound zone to which the talker belongs (that is, the speaker position in the sound zone to which the target sound source signal belongs), the speaker position of the sound zone to which the chat partner belongs (that is, the position of the target speaker), and the playback volume, the optimal cancellation function generates a cancellation signal for canceling the talker's audio; based on this signal, the talker's own voice can be canceled in the speaker of the talker's sound zone.
  • the target sound source signal is extracted from the mixed audio signal collected by the microphone array, the text content corresponding to the target sound source signal is recognized from it, and the speaker to be used is determined according to the text content; on the one hand, the target speaker is controlled to play the voice corresponding to the text content, so that the people in the vehicle can communicate smoothly while it drives at high speed;
  • on the other hand, based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice, echo cancellation is performed on the speaker in that sound zone, preventing the talker from hearing his or her own voice through the speaker of his or her own sound zone and improving the user experience.
  • step S5 includes:
  • S5-1 Obtain the position of the auditory organ of the person generating the target sound source signal in space.
  • a camera for capturing in-vehicle video or in-vehicle images is installed in the vehicle. Based on the image captured by the camera and the parameters of the camera, the position of the auditory organ of the person generating the target sound source signal in space, that is, the position of the ear of the speaker, can be determined through image analysis.
  • the parameters of the camera include parameters such as focal length and resolution of the camera.
  • a radar is installed in the vehicle, and the position of the auditory organ of the person generating the target sound source signal in space can be determined by analyzing the point cloud data scanned by the radar.
  • S5-2 Based on the position in space of the auditory organ of the person producing the target sound source signal, the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice, perform echo cancellation on the speakers in that sound zone.
  • the distance between the speaker and the human ear can be calculated from the position of the human ear and the position of the speaker in the talker's sound zone.
  • when applying the optimal cancellation function, an optimal cancellation signal for canceling the talker's audio is generated from the distance between the speaker and the human ear, the speaker position of the sound zone to which the chat partner belongs (that is, the position of the target speaker), and the playback volume; based on this signal, the talker's own voice in the speaker of the talker's sound zone can be canceled to the greatest extent.
  • the distance between the talker and the speaker can be determined from the position in space of the talker's auditory organ and the position of the talker's speaker, and the cancellation signal can be adjusted dynamically based on this real-time distance, so that the talker's own voice in the speaker of the talker's sound zone is canceled to the greatest extent.
  • step S1 includes:
  • S1-1 Detect whether there are people on each seat in the car.
  • S1-2 Perform voice separation on the target audio signal based on the sound zones of the microphones corresponding to the occupied seats, and extract the target sound source signal according to the result of the voice separation, wherein the microphone array includes the microphones arranged at each seat in the car.
  • the voice separation model can be trained on a plurality of sound source signals, for example on the sound source signals of people who often ride in the vehicle, and effective voice separation can be performed based on the trained model.
  • a dynamic gain control function for suppressing dynamic tire noise and wind noise is provided in advance, and real-time tire noise and wind noise are suppressed based on the dynamic gain control function.
  • the vocal separation is performed only on the microphones with people on the seats, which can improve the efficiency of vocal separation and reduce the consumption of system resources.
  • real-time tire noise suppression and wind noise suppression can be performed through the dynamic gain control function. After human voice separation, tire noise suppression and wind noise suppression, the target sound source signal can be accurately extracted.
  • step S3 includes:
  • S3-1 Extract keywords from the text content, by performing word segmentation and keyword extraction on the text content.
  • S3-2 Match the keywords in the text content against multiple preset keywords, each of which corresponds to a speaker.
  • S3-3 Determine the target speaker based on the matching result.
  • for speaker A, the corresponding preset keywords include A1, A2 and A3; if the text content contains any of the keywords A1, A2 or A3, it can be determined that speaker A is the target speaker. Corresponding preset keywords are likewise set for the other speakers.
  • the keywords in the text content are matched with multiple preset keywords, and the target speaker can be determined quickly and accurately according to the matching results.
  • step S3-3 includes:
  • S3-3-1 Establish correspondence between at least two speakers and multiple preset keywords.
  • the at least two speakers include the speaker corresponding to the sound zone to which the target sound source signal belongs.
  • the at least two speakers include the talker's speaker and the chat partner's speaker.
  • the correspondence between all the speakers and their keywords may be pre-established; for example, if five speakers are installed in a five-seat car, the correspondence between the five speakers and their preset keywords can be established in advance.
  • S3-3-2 Match each of the plurality of preset keywords with keywords in the text content to obtain matching results between at least two speakers and the text content.
  • S3-3-3 Determine the target speaker based on the matching result and the corresponding relationship.
  • the correspondence between speakers and preset keywords may be established only for some of the speakers in the vehicle. If a keyword in the text content matches one of the multiple preset keywords, the talker of the sound source signal has a designated chat partner, and the speaker corresponding to the matched keyword is used as the target speaker; if the keywords in the text content fail to match all of the preset keywords, the talker has no designated chat partner, and the speakers of all occupied seats, or all speakers, can be used as the target speakers.
  • step S3-3-1 includes:
  • S3-3-1-1 Establish a first matching relationship between at least two target seats and multiple preset keywords, and/or a second matching relationship between the persons on the at least two target seats and the multiple keywords, wherein the at least two target seats are set in one-to-one correspondence with the at least two speakers.
  • the target seat may be bound to a preset keyword, or the person on the target seat may be bound to a preset keyword.
  • the name, alias or code name of the person on the target seat can be bound to the preset keyword, for example binding the alias "Lao San" (老三) to a designated person.
  • S3-3-1-2 Based on the first matching relationship and/or the second matching relationship, establish a corresponding relationship between at least two speakers and multiple preset keywords.
  • a corresponding relationship between at least two speakers and multiple preset keywords may be established.
  • the speaker in the main driver's seat may be associated with keywords such as "main driver's seat", "main driver" and "driver".
  • a corresponding relationship may likewise be established between the speaker of the front passenger seat and keywords such as "front passenger seat", "co-driver" and "front passenger".
  • a corresponding relationship between at least two speakers and multiple preset keywords may also be established.
  • for example, a person with the alias "Lao San" is sitting in the rear left passenger seat.
  • after the seating position of the person with the alias "Lao San" has been determined, for example by image recognition, the speaker at the rear left passenger seat can be associated with the keyword "Lao San", and it may also be associated with that person's real name.
  • a matching relationship with the speaker may be established based on the seat, person's name, alias or code name, as the corresponding relationship between the speaker and the preset keyword.
  • when a preset keyword from the matching relationship appears among the keywords of the text content corresponding to the target sound source signal, the target speaker and the chat partner can be determined quickly and accurately.
  • the audio-based processing method further includes: when a designated speaker plays audio of the target type, performing noise reduction on the remaining speakers based on the position of the designated speaker, the positions of the remaining speakers other than the designated speaker, and the volume of the target audio.
  • the target type audio includes the output audio of a certain passenger when performing human-computer interaction, listening to music or watching a movie.
  • when a passenger is interacting with the system, listening to music or watching a movie, noise reduction can be performed based on the volume of the audio played by the speaker at that passenger's position, the position of that speaker, and the positions of the speakers that need noise reduction (for example, speakers at occupied seats whose occupants do not want to be disturbed).
  • after step S5, the method further includes:
  • S6 If a preset end-of-chat keyword (such as "stop chatting" or "stop talking") is recognized in the mixed audio signal collected by the microphone array, the speaker to which the sound source signal containing that keyword belongs is turned off.
  • Any audio-based processing method provided by the embodiments of the present disclosure may be executed by any appropriate device capable of data processing, including but not limited to: a terminal device, a server, and the like.
  • any audio-based processing method provided in the embodiments of the present disclosure may be executed by a processor, for example, the processor executes any audio-based processing method mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. Details are not repeated below.
  • Fig. 2 is a structural block diagram of an audio-based processing device according to an embodiment of the present disclosure.
  • the audio-based processing device of the embodiment of the present disclosure includes: a sound source signal extraction module 210 , a sound source signal identification module 220 , a target speaker determination module 230 , a control module 240 and an echo cancellation module 250 .
  • the sound source signal extraction module 210 is used to extract the target sound source signal from the mixed audio signal collected by the microphone array;
  • the sound source signal identification module 220 is used to identify the text content corresponding to the target sound source signal from the target sound source signal;
  • a target speaker determination module 230 configured to determine a target speaker based on text content
  • the control module 240 is used to control the target speaker to play the voice corresponding to the target sound source signal
  • the echo cancellation module 250 is configured to perform echo cancellation on the speakers in the sound zone to which the target sound source signal belongs based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the voice playback volume of the target speaker.
  • Fig. 3 is a structural block diagram of the echo cancellation module 250 in an embodiment of the present disclosure. As shown in FIG. 3, in an embodiment of the present disclosure, the echo cancellation module 250 includes:
  • Hearing organ locating unit 2501 configured to acquire the position of the auditory organ of the person generating the target sound source signal in space;
  • the echo cancellation unit 2502 is configured to perform echo cancellation on the speakers in the sound zone to which the target sound source signal belongs, based on the position in space of the auditory organ of the person producing the target sound source signal, the position of the target speaker, the speaker position in that sound zone, and the volume of the voice.
  • Fig. 4 is a structural block diagram of the sound source signal extraction module 210 in an embodiment of the present disclosure. As shown in Figure 4, in one embodiment of the present disclosure, the sound source signal extraction module 210 includes:
  • the detection unit 2101 is used to detect whether there are people on each seat in the car;
  • the sound source signal processing unit 2102 is configured to perform vocal separation on the target audio signal based on the assigned sound zone of the microphone corresponding to the seat with people on it, and extract the target sound source signal according to the result of the vocal separation, wherein the microphone array includes Microphones at each seat in the car.
  • Fig. 5 is a structural block diagram of the target speaker determination module 230 in an embodiment of the present disclosure. As shown in FIG. 5, in one embodiment of the present disclosure, the target speaker determination module 230 includes:
  • a keyword extraction unit 2301 configured to extract keywords in the text content
  • a keyword matching unit 2302 configured to match keywords in the text content with a plurality of preset keywords
  • a target speaker determining unit 2303 configured to determine the target speaker based on the matching result.
  • the target speaker determining unit 2303 is configured to establish correspondence between at least two speakers and multiple preset keywords, wherein the at least two speakers include the speaker corresponding to the sound zone to which the target sound source signal belongs;
  • the target speaker determination unit 2303 is further configured to match each of the plurality of preset keywords with keywords in the text content to obtain matching results between at least two speakers and the text content;
  • the target speaker determining unit 2303 is further configured to determine the target speaker based on the matching result and the corresponding relationship.
  • the target speaker determination unit 2303 is configured to establish a first matching relationship between at least two target seats and a plurality of preset keywords, and/or establish a second matching relationship between the persons on the at least two target seats and the multiple keywords, wherein the at least two target seats are set in one-to-one correspondence with the at least two speakers;
  • the target speaker determination unit 2303 is further configured to establish a correspondence between at least two speakers and multiple preset keywords based on the first matching relationship and/or the second matching relationship.
  • control module 240 is further configured to perform noise reduction on the remaining speakers based on the position of the designated speaker, the positions of the remaining speakers other than the designated speaker, and the volume of the target audio when the designated speaker plays the target type audio .
  • the specific implementation of the audio-based processing device in the embodiment of the present disclosure is similar to the specific implementation of the audio-based processing method in the embodiment of the present disclosure.
  • FIG. 6 illustrates a block diagram of an electronic device according to an embodiment of the disclosure.
  • electronic device 10 includes one or more processors 610 and memory 620 .
  • Processor 610 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
  • Memory 620 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 610 may execute the program instructions to implement the audio-based processing methods of the various embodiments of the present disclosure described above and/or other desired functionality.
  • Various contents such as input signal, signal component, noise component, etc. may also be stored in the computer-readable storage medium.
  • the electronic device may further include: an input device 630 and an output device 640, and these components are interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 630 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 640 may include, for example, a display, a speaker, a printer, a communication network and remote output devices connected thereto, and the like.
  • the electronic device may also include any other suitable components according to specific applications.
  • embodiments of the present disclosure may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the audio-based processing method according to various embodiments of the present disclosure described in the "exemplary method" section of this specification.
  • the methods and apparatus of the present disclosure may be implemented in many ways.
  • the methods and apparatuses of the present disclosure may be implemented by software, hardware, firmware or any combination of software, hardware, and firmware.
  • the above sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise.
  • the present disclosure can also be implemented as programs recorded in recording media including machine-readable instructions for realizing the method according to the present disclosure.
  • the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
  • each component or each step can be decomposed and/or reassembled. These decompositions and/or recombinations should be considered equivalents of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio-based processing method and apparatus. The processing method includes: extracting a target sound source signal from a mixed audio signal collected by a microphone array (S1); recognizing, from the target sound source signal, the text content corresponding to the target sound source signal (S2); determining a target speaker based on the text content (S3); controlling the target speaker to play the voice corresponding to the target sound source signal (S4); and performing echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in that sound zone, and the voice playback volume of the target speaker (S5). This enables smooth communication among the occupants while the vehicle drives at high speed, and prevents a talker's own voice from being played through the speaker of the talker's own sound zone, improving the user experience.

Description

Audio-based processing method and apparatus
This application claims priority to Chinese Patent Application No. 202110959350.0, entitled "Audio-based processing method and apparatus" and filed with the China National Intellectual Property Administration on August 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of vehicle technology and the field of audio processing technology, and in particular to an audio-based processing method and apparatus.
Background
While a vehicle is driving at high speed, in-cabin noise severely affects the occupants' hearing. For the driver in particular, strong noise is distracting and compromises driving safety.
In the related art, signal acquisition and noise reduction can reduce noise to some extent. However, existing noise reduction only suppresses wind noise and tire noise; when several people chat in the vehicle, the signal played by the speakers is a mixture of multiple voices, and a listener hears his or her own voice from the speakers, which makes for a poor user experience.
Summary
To solve the above technical problem, embodiments of the present disclosure provide an audio-based processing method and apparatus.
According to a first aspect of the embodiments of the present disclosure, an audio-based processing method is provided, including:
extracting a target sound source signal from a mixed audio signal collected by a microphone array;
recognizing, from the target sound source signal, text content corresponding to the target sound source signal;
determining a target speaker based on the text content;
controlling the target speaker to play the voice corresponding to the target sound source signal; and
performing echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the voice playback volume of the target speaker.
According to a second aspect of the embodiments of the present disclosure, an audio-based processing apparatus is provided, including:
a sound source signal extraction module configured to extract a target sound source signal from a mixed audio signal collected by a microphone array;
a sound source signal recognition module configured to recognize, from the target sound source signal, text content corresponding to the target sound source signal;
a target speaker determination module configured to determine a target speaker based on the text content;
a control module configured to control the target speaker to play the voice corresponding to the target sound source signal; and
an echo cancellation module configured to perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in that sound zone, and the voice playback volume of the target speaker.
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program used to execute the audio-based processing method of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor;
the processor being configured to read the executable instructions from the memory and execute them to implement the audio-based processing method of the first aspect.
With the audio-based processing method and apparatus provided by the above embodiments of the present disclosure, a target sound source signal is extracted from the mixed audio signal collected by the microphone array, the text content corresponding to the target sound source signal is recognized from it, and the speaker to be used is determined from the text content. On the one hand, the target speaker is controlled to play the voice corresponding to the target sound source signal, enabling smooth communication among the occupants while the vehicle drives at high speed; on the other hand, based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice, echo cancellation is performed on the speaker in that sound zone, so that the talker does not hear his or her own voice from the speaker of his or her own sound zone, improving the user experience.
The technical solution of the present disclosure is described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of embodiments of the present disclosure taken in conjunction with the accompanying drawings. The drawings are provided for a further understanding of the embodiments of the present disclosure, constitute a part of the specification, and serve, together with the embodiments, to explain the present disclosure without limiting it. In the drawings, the same reference numerals generally denote the same components or steps.
Fig. 1 is a schematic flowchart of an audio-based processing method according to an embodiment of the present disclosure.
Fig. 2 is a structural block diagram of an audio-based processing apparatus according to an embodiment of the present disclosure.
Fig. 3 is a structural block diagram of the echo cancellation module 250 in an embodiment of the present disclosure.
Fig. 4 is a structural block diagram of the sound source signal extraction module 210 in an embodiment of the present disclosure.
Fig. 5 is a structural block diagram of the target speaker determination module 230 in an embodiment of the present disclosure.
Fig. 6 is a structural diagram of an electronic device provided by an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments of the present disclosure are described in detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the example embodiments described here.
Note that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
Those skilled in the art will appreciate that terms such as "first" and "second" in the embodiments of the present disclosure are only used to distinguish different steps, devices, modules, and the like; they denote neither any particular technical meaning nor a necessary logical order among them.
It should also be understood that in the embodiments of the present disclosure, "plural" may mean two or more, and "at least one" may mean one, two, or more.
Embodiments of the present disclosure may be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and so on.
Exemplary Method
Fig. 1 is a schematic flowchart of an audio-based processing method according to an embodiment of the present disclosure. This embodiment can be applied to an electronic device and, as shown in Fig. 1, includes the following steps:
S1: Extract a target sound source signal from the mixed audio signal collected by the microphone array.
Specifically, a microphone array is provided in the vehicle, through which the sound source signal of the passenger at each seat can be collected. One microphone and one speaker are provided for each seat. Taking a two-row, five-seat vehicle as an example, the microphone array includes five microphones, located at the driver's seat, the front passenger seat, the rear left passenger seat, the rear middle passenger seat, and the rear right passenger seat. Each microphone belongs to a fixed sound zone; for example, the driver's seat microphone belongs to the driver's seat sound zone, the front passenger seat microphone belongs to the front passenger seat sound zone, and so on.
After the microphone array collects a mixed audio signal containing noise and at least one human voice, the mixed signal is processed to extract a target sound source signal, for example the voice signal of a particular passenger. If only one person speaks during a certain period (for example, within 5 seconds), the mixed audio signal for that period contains only noise and that person's sound source signal; if several people speak during the period, it contains noise and the sound source signals of multiple people, the target sound source signal must be extracted from them, and each remaining sound source signal other than the target is processed in the same way through the subsequent steps (i.e., steps S2 to S5).
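Purely as an illustrative sketch (not part of the patent text), the per-period handling described above can be organized as a loop that feeds every separated source through the later steps; the Python names separate_sources and run_s2_to_s5 below are hypothetical stand-ins for the separation stage and the downstream pipeline.

    def process_period(mixed_audio, separate_sources, run_s2_to_s5):
        """Split one time period (e.g. 5 s) of the mixed signal into per-talker
        sources (one source if a single person spoke, several otherwise) and run
        each of them through recognition, speaker selection, playback, and echo
        cancellation (steps S2 to S5)."""
        for zone, source in separate_sources(mixed_audio):  # (sound zone, clean voice)
            run_s2_to_s5(zone, source)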
S2: Recognize, from the target sound source signal, the text content corresponding to the target sound source signal.
Specifically, audio recognition technology is used to recognize the target sound source signal and obtain the text content corresponding to it.
S3: Determine the target speaker based on the text content.
Specifically, text processing such as word segmentation is applied to the text content to obtain the nouns, verbs, adjectives, and so on of each sentence. Since the text content usually contains words that identify the chat partner, those words can be found by text processing; the speaker corresponding to the chat partner can then be determined and used as the target speaker.
S4: Control the target speaker to play the voice corresponding to the target sound source signal.
S5: Perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in that sound zone, and the playback volume of the voice.
Specifically, the vehicle audio system stores the position of each speaker in advance, models the actual sound emission at each position in the vehicle from prior measurements, and computes an optimal cancellation function for each position. Based on the speaker position of the sound zone to which the talker belongs (that is, the speaker position in the sound zone to which the target sound source signal belongs), the speaker position of the sound zone to which the chat partner belongs (that is, the position of the target speaker), and the playback volume, the optimal cancellation function generates a cancellation signal for canceling the talker's audio; based on this signal, the talker's own voice can be canceled in the speaker of the talker's sound zone.
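The patent does not disclose the form of the optimal cancellation function. As a minimal free-field sketch in Python (the delay-and-invert model, the 1/r attenuation, and all names are assumptions, not the patented method), a cancellation signal might be derived from the talker's voice, the geometry, and the playback volume:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def cancellation_signal(talker_voice, zone_speaker_pos, talker_pos,
                            playback_volume, fs=16000):
        """Anti-phase copy of the talker's voice for the talker's own zone
        speaker: delay by the propagation time, scale by a crude 1/r spreading
        loss, and invert. A measured cancellation function would replace this."""
        d = float(np.linalg.norm(np.asarray(zone_speaker_pos) - np.asarray(talker_pos)))
        delay = int(round(d / SPEED_OF_SOUND * fs))   # propagation delay in samples
        gain = playback_volume / max(d, 0.1)          # naive distance attenuation
        delayed = np.concatenate([np.zeros(delay), talker_voice])[:len(talker_voice)]
        return -gain * delayed                         # anti-phase cancellation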
In this embodiment, a target sound source signal is extracted from the mixed audio signal collected by the microphone array, the text content corresponding to the target sound source signal is recognized from it, and the speaker to be used is determined from the text content. On the one hand, the target speaker is controlled to play the voice corresponding to the text content, enabling smooth communication among the occupants while the vehicle drives at high speed; on the other hand, based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice, echo cancellation is performed on the speaker in that sound zone, so that the talker does not hear his or her own voice from the speaker of his or her own sound zone, improving the user experience.
In one embodiment of the present disclosure, step S5 includes:
S5-1: Obtain the position in space of the auditory organ of the person producing the target sound source signal.
In one example of the present disclosure, a camera for capturing in-cabin video or images is installed in the vehicle. Based on the image captured by the camera and the camera's parameters (such as focal length and resolution), image analysis can determine the position in space of the auditory organ of the person producing the target sound source signal, that is, the position of the talker's ears.
In another example of the present disclosure, a radar is installed in the vehicle, and analysis of the point cloud scanned by the radar can determine the position in space of the auditory organ of the person producing the target sound source signal.
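The text only states that image analysis with the camera parameters yields the ear position. One plausible concrete step, assuming an ear landmark detector and a depth estimate are available (both are assumptions), is pinhole back-projection:

    import numpy as np

    def ear_position_from_pixel(u, v, depth_m, fx, fy, cx, cy):
        """Back-project a detected ear pixel (u, v) at an estimated depth into
        3-D camera coordinates using the pinhole model; fx, fy, cx, cy are the
        camera intrinsics (focal lengths and principal point) from calibration."""
        x = (u - cx) * depth_m / fx
        y = (v - cy) * depth_m / fy
        return np.array([x, y, depth_m])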
S5-2: Perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position in space of that person's auditory organ, the position of the target speaker, the speaker position in that sound zone, and the volume of the voice.
Specifically, the distance between the speaker and the human ear can be computed from the ear position and the position of the speaker in the talker's sound zone. When applying the optimal cancellation function, an optimal cancellation signal for canceling the talker's audio is generated from this distance, the speaker position of the chat partner's sound zone (that is, the position of the target speaker), and the playback volume; based on the optimal cancellation signal, the talker's own voice in the speaker of the talker's sound zone can be canceled to the greatest extent.
In this embodiment, the distance between the talker and the speaker is determined by obtaining the position in space of the talker's auditory organ and the position of the talker's speaker, and the cancellation signal can be adjusted dynamically based on this real-time distance, so that the talker's own voice in the speaker of the talker's sound zone is canceled to the greatest extent.
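A sketch of the dynamic adjustment, under the assumption that the cancellation amplitude simply follows a 1/r law around the reference distance at which the base signal was generated:

    import numpy as np

    def rescale_cancellation(base_cancellation, ear_pos, zone_speaker_pos, ref_dist=0.5):
        """Rescale a previously generated cancellation signal as the talker's
        ear moves relative to the zone speaker (real-time distance tracking)."""
        d = float(np.linalg.norm(np.asarray(ear_pos) - np.asarray(zone_speaker_pos)))
        return base_cancellation * (ref_dist / max(d, 1e-2))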
In one embodiment of the present disclosure, step S1 includes:
S1-1: Detect whether each seat in the vehicle is occupied.
Specifically, whether each seat is occupied can be determined by means such as image recognition, infrared detection, or seat weight detection.
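A toy fusion of the three occupancy cues just mentioned (the threshold value is an arbitrary assumption):

    def seat_occupied(weight_kg, infrared_hit, vision_hit, weight_threshold_kg=20.0):
        """Mark a seat occupied if any cue fires: image recognition, infrared
        detection, or the seat weight sensor exceeding a threshold."""
        return bool(vision_hit or infrared_hit or weight_kg > weight_threshold_kg)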
S1-2: Perform voice separation on the target audio signal based on the sound zones of the microphones corresponding to the occupied seats, and extract the target sound source signal from the separation result, where the microphone array includes the microphones arranged at the seats in the vehicle.
Specifically, voice separation is performed only for the microphones corresponding to occupied seats, followed by tire noise suppression and wind noise suppression, finally yielding the target sound source signal. In this embodiment, a voice separation model can be trained on multiple sound source signals, for example on the sound source signals of people who frequently ride in the vehicle, and effective voice separation can then be performed with the trained model. A dynamic gain control function for suppressing dynamic tire noise and wind noise is provided in advance, and real-time tire noise and wind noise are suppressed based on it.
In this embodiment, performing voice separation only for the microphones at occupied seats improves separation efficiency and reduces system resource consumption. In addition, the dynamic gain control function enables real-time tire and wind noise suppression. After voice separation, tire noise suppression, and wind noise suppression, the target sound source signal can be extracted accurately.
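The dynamic gain control function is not specified further; a common shape it could take, sketched here as spectral-subtraction-style gains with a noise estimate tracked in non-speech frames (all parameter values are assumptions):

    import numpy as np

    def update_noise_estimate(noise_est, frame_mag, is_speech, alpha=0.95):
        """Track the tire/wind noise magnitude spectrum by recursive averaging
        over frames that contain no speech."""
        return noise_est if is_speech else alpha * noise_est + (1.0 - alpha) * frame_mag

    def dynamic_gain(frame_mag, noise_est, floor=0.1):
        """Per-bin suppression gain: attenuate bins dominated by the current
        noise estimate, with a spectral floor to limit musical-noise artifacts."""
        g = 1.0 - noise_est / np.maximum(frame_mag, 1e-8)
        return np.clip(g, floor, 1.0)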
In one embodiment of the present disclosure, step S3 includes:
S3-1: Extract keywords from the text content, by applying word segmentation and keyword extraction to the text content.
S3-2: Match the keywords in the text content against multiple preset keywords, where each of the preset keywords corresponds to a speaker.
S3-3: Determine the target speaker based on the matching result. For example, for speaker A the corresponding preset keywords include A1, A2, and A3; if the text content contains any of the keywords A1, A2, or A3, speaker A can be determined to be the target speaker. Corresponding preset keywords are likewise set for the other speakers.
In this embodiment, the keywords in the text content are matched against multiple preset keywords, and the target speaker can be determined quickly and accurately from the matching result.
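A minimal sketch of the keyword extraction and matching (S3-1/S3-2); scanning for vocabulary entries as substrings stands in for real word segmentation (a segmenter such as jieba would normally be used for Chinese), and the speaker names and keyword sets are invented for illustration:

    def extract_keywords(text, vocabulary):
        """S3-1 sketch: collect the known keywords that occur in the text."""
        return {kw for kw in vocabulary if kw in text}

    def match_speakers(text_keywords, preset_keywords):
        """S3-2 sketch: speakers whose preset keyword sets intersect the text."""
        return [spk for spk, kws in preset_keywords.items() if kws & text_keywords]

    presets = {"driver_speaker": {"司机", "主驾"}, "copilot_speaker": {"副驾", "前排乘客"}}
    kws = extract_keywords("司机，下个路口左转", {"司机", "主驾", "副驾", "前排乘客"})
    print(match_speakers(kws, presets))  # ['driver_speaker']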
In one embodiment of the present disclosure, step S3-3 includes:
S3-3-1: Establish a correspondence between at least two speakers and the multiple preset keywords, where the at least two speakers include the speaker corresponding to the sound zone to which the target sound source signal belongs.
Specifically, the at least two speakers include the talker's speaker and the chat partner's speaker. Optionally, the correspondence between all speakers and their keywords is established in advance; for example, if five speakers are installed in a five-seat vehicle, the correspondence between the five speakers and their preset keywords can be established in advance.
S3-3-2: Match each of the multiple preset keywords against the keywords in the text content to obtain matching results between the at least two speakers and the text content.
S3-3-3: Determine the target speaker based on the matching results and the correspondence.
In this embodiment, the correspondence may be established only for some of the speakers in the vehicle. If a keyword in the text content matches one of the preset keywords, the talker of the sound source signal has a designated chat partner, and the speaker corresponding to the matched preset keyword is used as the target speaker; if the keywords in the text content fail to match all of the preset keywords, the talker has no designated chat partner, and the speakers of all occupied seats, or all speakers, can be used as the target speakers.
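The fallback rule just described might look like this (the names are placeholders):

    def select_targets(matched_speakers, occupied_seat_speakers):
        """S3-3-3 sketch: a successful match designates the chat partner's
        speaker(s); otherwise route the voice to the speakers of all occupied
        seats (or, alternatively, to all speakers)."""
        return list(matched_speakers) if matched_speakers else list(occupied_seat_speakers)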
In one embodiment of the present disclosure, step S3-3-1 includes:
S3-3-1-1: Establish a first matching relationship between at least two target seats and the multiple preset keywords, and/or a second matching relationship between the persons on the at least two target seats and the multiple keywords, where the at least two target seats correspond one-to-one with the at least two speakers.
Specifically, a target seat may be bound to preset keywords, or the person on a target seat may be bound to preset keywords. When binding a person, the person's name, alias, or code name may be bound to the preset keywords, for example binding the alias "Lao San" (老三) to a designated person.
S3-3-1-2: Based on the first matching relationship and/or the second matching relationship, establish the correspondence between the at least two speakers and the multiple preset keywords.
Specifically, the correspondence may be established from the first matching relationship. For example, the driver's seat speaker may be associated with keywords such as "main driver's seat", "main driver", and "driver", and the front passenger seat speaker may likewise be associated with keywords such as "front passenger seat", "co-driver", and "front passenger".
The correspondence may also be established from the second matching relationship. For example, if the person with the alias "Lao San" sits in the rear left passenger seat, then after that person's seating position has been determined (for example, by image recognition), the rear left passenger seat speaker can be associated with the keyword "Lao San", and it may also be associated with that person's real name.
In this embodiment, a matching relationship with a speaker can be established from a seat, a name, an alias, or a code name and used as the correspondence between speakers and preset keywords. When a preset keyword from the matching relationship appears among the keywords of the text content corresponding to the target sound source signal, the target speaker and the chat partner can be determined quickly and accurately.
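A sketch of merging the two matching relationships into one speaker-to-keywords table (step S3-3-1-2); the dictionary layout and the helper that locates a person's seat are assumptions:

    def build_correspondence(seat_keywords, person_keywords, person_seat, seat_speaker):
        """Merge the first matching relationship (seat -> keywords) and the
        second (person -> keywords, with the person located via e.g. in-cabin
        image recognition) into a speaker -> preset-keywords table."""
        table = {}
        for seat, kws in seat_keywords.items():
            table.setdefault(seat_speaker[seat], set()).update(kws)
        for person, kws in person_keywords.items():
            seat = person_seat.get(person)
            if seat is not None:
                table.setdefault(seat_speaker[seat], set()).update(kws)
        return table

    # e.g. the alias "老三" bound to whoever sits in the rear left passenger seat:
    table = build_correspondence({"driver": {"司机"}}, {"老三": {"老三"}},
                                 {"老三": "rear_left"},
                                 {"driver": "spk_1", "rear_left": "spk_4"})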
In one embodiment of the present disclosure, the audio-based processing method further includes: when a designated speaker plays audio of a target type, performing noise reduction on the remaining speakers based on the position of the designated speaker, the positions of the remaining speakers other than the designated speaker, and the volume of the target audio.
In this embodiment, target-type audio includes the audio output when a passenger interacts with the system, listens to music, or watches a movie. In that case, noise reduction can be performed based on the playback volume of the speaker at that passenger's position, the position of that speaker, and the positions of the speakers that need noise reduction (for example, speakers at occupied seats whose occupants do not want to be disturbed).
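The computation behind this noise reduction is left open by the text; one toy reading, with a purely invented attenuation law, derives a per-speaker gain from the media volume and the distance to the media zone:

    import numpy as np

    def masking_gains(designated_pos, remaining_speaker_positions, media_volume,
                      do_not_disturb=()):
        """Attenuate each remaining speaker: louder media and shorter distance
        to the media zone mean a lower gain; do-not-disturb zones are muted."""
        gains = {}
        for name, pos in remaining_speaker_positions.items():
            if name in do_not_disturb:
                gains[name] = 0.0
                continue
            d = float(np.linalg.norm(np.asarray(pos) - np.asarray(designated_pos)))
            gains[name] = 1.0 / (1.0 + media_volume / max(d, 0.1))
        return gains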
In one embodiment of the present disclosure, after step S5 the method further includes:
S6: If a preset end-of-chat keyword is recognized in the mixed audio signal collected by the microphone array, turn off the speaker to which the sound source signal containing that keyword belongs.
In this embodiment, if during an in-vehicle chat a person is detected saying a preset end-of-chat keyword (for example, "stop chatting" or "stop talking"), that person does not want to continue the conversation; the person's speaker is then turned off so that other people's chatter (for example, someone's idle chat with no designated partner) does not disturb this person.
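Step S6 reduces to a small piece of bookkeeping; a sketch (the keyword tuple mirrors the examples in the text, the data structures are assumptions):

    END_CHAT_KEYWORDS = ("不聊了", "不说了")  # "stop chatting", "stop talking"

    def maybe_close_zone_speaker(recognized_text, zone, active_zones):
        """If the talker in `zone` utters a preset end-of-chat keyword, remove
        that zone's speaker from the set of active chat outputs."""
        if any(kw in recognized_text for kw in END_CHAT_KEYWORDS):
            active_zones.discard(zone)
        return active_zones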
Any audio-based processing method provided by the embodiments of the present disclosure may be executed by any appropriate device with data processing capability, including but not limited to terminal devices and servers. Alternatively, any audio-based processing method provided by the embodiments of the present disclosure may be executed by a processor, for example by the processor calling corresponding instructions stored in a memory. Details are not repeated below.
Exemplary Apparatus
Fig. 2 is a structural block diagram of an audio-based processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 2, the apparatus includes: a sound source signal extraction module 210, a sound source signal recognition module 220, a target speaker determination module 230, a control module 240, and an echo cancellation module 250.
The sound source signal extraction module 210 is configured to extract a target sound source signal from a mixed audio signal collected by a microphone array;
the sound source signal recognition module 220 is configured to recognize, from the target sound source signal, the text content corresponding to the target sound source signal;
the target speaker determination module 230 is configured to determine the target speaker based on the text content;
the control module 240 is configured to control the target speaker to play the voice corresponding to the target sound source signal;
the echo cancellation module 250 is configured to perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in that sound zone, and the voice playback volume of the target speaker.
Fig. 3 is a structural block diagram of the echo cancellation module 250 in an embodiment of the present disclosure. As shown in Fig. 3, the echo cancellation module 250 includes:
an auditory organ locating unit 2501 configured to obtain the position in space of the auditory organ of the person producing the target sound source signal;
an echo cancellation unit 2502 configured to perform echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position in space of that person's auditory organ, the position of the target speaker, the speaker position in that sound zone, and the volume of the voice.
Fig. 4 is a structural block diagram of the sound source signal extraction module 210 in an embodiment of the present disclosure. As shown in Fig. 4, the sound source signal extraction module 210 includes:
a detection unit 2101 configured to detect whether each seat in the vehicle is occupied;
a sound source signal processing unit 2102 configured to perform voice separation on the target audio signal based on the sound zones of the microphones corresponding to the occupied seats and to extract the target sound source signal from the separation result, where the microphone array includes the microphones arranged at the seats in the vehicle.
Fig. 5 is a structural block diagram of the target speaker determination module 230 in an embodiment of the present disclosure. As shown in Fig. 5, the target speaker determination module 230 includes:
a keyword extraction unit 2301 configured to extract keywords from the text content;
a keyword matching unit 2302 configured to match the keywords in the text content against multiple preset keywords;
a target speaker determination unit 2303 configured to determine the target speaker based on the matching result.
In one embodiment of the present disclosure, the target speaker determination unit 2303 is configured to establish a correspondence between at least two speakers and the multiple preset keywords, where the at least two speakers include the speaker corresponding to the sound zone to which the target sound source signal belongs;
the target speaker determination unit 2303 is further configured to match each of the multiple preset keywords against the keywords in the text content to obtain matching results between the at least two speakers and the text content;
the target speaker determination unit 2303 is further configured to determine the target speaker based on the matching results and the correspondence.
In one embodiment of the present disclosure, the target speaker determination unit 2303 is configured to establish a first matching relationship between at least two target seats and the multiple preset keywords, and/or a second matching relationship between the persons on the at least two target seats and the multiple keywords, where the at least two target seats correspond one-to-one with the at least two speakers;
the target speaker determination unit 2303 is further configured to establish the correspondence between the at least two speakers and the multiple preset keywords based on the first matching relationship and/or the second matching relationship.
In one embodiment of the present disclosure, the control module 240 is further configured to, when a designated speaker plays target-type audio, perform noise reduction on the remaining speakers based on the position of the designated speaker, the positions of the remaining speakers other than the designated speaker, and the volume of the target audio.
Note that the specific implementation of the audio-based processing apparatus of the embodiments of the present disclosure is similar to that of the audio-based processing method of the embodiments of the present disclosure; see the method section for details, which are not repeated here to reduce redundancy.
Exemplary Electronic Device
An electronic device according to an embodiment of the present disclosure is described below with reference to Fig. 6.
Fig. 6 illustrates a block diagram of an electronic device according to an embodiment of the present disclosure. As shown in Fig. 6, the electronic device 10 includes one or more processors 610 and a memory 620.
The processor 610 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device to perform desired functions.
The memory 620 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 610 may run the program instructions to implement the audio-based processing methods of the embodiments of the present disclosure described above and/or other desired functions. Various contents such as input signals, signal components, and noise components may also be stored on the computer-readable storage medium.
In one example, the electronic device may further include an input device 630 and an output device 640, interconnected through a bus system and/or other forms of connection mechanisms (not shown).
The input device 630 may include, for example, a keyboard and a mouse.
The output device 640 may include, for example, a display, speakers, a printer, a communication network and the remote output devices connected to it, and so on.
Of course, for simplicity, Fig. 6 shows only some of the components of this electronic device that are relevant to the present disclosure, omitting components such as buses and input/output interfaces. In addition, the electronic device may include any other appropriate components depending on the specific application.
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, embodiments of the present disclosure may also be a computer program product including computer program instructions that, when run by a processor, cause the processor to perform the steps of the audio-based processing method according to the various embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
The basic principles of the present disclosure have been described above with reference to specific embodiments. However, the merits, advantages, and effects mentioned in the present disclosure are merely examples, not limitations, and should not be considered necessary to each embodiment of the present disclosure. The specific details disclosed above are for illustration and ease of understanding only, not for limitation; the present disclosure is not restricted to being implemented with those specific details.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief; see the relevant parts of the method embodiments.
The block diagrams of the devices, apparatuses, equipment, and systems involved in the present disclosure are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured as shown; as those skilled in the art will recognize, they may be connected, arranged, or configured in any manner.
The methods and apparatuses of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless otherwise specified. Furthermore, in some embodiments the present disclosure may also be implemented as programs recorded in a recording medium, including machine-readable instructions for implementing the method according to the present disclosure; thus, the present disclosure also covers the recording medium storing the program for executing the method according to the present disclosure.
It should also be noted that in the apparatuses, devices, and methods of the present disclosure, the components or steps may be decomposed and/or recombined; such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. An audio-based processing method, comprising:
    extracting a target sound source signal from a mixed audio signal collected by a microphone array;
    recognizing, from the target sound source signal, text content corresponding to the target sound source signal;
    determining a target speaker based on the text content;
    controlling the target speaker to play a voice corresponding to the target sound source signal; and
    performing echo cancellation on a speaker in a sound zone to which the target sound source signal belongs, based on a position of the target speaker, a speaker position in the sound zone to which the target sound source signal belongs, and a voice playback volume of the target speaker.
  2. The audio-based processing method according to claim 1, wherein the performing echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice, comprises:
    obtaining a position in space of an auditory organ of a person producing the target sound source signal; and
    performing echo cancellation on the speaker in the sound zone to which the target sound source signal belongs, based on the position in space of the auditory organ of the person producing the target sound source signal, the position of the target speaker, the speaker position in the sound zone to which the target sound source signal belongs, and the volume of the voice.
  3. The audio-based processing method according to claim 1, wherein the extracting a target sound source signal from the mixed audio signal collected by the microphone array comprises:
    detecting whether each seat in a vehicle is occupied; and
    performing voice separation on the target audio signal based on sound zones of microphones corresponding to occupied seats, and extracting the target sound source signal according to a result of the voice separation, wherein the microphone array comprises microphones arranged at the seats in the vehicle.
  4. The audio-based processing method according to claim 1, wherein the determining the target speaker based on the text content comprises:
    extracting keywords from the text content;
    matching the keywords in the text content against a plurality of preset keywords; and
    determining the target speaker based on a matching result.
  5. The audio-based processing method according to claim 4, wherein the matching the keywords in the text content against the plurality of preset keywords and determining the target speaker based on the matching result comprises:
    establishing a correspondence between at least two speakers and the plurality of preset keywords, wherein the at least two speakers comprise a speaker corresponding to the sound zone to which the target sound source signal belongs;
    matching each of the plurality of preset keywords against the keywords in the text content to obtain matching results between the at least two speakers and the text content; and
    determining the target speaker based on the matching results and the correspondence.
  6. The audio-based processing method according to claim 5, wherein the establishing a correspondence between at least two speakers and the plurality of preset keywords comprises:
    establishing a first matching relationship between at least two target seats and the plurality of preset keywords, and/or establishing a second matching relationship between persons on the at least two target seats and the plurality of keywords, wherein the at least two target seats are arranged in one-to-one correspondence with the at least two speakers; and
    establishing the correspondence between the at least two speakers and the plurality of preset keywords based on the first matching relationship and/or the second matching relationship.
  7. The audio-based processing method according to claim 1, further comprising:
    when a designated speaker plays audio of a target type, performing noise reduction on remaining speakers other than the designated speaker based on a position of the designated speaker, positions of the remaining speakers, and a volume of the target audio.
  8. An audio-based processing apparatus, comprising:
    a sound source signal extraction module configured to extract a target sound source signal from a mixed audio signal collected by a microphone array;
    a sound source signal recognition module configured to recognize, from the target sound source signal, text content corresponding to the target sound source signal;
    a target speaker determination module configured to determine a target speaker based on the text content;
    a control module configured to control the target speaker to play a voice corresponding to the target sound source signal; and
    an echo cancellation module configured to perform echo cancellation on a speaker in a sound zone to which the target sound source signal belongs, based on a position of the target speaker, a speaker position in the sound zone to which the target sound source signal belongs, and a voice playback volume of the target speaker.
  9. A computer-readable storage medium storing a computer program for executing the audio-based processing method according to any one of claims 1 to 7.
  10. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the audio-based processing method according to any one of claims 1 to 7.
PCT/CN2022/113733 2021-08-20 2022-08-19 Audio-based processing method and apparatus WO2023020620A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/260,120 US20240304201A1 (en) 2021-08-20 2022-08-19 Audio-based processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110959350.0 2021-08-20
CN202110959350.0A CN113674754A (zh) Audio-based processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023020620A1 (zh)

Family

ID=78544306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113733 WO2023020620A1 (zh) 2021-08-20 2022-08-19 Audio-based processing method and apparatus

Country Status (3)

Country Link
US (1) US20240304201A1 (zh)
CN (1) CN113674754A (zh)
WO (1) WO2023020620A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118571219A (zh) * 2024-08-02 2024-08-30 成都赛力斯科技有限公司 Method, apparatus, device and storage medium for enhancing dialogue among cabin occupants

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674754A (zh) 2021-08-20 2021-11-19 深圳地平线机器人科技有限公司 Audio-based processing method and apparatus
CN114678021B (zh) * 2022-03-23 2023-03-10 小米汽车科技有限公司 Audio signal processing method and apparatus, storage medium, and vehicle

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040101145A1 (en) * 2002-11-26 2004-05-27 Falcon Stephen R. Dynamic volume control
CN109817240A (zh) * 2019-03-21 2019-05-28 北京儒博科技有限公司 Signal separation method, apparatus, device and storage medium
CN110070868A (zh) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interaction method and apparatus for an in-vehicle system, automobile, and machine-readable medium
US20200219493A1 (en) * 2019-01-07 2020-07-09 2236008 Ontario Inc. Voice control in a multi-talker and multimedia environment
CN111629301A (zh) * 2019-02-27 2020-09-04 北京地平线机器人技术研发有限公司 Method, apparatus and electronic device for controlling multiple speakers to play audio
US20210092522A1 (en) * 2019-09-20 2021-03-25 Peiker Acustic Gmbh System, method, and computer readable storage medium for controlling an in car communication system
CN113674754A (zh) 2021-08-20 2021-11-19 深圳地平线机器人科技有限公司 Audio-based processing method and apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9947334B2 (en) * 2014-12-12 2018-04-17 Qualcomm Incorporated Enhanced conversational communications in shared acoustic space
KR20180102871A (ko) * 2017-03-08 2018-09-18 엘지전자 주식회사 Mobile terminal and vehicle control method of mobile terminal
CN108022597A (zh) * 2017-12-15 2018-05-11 北京远特科技股份有限公司 Sound processing system and method, and vehicle
CN113053402B (zh) * 2021-03-04 2024-03-12 广州小鹏汽车科技有限公司 Voice processing method and apparatus, and vehicle

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040101145A1 (en) * 2002-11-26 2004-05-27 Falcon Stephen R. Dynamic volume control
US20200219493A1 (en) * 2019-01-07 2020-07-09 2236008 Ontario Inc. Voice control in a multi-talker and multimedia environment
CN111629301A (zh) * 2019-02-27 2020-09-04 北京地平线机器人技术研发有限公司 Method, apparatus and electronic device for controlling multiple speakers to play audio
CN109817240A (zh) * 2019-03-21 2019-05-28 北京儒博科技有限公司 Signal separation method, apparatus, device and storage medium
CN110070868A (zh) * 2019-04-28 2019-07-30 广州小鹏汽车科技有限公司 Voice interaction method and apparatus for an in-vehicle system, automobile, and machine-readable medium
US20210092522A1 (en) * 2019-09-20 2021-03-25 Peiker Acustic Gmbh System, method, and computer readable storage medium for controlling an in car communication system
CN113674754A (zh) * 2021-08-20 2021-11-19 深圳地平线机器人科技有限公司 Audio-based processing method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118571219A (zh) * 2024-08-02 2024-08-30 成都赛力斯科技有限公司 Method, apparatus, device and storage medium for enhancing dialogue among cabin occupants
CN118571219B (zh) * 2024-08-02 2024-10-15 成都赛力斯科技有限公司 Method, apparatus, device and storage medium for enhancing dialogue among cabin occupants

Also Published As

Publication number Publication date
US20240304201A1 (en) 2024-09-12
CN113674754A (zh) 2021-11-19

Similar Documents

Publication Publication Date Title
WO2023020620A1 (zh) Audio-based processing method and apparatus
EP3776535B1 (en) Multi-microphone speech separation
US20230122905A1 (en) Audio-visual speech separation
US9293133B2 (en) Improving voice communication over a network
US9324322B1 (en) Automatic volume attenuation for speech enabled devices
Chen et al. The first multimodal information based speech processing (misp) challenge: Data, tasks, baselines and results
CN109410978B (zh) Voice signal separation method and apparatus, electronic device, and storage medium
JP2023536270A (ja) System and method for headphone equalization and room adaptation for binaural reproduction in augmented reality
US11810585B2 (en) Systems and methods for filtering unwanted sounds from a conference call using voice synthesis
KR20230098266A (ko) Filtering other speakers' voices from calls and audio messages
US12073849B2 (en) Systems and methods for filtering unwanted sounds from a conference call
JP2023546703A (ja) Multi-channel voice activity detection
US12073844B2 (en) Audio-visual hearing aid
CN110737422B (zh) Sound signal acquisition method and apparatus
WO2023040820A1 (zh) Audio playback method and apparatus, computer-readable storage medium, and electronic device
WO2023116087A1 (zh) Voice interaction instruction processing method and apparatus, and computer-readable storage medium
CN118571219B (zh) Method, apparatus, device and storage medium for enhancing dialogue among cabin occupants
US11935557B2 (en) Techniques for detecting and processing domain-specific terminology
CN114708878A (zh) Audio signal processing method and apparatus, storage medium, and electronic device
CN114974245A (zh) Voice separation method and apparatus, electronic device, and storage medium
JP6169526B2 (ja) Specific voice suppression device, specific voice suppression method, and program
Rani et al. Desktop Based Speech Recognition For Hearing And Visually Impaired Using Machine Learning
CN117953890A (zh) Volume control method in a vehicle, vehicle, electronic device, and storage medium
CN115214503A (zh) In-vehicle sound control method and apparatus, and automobile
WO2024123364A1 (en) Annotating automatic speech recognition transcription

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857929

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18260120

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE