WO2020043037A1 - Voice transcription device, system and method, and electronic device - Google Patents

Voice transcription device, system and method, and electronic device

Info

Publication number
WO2020043037A1
WO2020043037A1 PCT/CN2019/102482 CN2019102482W WO2020043037A1 WO 2020043037 A1 WO2020043037 A1 WO 2020043037A1 CN 2019102482 W CN2019102482 W CN 2019102482W WO 2020043037 A1 WO2020043037 A1 WO 2020043037A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voice signal
target
signal
transcription
Prior art date
Application number
PCT/CN2019/102482
Other languages
French (fr)
Chinese (zh)
Inventor
余涛
许云峰
刘章
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020043037A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/76: Television signal recording

Definitions

  • the present application relates to the technical field of voice signal processing, and in particular, to a voice transcription device, system, method, and electronic device.
  • Speech transcription technology has been a hot research topic in the field of speech signal processing in recent years. With the continuous deepening of research, this technology has been widely used in trial venues and multi-person conferences.
  • Figure 1 shows a common speech transcription scene.
  • In this solution, a gooseneck microphone device is placed in front of each person.
  • The gooseneck microphone device collects each person's audio and transmits it to the audio processing device, which amplifies the collected original audio.
  • The amplified audio is then sent to the transcription cloud service, which performs speech transcription on it.
  • the present application provides a voice transcription device to solve the problems that the target voice cannot be picked up and external crosstalk interference exists in the prior art.
  • the application additionally provides a speech transcription system and method, and an electronic device.
  • This application provides a voice transcription device, including:
  • a voice acquisition device configured to acquire, through a microphone array, a voice signal within the receiving range of the array;
  • a sound source positioning device configured to determine a sound source position of a human voice signal if the acquired voice signal includes the human voice signal;
  • a target voice filtering device configured to use the human voice signal as a target voice signal if the sound source position is within a target range; and
  • a signal sending device configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • Optionally, the device further includes:
  • a voice noise reduction device configured to perform voice enhancement on the target voice signal according to the position of the sound source;
  • the signal sending device is specifically configured to send the enhanced target voice signal outward.
  • Optionally, the device further includes:
  • a noise covariance determining device configured to determine a noise covariance of the voice signal if the voice signal includes a noise signal;
  • the voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.
  • Optionally, the device further includes:
  • a target range configuration device configured to acquire the target range and store the target range.
  • Optionally, the target voice filtering device is further configured to shield the human voice signal if the sound source position is not within the target range.
  • Optionally, the arrangement of the microphone array includes a square array or a circular array.
  • Optionally, the device further includes:
  • a voice detection device configured to detect whether the voice signal includes a human voice signal and, if so, to activate the sound source positioning device.
  • Optionally, the device further includes:
  • a voice detection device configured to detect whether the voice signal includes the noise signal and, if so, to activate the noise covariance determining device.
  • This application also provides a speech transcription system, including:
  • the above-mentioned voice transcription device and a voice transcription server; wherein the server is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device.
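  • This application also provides a voice transcription method, including:
  • acquiring, through a microphone array, a voice signal within the receiving range of the array;
  • if the voice signal includes a human voice signal, determining a sound source position of the human voice signal;
  • if the sound source position is within a target range, using the human voice signal as a target voice signal; and
  • sending the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • Optionally, the method further includes: performing voice enhancement on the target voice signal according to the sound source position;
  • the sending the target voice signal outward includes: sending the enhanced target voice signal outward.
  • Optionally, the method further includes:
  • if the voice signal includes a noise signal, determining a noise covariance of the voice signal, and suppressing the noise signal according to the noise covariance.
  • Optionally, the method further includes: acquiring the target range and storing the target range.
  • Optionally, the method further includes: if the sound source position is not within the target range, shielding the human voice signal.
  • Optionally, the method further includes: detecting whether the voice signal includes a human voice signal, and detecting whether the voice signal includes the noise signal.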
  • This application also provides an electronic device, including:
  • a processor; and a memory for storing a program that implements the voice transcription method; after the device is powered on and the processor runs the program, the following steps are performed: collecting, through the microphone array, voice signals within the array receiving range; if the voice signal includes a human voice signal, determining a sound source position of the human voice signal; if the sound source position is within a target range, using the human voice signal as a target voice signal; and sending the target voice signal outward, so that the voice transcription server performs voice transcription on the target voice signal.
  • The present application also provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to execute the various methods described above.
  • The present application also provides a computer program product including instructions that, when run on a computer, cause the computer to perform the various methods described above.
  • The voice transcription device collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, it determines a sound source position of the human voice signal; if the sound source position is within a target range, it uses the human voice signal as the target voice signal; and it sends the target voice signal to a voice transcription server, so that the server performs voice transcription on the target voice signal. With this processing method, the microphone array provides multi-microphone enhancement of the voice signal in the pickup area, the sound source position is used to determine whether the voice is the target voice, and sound from outside the target area is filtered out so that it does not reach the transcription server. The target voice can therefore be picked up reliably, resistance to interference from non-target voices is improved, and the quality of voice transcription improves.
  • FIG. 1 is a diagram of a voice transcription scene in the prior art
  • FIG. 2 is a schematic structural diagram of an embodiment of a voice transcription device provided by the present application.
  • FIG. 3a is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application.
  • FIG. 3b is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application.
  • FIG. 4 is a specific structural schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • FIG. 5 is a schematic diagram of another specific structure of an embodiment of a voice transcription device provided by the present application.
  • FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device provided by the present application.
  • FIG. 7 is a system schematic diagram of an embodiment of a voice transcription system provided by the present application.
  • FIG. 8 is a schematic scenario diagram of an embodiment of a speech transcription system provided by the present application.
  • FIG. 9 is a specific flowchart of an embodiment of a voice transcription method provided by the present application.
  • FIG. 10 is a schematic diagram of an embodiment of an electronic device provided by the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • the device includes a voice acquisition device 1, a sound source localization device 2, a target voice filtering device 3, and a signal transmission device 4.
  • the voice acquisition device 1 is configured to acquire a voice signal within a receiving range of the array through a microphone array.
  • the microphone array includes a plurality of microphones, and each microphone is an array element in the array.
  • A microphone is an energy conversion device that converts a sound signal into an electrical signal.
  • In a microphone, the vibration of the sound is transmitted to the diaphragm, which moves the magnet inside to produce a varying current; the varying current is then sent to the subsequent sound processing circuit for amplification.
  • the microphone array can pick up a voice signal within its receiving range.
  • the receiving range is called the array receiving range and refers to the range of the voice signal that the microphone array can receive.
  • the receiving range of the array depends on the arrangement of the array elements and the number of array elements.
  • the size of the microphone array is not only closely related to the collection of speech and noise signals, but also has a certain impact on the accuracy of sound source localization.
  • a microphone is a sound sensor that converts sound signals into voltage signals. When the sound source is far away from the microphone, the microphone cannot collect the sound signal or the collected voltage signal is very small, which causes the signal-to-noise ratio to be too low, which is disadvantageous for estimating the orientation of the sound source.
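  • The larger the spacing between the microphones, the larger the phase difference between the signals they receive from a sound source, and the easier it is to estimate the orientation of the sound source; however, a spacing that is too large (more than half the shortest wavelength of interest) causes spatial aliasing of the phase difference. The smaller the spacing, the smaller the phase difference and the lower the spatial resolution.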
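  • The spacing trade-off can be made concrete with the usual half-wavelength rule, d ≤ c / (2·f_max). The following is a minimal sketch: the rule is a standard array-processing result rather than something stated in this application, and the function names and the assumed speed of sound of 343 m/s are illustrative only.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def max_spacing_without_aliasing(max_freq_hz: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Largest microphone spacing (metres) with d <= wavelength / 2 at max_freq_hz."""
    if max_freq_hz <= 0:
        raise ValueError("max_freq_hz must be positive")
    return c / (2.0 * max_freq_hz)


def max_phase_difference_rad(spacing_m: float, freq_hz: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Largest inter-microphone phase difference (radians) for a far-field source.

    The maximum occurs when the source lies on the line through both microphones.
    """
    return 2.0 * math.pi * freq_hz * spacing_m / c


spacing = max_spacing_without_aliasing(4000.0)  # speech band limited to 4 kHz (assumed)
```

  • At the limiting spacing the largest phase difference is exactly π, so every arrival direction still maps to a distinct phase; a wider spacing would wrap the phase and make directions ambiguous.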
  • the arrangement of the microphone array can be flexibly adjusted according to actual needs.
  • the arrangement of the array elements includes, but is not limited to, a circle, a square, and a linearly arranged shape.
  • FIG. 3a and FIG. 3b are schematic diagrams of a microphone array of an embodiment of a voice transcription device.
  • FIG. 3a shows a square microphone array, in which the array elements are identical and equally spaced.
  • FIG. 3b shows a circular microphone array, in which the array elements are identical and arranged at equal intervals on the circumference.
  • the voice acquisition device 1 can use a microphone array to perform space-time sampling of voice signals within its receiving range under noisy backgrounds, such as conference venues, multimedia classrooms, large-scale stages, video conferences, car hands-free phones, and battlefields.
  • The voice signal may include only a human voice signal, only a noise signal, or both a human voice signal and a noise signal.
  • The voice acquisition device 1 includes three parts, namely: 1) a microphone array; 2) a front-end amplification unit; 3) a multi-channel synchronous sampling unit.
  • The microphone array collects the voice signals in the receiving range of the array and converts them into analog electrical signals; the front-end amplification unit then amplifies the analog electrical signals; the multi-channel synchronous sampling unit then samples the amplified analog signals on all channels simultaneously and converts them into digital electrical signals.
  • The sound source positioning device 2 is configured to determine a sound source position of the human voice signal if the voice signal includes a human voice signal.
  • Sound localization refers to the behavior of a listener determining the direction and distance of a sound source using sound stimuli in the environment. It depends on the physical characteristics of the sound reaching the ears, including differences in frequency, intensity, and duration.
  • The device provided in the embodiment of the present application locates the sound source position from the signals of the multi-channel microphones, and can obtain the position information of the sound source from the differences in the delay with which the sound from the source reaches the different microphones.
  • For example, the maximum delay-and-sum response at each time-frequency (TF) point can be searched to obtain spatial mapping information, from which the position of the sound source is obtained.
  • The sound source localization algorithm is not limited to this algorithm; other algorithms such as MUSIC, cics, and SRP-PHAT may also be used.
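  • The delay-and-sum search described above can be sketched for a single time-frequency point as follows. This is an illustrative sketch under stated assumptions, not the application's implementation: it assumes a far-field plane wave, a planar array, a candidate grid of azimuths, and a speed of sound of 343 m/s, and all function names are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption


def steering_vector(mic_xy, azimuth_deg, freq_hz, c=SPEED_OF_SOUND_M_S):
    """Per-microphone phase factors of a plane wave arriving from azimuth_deg."""
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    k = 2.0 * math.pi * freq_hz / c  # wavenumber
    return [cmath.exp(1j * k * (x * ux + y * uy)) for x, y in mic_xy]


def delay_and_sum_power(snapshot, mic_xy, azimuth_deg, freq_hz):
    """Output power after phase-aligning the channels towards azimuth_deg and summing."""
    a = steering_vector(mic_xy, azimuth_deg, freq_hz)
    summed = sum(am.conjugate() * xm for am, xm in zip(a, snapshot))
    return abs(summed) ** 2


def localize(snapshot, mic_xy, freq_hz, candidates_deg):
    """Return the candidate azimuth with the largest delay-and-sum power."""
    return max(
        candidates_deg,
        key=lambda az: delay_and_sum_power(snapshot, mic_xy, az, freq_hz),
    )


# Simulated check: a 5 cm square array observing a 1 kHz plane wave from 60 degrees.
square = [(0.0, 0.0), (0.05, 0.0), (0.05, 0.05), (0.0, 0.05)]
observed = steering_vector(square, 60, 1000.0)
estimate = localize(observed, square, 1000.0, range(0, 360, 5))
```

  • A practical system would combine the power over many TF points and frames before taking the maximum; the single-point search is kept here only to show the principle.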
  • Existing sound source localization algorithms can be roughly divided into three categories: a) algorithms based on time-delay estimation (TDE); b) algorithms based on high-resolution spectral estimation; c) algorithms based on sparse representation.
  • a sound source localization algorithm can be selected according to requirements.
  • the target voice filtering device 3 is configured to use the voice signal as a target voice signal if the sound source position is within a target range.
  • the target range refers to a spatial range in which a target sound source is located, and can be set by a user according to requirements.
  • the device further includes: target range configuration means, configured to acquire the target range and save the range information in a memory.
  • Specifically, the target voice filtering device 3 can determine, according to the sound source position information, whether the sound source position of the human voice signal is within the target range; if so, the current voice signal is retained, and if not, the current voice signal is shielded.
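  • The retain-or-shield decision can be sketched as follows. The application does not say how the target range is represented; this sketch assumes it is an azimuth sector in degrees, and the names TargetSector and filter_target are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TargetSector:
    """Assumed representation of the target range: an azimuth sector in degrees."""

    start_deg: float
    end_deg: float

    def contains(self, azimuth_deg: float) -> bool:
        az = azimuth_deg % 360.0
        start = self.start_deg % 360.0
        end = self.end_deg % 360.0
        if start <= end:
            return start <= az <= end
        # The sector wraps through 0 degrees, e.g. from 330 to 30.
        return az >= start or az <= end


def filter_target(
    frame: Sequence[float], azimuth_deg: float, sector: TargetSector
) -> Optional[Sequence[float]]:
    """Return the frame if its source lies in the target sector, else None (shielded)."""
    return frame if sector.contains(azimuth_deg) else None
```

  • Returning None for shielded frames keeps out-of-range audio from ever reaching the sending stage, which matches the stated goal that sound outside the target area is not passed to the transcription server.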
  • the signal transmitting device 4 is configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • the enhanced target voice signal is sent to a cloud voice transcription server for voice transcription via a data collection device deployed at a sound source site (such as a conference or trial site).
  • FIG. 4 is a specific schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • The device further includes a voice noise reduction device 5 configured to perform voice enhancement on the target voice signal according to the position of the sound source; correspondingly, the signal transmission device 4 is specifically configured to send the enhanced target voice signal to the voice transcription server.
  • the current direction vector is calculated according to the actual sound direction, and the enhanced direction of the beam is adjusted in real time to achieve the optimal enhancement effect.
  • FIG. 5 is another specific schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • the device further includes: a noise covariance determination device 6.
  • The noise covariance determination device 6 is configured to determine the noise covariance of the voice signal if the voice signal includes a noise signal; correspondingly, the voice noise reduction device 5 is further configured to suppress the noise signal according to the noise covariance.
  • the noise covariance determining device may perform covariance calculation between microphones according to noise audio data in a non-speech segment.
  • the following formula can be used to calculate the noise covariance:
  • $$\Phi_n = X(n,k)\,X(n,k)^{H}$$
  • where $\Phi_n$ is the noise covariance and $X(n,k)$ is the vector composed of the signals of the multiple microphones at the time-frequency (TF) point for time $n$ and frequency bin $k$; the covariance matrix is obtained by multiplying this vector by its conjugate transpose.
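  • The covariance calculation between microphones can be sketched as follows for one frequency bin. Averaging the outer products over several noise-only frames is an assumption made here (it is the usual estimator); the function name and the plain-list data layout are illustrative only.

```python
def noise_covariance(noise_snapshots):
    """Average X * X^H over noise-only snapshots for one frequency bin.

    Each snapshot is a list with one complex value per microphone.
    Returns an M x M matrix as a list of rows.
    """
    if not noise_snapshots:
        raise ValueError("need at least one noise snapshot")
    m = len(noise_snapshots[0])
    cov = [[0j] * m for _ in range(m)]
    for x in noise_snapshots:
        for i in range(m):
            for j in range(m):
                # Entry (i, j) of the outer product of x with its conjugate transpose.
                cov[i][j] += x[i] * x[j].conjugate()
    n = len(noise_snapshots)
    return [[value / n for value in row] for row in cov]
```

  • The result is Hermitian with real, non-negative diagonal entries (the per-microphone noise power), which is the form a beamformer expects.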
  • the voice noise reduction device 5 uses a beamforming technique to separate the target voice signal under a noisy background and enhance it to obtain an enhanced target voice signal.
  • the microphone noise reduction processing is performed by an algorithm such as MVDR, and the optimal noise suppression effect can be obtained according to the current noise field and the target sound source direction.
  • the spatial filter coefficient can be calculated using the following formula:
  • V is the sound propagation direction vector calculated from the sound source localization.
  • the spatial filtering formula is as follows:
  • Y (n, k) is the frequency-domain output after beamforming.
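  • The formulas for the spatial filter coefficient and the spatial filtering are not reproduced in this copy of the text. As a hedged illustration only, the sketch below uses the textbook MVDR form, w = Φ⁻¹V / (Vᴴ Φ⁻¹ V) and Y(n, k) = wᴴ X(n, k), which may differ from the application's exact expressions; the function names are hypothetical, and a numerical library would normally replace the hand-written solver.

```python
def solve(a, b):
    """Solve a @ x = b for a complex square matrix by Gaussian elimination."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0j] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mvdr_weights(noise_cov, v):
    """Textbook MVDR weights: w = Phi^-1 v / (v^H Phi^-1 v)."""
    phi_inv_v = solve(noise_cov, v)
    denom = sum(vi.conjugate() * p for vi, p in zip(v, phi_inv_v))
    return [p / denom for p in phi_inv_v]


def beamform(weights, x):
    """Spatial filtering of one snapshot: Y = w^H X."""
    return sum(w.conjugate() * xi for w, xi in zip(weights, x))
```

  • By construction the weights pass a signal arriving along the direction vector V with unit gain, while minimising the output power contributed by the noise described by the covariance.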
  • The device further includes a voice detection device 7 for detecting whether the voice signal includes a human voice signal and, if so, activating the sound source localization device 2, and for detecting whether the voice signal includes the noise signal and, if so, activating the noise covariance determination device 6.
  • Voice detection, also known as voice activity detection (VAD), is the process of identifying whether speech is present in the audio data. Its purpose is to detect whether the current voice signal contains a human voice signal, that is, to distinguish the human voice signal from the various background noise signals in the input so that the two kinds of signal can be processed differently.
  • the device provided in the embodiment of the present application finds the start point and the end point of a voice from a section of a signal including a voice through voice detection, so that the voice signal can be processed for voice transcription and voice enhancement. Effective endpoint detection not only reduces processing time, but also eliminates noise interference from silent sections.
  • VAD detection can be performed by calculating the energy of each frame of the speech signal.
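  • Energy-based detection can be sketched as follows. The frame length and the fixed threshold are assumptions for illustration; the application only states that the energy of each frame is calculated, and a practical detector would typically adapt the threshold to the background noise level.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(sample * sample for sample in frame) / len(frame)


def detect_voice(samples, frame_len, threshold):
    """Return one flag per complete frame: True if the frame energy exceeds threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy(frame) > threshold)
    return flags


# Silence, a loud segment, then low-level background noise.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01] * 4
flags = detect_voice(samples, frame_len=4, threshold=0.01)
```

  • Frames flagged False can feed the noise covariance estimate, while frames flagged True go on to sound source localization.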
  • FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device.
  • After a microphone array (such as a circular array or a square array) collects the audio, the multi-channel audio signals are sent to the sound source localization device, the noise covariance determination device, and the voice noise reduction device, and any one channel of the voice signal is sent separately to the voice detection device.
  • the position of the sound source is obtained by the sound source localization.
  • The voice noise reduction device performs voice enhancement on the directional sound source using the obtained sound source position information and noise covariance information; at the same time, the sound source position information is passed to the target sound source judgment to determine whether the current sound source is the target sound source.
  • The target voice filtering device filters the enhanced voice signal to obtain the voice signal of the target area; finally, the signal sending device sends the enhanced target voice signal, via a data collection device deployed at the sound source site (such as a conference or court site), to the cloud for voice transcription.
  • The voice transcription device collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, it determines a sound source position of the human voice signal; if the sound source position is within the target range, it uses the human voice signal as the target voice signal; and it sends the target voice signal outward so that the voice transcription server performs voice transcription on the target voice signal.
  • With this processing method, the microphone array provides multi-microphone enhancement of the voice signal in the pickup area, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it is not transmitted to the transcription server. The target voice can therefore be picked up reliably, resistance to interference from non-target voices is improved, and the quality of voice transcription improves accordingly.
  • a voice transcription device is provided.
  • this application also provides a voice transcription system.
  • FIG. 7 is a schematic diagram of an embodiment of a voice transcription system of the present application.
  • the present application further provides a voice transcription system, including: at least one voice transcription device 701 according to the above embodiment, and a voice transcription server 702.
  • the voice transcription server 702 is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device 701.
  • the voice transcription device 701 is usually deployed at a sound source site, such as a conference or a trial site.
  • The voice transcription device 701 collects a voice signal within the receiving range of the array through a microphone array; then, if the voice signal includes a human voice signal, it determines a sound source position of the human voice signal through the sound source positioning device; if the sound source position is within the target range, the target voice filtering device uses the human voice signal as the target voice signal; finally, the signal transmitting device sends the target voice signal outward, so that the voice transcription server 702 transcribes the target voice signal.
  • FIG. 8 is a schematic diagram of a usage scenario of an embodiment of a voice transcription system of the present application.
  • In this scenario, six microphone arrays and a data collection device are deployed on site. Each microphone array sends its own enhanced target voice signal to the data collection device, which sends it to the cloud for voice transcription and receives and displays the transcription result.
  • The voice transcription system collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, it determines a sound source position of the human voice signal; if the sound source position is within the target range, it uses the human voice signal as the target voice signal; and it sends the target voice signal outward so that the voice transcription server performs voice transcription on the target voice signal.
  • With this processing method, the microphone array provides multi-microphone enhancement of the voice signal in the pickup area, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it is not transmitted to the transcription server. The target voice can therefore be picked up reliably, resistance to interference from non-target voices is improved, and the quality of voice transcription improves accordingly.
  • a speech transcription system is provided.
  • this application also provides a speech transcription method. This method corresponds to the embodiment of the system described above.
  • FIG. 9 is a flowchart of an embodiment of a voice transcription method of the present application. Since the method embodiment is basically similar to the system embodiment, it is described relatively simply. For the relevant part, refer to the description of the system embodiment. The method embodiments described below are merely exemplary.
  • the present application further provides a voice transcription method, including:
  • Step S901: Acquire a voice signal in the receiving range of the array through the microphone array.
  • Step S903: If the voice signal includes a human voice signal, determine a sound source position of the human voice signal.
  • Step S905: If the sound source position is within a target range, use the human voice signal as a target voice signal.
  • Step S907: Send the target voice signal outward, so that a voice transcription server performs voice transcription of the target voice signal.
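  • The control flow of steps S903 to S907 for one frame already acquired in step S901 can be sketched as follows. The four callables are hypothetical placeholders for the detection, localization, range-check, and sending stages; nothing here is prescribed by the application beyond the order of the steps.

```python
from typing import Callable, Sequence

Frame = Sequence[Sequence[float]]  # one sample sequence per microphone (assumed layout)


def transcription_step(
    frame: Frame,
    has_voice: Callable[[Frame], bool],
    locate: Callable[[Frame], float],
    in_target_range: Callable[[float], bool],
    send: Callable[[Frame], None],
) -> bool:
    """Process one acquired frame; return True if it was sent for transcription."""
    if not has_voice(frame):
        return False  # no human voice in the frame: nothing to localize (S903)
    position = locate(frame)  # S903: determine the sound source position
    if not in_target_range(position):
        return False  # S905 not satisfied: the frame is shielded
    send(frame)  # S907: send the target voice signal outward
    return True
```

  • Passing the stages in as callables keeps the sketch independent of any particular localization or detection algorithm.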
  • The method provided in the embodiment of the present application may further include the following step: performing voice enhancement on the target voice signal according to the sound source position; correspondingly, step S907 is implemented as follows: the enhanced target voice signal is sent outward.
  • The method provided in the embodiment of the present application may further include the following steps: 1) if the voice signal includes a noise signal, determining a noise covariance of the voice signal; 2) suppressing the noise signal according to the noise covariance.
  • the method provided in the embodiment of the present application may further include the steps of: acquiring the target range, and storing the target range corresponding to the microphone array.
  • the method provided in the embodiment of the present application may further include the step of: if the sound source position is not within the target range, shielding the voice signal.
  • The method provided in the embodiment of the present application may further include the steps of: detecting whether the voice signal includes a human voice signal; and detecting whether the voice signal includes the noise signal.
  • The voice transcription method collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, determines a sound source position of the human voice signal; if the sound source position is within the target range, uses the human voice signal as the target voice signal; and sends the target voice signal outward so that the voice transcription server performs voice transcription on the target voice signal.
  • With this processing method, the microphone array provides multi-microphone enhancement of the voice signal in the pickup area, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it is not transmitted to the transcription server. The target voice can therefore be picked up reliably, resistance to interference from non-target voices is improved, and the quality of voice transcription improves accordingly.
  • a voice transcription method is provided.
  • The present application also provides an electronic device. This device corresponds to the embodiment of the method described above.
  • FIG. 10 is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment. The device embodiments described below are only schematic.
  • The electronic device in this embodiment includes a processor 1001 and a memory 1002. The memory is configured to store a program for implementing the voice transcription method; after the device is powered on and the processor runs the program, the following steps are performed: collecting voice signals in the receiving range of the array through the microphone array; if the voice signal includes a human voice signal, determining the sound source position of the human voice signal; if the sound source position is within the target range, using the human voice signal as the target voice signal; and sending the target voice signal outward, so that the voice transcription server performs voice transcription on the target voice signal.
  • a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory.
  • Memory may include non-persistent memory, random access memory (RAM), and / or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Abstract

A voice transcription device, system and method, and an electronic device. A voice transcription device (701) acquires a voice signal within an array receiving range by means of a microphone array (S901); if the voice signal comprises a tone signal, determines a sound source position of the tone signal (S903); if the sound source position is within a target range, uses the tone signal as a target tone signal (S905); and sends the target tone signal to a voice transcription server (702) so that the server performs voice transcription on the target tone signal (S907). By means of the processing, multi-microphone enhancement is performed on a tone signal in a pickup range on the basis of a microphone array, whether the tone is a target tone is determined according to the sound source position, and the sound beyond a target region is filtered so as to ensure that the sound beyond the region is not transmitted to a transcription server. Therefore, it can be effectively ensured that a target tone is picked up, so as to increase the anti-interference capability for a non-target tone, thereby improving voice transcription quality.

Description

Voice transcription device, system and method, and electronic device
This application claims priority to Chinese patent application No. 201811004661.6, filed on August 30, 2018 and entitled "Voice Transcription Device, System, Method, and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of voice signal processing, and in particular to a voice transcription device, system and method, and an electronic device.
Background
Voice transcription has been a major research focus in voice signal processing in recent years. As research has progressed, the technology has come into wide use in settings such as court hearings and multi-person meetings.
FIG. 1 is a schematic diagram of a common voice transcription setup. In this scheme, a gooseneck microphone is placed in front of each person to capture that person's audio. The captured audio is sent to an audio processing apparatus, which amplifies the original audio and forwards the amplified audio to a cloud transcription service, where voice transcription is performed.
However, in the course of implementing the present invention, the inventors found that this technical solution has at least the following problems:
1) Owing to the limitations of the gooseneck microphone itself, its effective pickup area is very small. When the user moves out of the effective area or is too far away, the user's voice is suppressed, so the volume fluctuates and transcription suffers.
2) Because the gooseneck microphone's ability to suppress unwanted sound is also limited, the voices of nearby people are easily captured as well. In a multi-person meeting, or when noise and playback are present at a court hearing, its anti-interference capability is poor and crosstalk appears in the transcript. In summary, the prior art suffers from failing to pick up the target speech and from external crosstalk interference.
Summary of the Invention
The present application provides a voice transcription device to solve the prior-art problems of failing to pick up the target speech and of external crosstalk interference. The present application further provides a voice transcription system and method, and an electronic device.
The present application provides a voice transcription device, comprising:
a voice acquisition device, configured to collect, through a microphone array, a voice signal within the array receiving range;
a sound source localization device, configured to determine the sound source position of a speech signal if the voice signal includes the speech signal;
a target speech filtering device, configured to use the speech signal as a target speech signal if the sound source position is within a target range; and
a signal sending device, configured to send the target speech signal outward, so that a voice transcription server performs voice transcription on the target speech signal.
Optionally, the device further comprises:
a voice noise reduction device, configured to perform speech enhancement on the target speech signal according to the sound source position;
wherein the signal sending device is specifically configured to send the enhanced target speech signal outward.
Optionally, the device further comprises:
a noise covariance determination device, configured to determine a noise covariance of the voice signal if the voice signal includes a noise signal;
wherein the voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.
Optionally, the device further comprises:
a target range configuration device, configured to acquire the target range and store the target range.
Optionally, the target speech filtering device is further configured to shield the speech signal if the sound source position is not within the target range.
Optionally, the microphone array is arranged as a square array or a circular array.
Optionally, the device further comprises:
a speech detection device, configured to detect whether the voice signal includes a speech signal and, if so, to activate the sound source localization device.
Optionally, the device further comprises:
a speech detection device, configured to detect whether the voice signal includes the noise signal and, if so, to activate the noise covariance determination device.
The present application further provides a voice transcription system, comprising:
the voice transcription device described above, and a voice transcription server, wherein the server is configured to perform voice transcription on the target speech signal uploaded by the voice transcription device.
The present application further provides a voice transcription method, comprising:
collecting, through a microphone array, a voice signal within the array receiving range;
if the voice signal includes a speech signal, determining the sound source position of the speech signal;
if the sound source position is within a target range, using the speech signal as a target speech signal; and
sending the target speech signal outward, so that a voice transcription server performs voice transcription on the target speech signal.
Optionally, the method further comprises:
performing speech enhancement on the target speech signal according to the sound source position;
wherein sending the target speech signal outward comprises:
sending the enhanced target speech signal outward.
Optionally, the method further comprises:
if the voice signal includes a noise signal, determining a noise covariance of the voice signal; and
suppressing the noise signal according to the noise covariance.
Optionally, the method further comprises:
acquiring the target range, and storing the target range in association with the microphone array.
Optionally, the method further comprises:
if the sound source position is not within the target range, shielding the speech signal.
Optionally, the method further comprises:
detecting whether the voice signal includes a speech signal; and detecting whether the voice signal includes the noise signal.
The present application further provides an electronic device, comprising:
a microphone array;
a processor; and
a memory, configured to store a program implementing the voice transcription method, wherein after the device is powered on and the program is run by the processor, the following steps are performed: collecting, through the microphone array, a voice signal within the array receiving range; if the voice signal includes a speech signal, determining the sound source position of the speech signal; if the sound source position is within a target range, using the speech signal as a target speech signal; and sending the target speech signal outward, so that a voice transcription server performs voice transcription on the target speech signal.
The present application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described above.
The present application further provides a computer program product comprising instructions that, when run on a computer, cause the computer to perform the methods described above.
Compared with the prior art, the present application has the following advantages:
The voice transcription device provided in the embodiments of the present application collects a voice signal within the array receiving range through a microphone array; if the voice signal includes a speech signal, determines the sound source position of the speech signal; if the sound source position is within a target range, uses the speech signal as a target speech signal; and sends the target speech signal to a voice transcription server, so that the server performs voice transcription on it. With this processing, speech signals within the pickup area receive multi-microphone enhancement based on the microphone array, the sound source position is used to decide whether a sound is target speech, and sounds from outside the target area are filtered out so that they are not passed to the transcription server. It can therefore effectively ensure that the target speech is picked up and improve the anti-interference capability against non-target speech, thereby improving voice transcription quality.
Brief Description of the Drawings
FIG. 1 is a diagram of a voice transcription scene in the prior art;
FIG. 2 is a schematic structural diagram of an embodiment of a voice transcription device provided by the present application;
FIG. 3a is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application;
FIG. 3b is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application;
FIG. 4 is a specific structural schematic diagram of an embodiment of a voice transcription device provided by the present application;
FIG. 5 is another specific structural schematic diagram of an embodiment of a voice transcription device provided by the present application;
FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device provided by the present application;
FIG. 7 is a system schematic diagram of an embodiment of a voice transcription system provided by the present application;
FIG. 8 is a schematic scenario diagram of an embodiment of a voice transcription system provided by the present application;
FIG. 9 is a specific flowchart of an embodiment of a voice transcription method provided by the present application;
FIG. 10 is a schematic diagram of an embodiment of an electronic device provided by the present application.
Detailed Description
Numerous specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the substance of the present application. The present application is therefore not limited by the specific implementations disclosed below.
The present application provides a voice transcription device, system and method, and an electronic device. Each is described in detail in the following embodiments.
First Embodiment
Please refer to FIG. 2, which is a schematic diagram of an embodiment of a voice transcription device provided by the present application. The device includes a voice acquisition device 1, a sound source localization device 2, a target speech filtering device 3, and a signal sending device 4.
The voice acquisition device 1 is configured to collect, through a microphone array, a voice signal within the array receiving range.
The microphone array includes a plurality of microphones, each of which is one array element.
A microphone (in Chinese, 传声器, also known as 麦克风, 话筒, or 微音器) is a transducer that converts a sound signal into an electrical signal. In one type of microphone, sound vibrations reach the diaphragm and drive the magnet inside to produce a varying current, which is passed to the downstream sound processing circuit for amplification.
The microphone array can pick up voice signals within its receiving range. This range, called the array receiving range, is the range from which the microphone array can receive voice signals. It depends on how the array elements are arranged and on how many there are.
The scale of the microphone array is closely related to the capture of speech and noise signals and also affects the accuracy of sound source localization. A microphone is a sound sensor that converts a sound signal into a voltage signal. When the sound source is far from the microphone, the microphone either captures no sound signal or produces only a very small voltage, so the signal-to-noise ratio is too low, which hampers estimation of the source direction. In addition, the larger the spacing between microphones, the larger the phase difference a source produces between them and the easier it is to resolve the source direction; when the spacing is small, spatial aliasing of the phase differences occurs and the resolution drops.
The arrangement of the microphone array can be adjusted flexibly according to actual needs. Array element arrangements include, but are not limited to, circular, square, and linear (single-row) layouts.
Please refer to FIG. 3a and FIG. 3b, which are schematic diagrams of the microphone array of an embodiment of the voice transcription device. FIG. 3a shows a square microphone array whose elements have identical characteristics and are equally spaced. FIG. 3b shows a circular microphone array whose elements have identical characteristics and are equally spaced around the circumference.
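The two layouts can be illustrated with a minimal sketch that generates element coordinates. The function names, the radius and spacing values, and the reading of the "square array" as a filled grid are assumptions made for illustration; the figures themselves are not reproduced here.

```python
import numpy as np


def circular_array(num_mics: int, radius: float) -> np.ndarray:
    """Return (num_mics, 2) x/y coordinates equally spaced on a circle."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.column_stack((np.cos(angles), np.sin(angles)))


def square_array(mics_per_side: int, spacing: float) -> np.ndarray:
    """Return (mics_per_side**2, 2) coordinates of an equally spaced grid.

    Assumption: the square array is a filled grid centred on the origin.
    """
    axis = (np.arange(mics_per_side) - (mics_per_side - 1) / 2) * spacing
    xs, ys = np.meshgrid(axis, axis)
    return np.column_stack((xs.ravel(), ys.ravel()))


circle = circular_array(6, radius=0.05)          # six elements, 5 cm radius
grid = square_array(2, spacing=0.04)             # 2 x 2 elements, 4 cm apart
```

Coordinates of this kind are what a localization or beamforming step would need in order to convert a candidate direction into per-microphone delays.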
The voice acquisition device 1 can use the microphone array to perform space-time sampling of voice signals within its receiving range against a noisy background, for example in conference venues, multimedia classrooms, large stages, video conferences, in-vehicle hands-free telephony, and battlefields.
The collected voice signal may contain only a speech signal, only a noise signal, or both a speech signal and a noise signal.
In one example, the voice acquisition device 1 includes three parts: 1) a microphone array; 2) a front-end amplification unit; and 3) a multi-channel synchronous sampling unit. The device works as follows. First, the microphone array collects the voice signal within the array receiving range and converts it into an analog electrical signal. The front-end amplification unit then amplifies the analog electrical signal. Finally, the multi-channel synchronous sampling unit samples the analog electrical signal and converts it into a digital electrical signal, so that all channels are sampled simultaneously.
The sound source localization device 2 is configured to determine the sound source position of a speech signal if the voice signal includes the speech signal.
Sound localization refers to a listener's use of sound stimuli in the environment to determine the direction and distance of a sound source. It depends on changes in the physical characteristics of the sound reaching the two ears, including differences in frequency, intensity, and duration.
The device provided in the embodiments of the present application locates the sound source from the multi-channel microphone signals: position information for a sound source can be obtained from the differences in the delays with which different sound sources reach the microphones.
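As an illustration of the delay differences mentioned above, the following sketch estimates the arrival delay between two microphone channels from the peak of their cross-correlation. This is a common textbook approach and is not stated in the source; the function name, the 16 kHz sampling rate, and the synthetic test signal are assumptions.

```python
import numpy as np


def estimate_delay(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: float) -> float:
    """Delay of sig_a relative to sig_b in seconds (positive: sig_a arrives later)."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    # In "full" mode, index len(sig_b) - 1 corresponds to zero lag.
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    return lag / sample_rate


# Synthetic check: the second channel is the first one delayed by 5 samples.
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(5), source[:-5]))
delay_s = estimate_delay(delayed, source, sample_rate=16000.0)
```

With several microphone pairs, delays like this constrain the direction of the source; the patent's own search is described in the next paragraph.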
In a specific implementation, the sound source position can be obtained by searching, for each time-frequency (TF) point, for the maximum delay-and-sum output and using the results to build a spatial map.
It should be noted that the sound source localization algorithm is not limited to this one; algorithms such as MUSIC, cics, and SRP-PHAT may also be used. Existing sound source localization algorithms fall roughly into three categories: a) algorithms based on time-delay estimation (TDE); b) algorithms based on high-resolution spectral estimation; and c) algorithms based on sparse representation. In a specific implementation, the algorithm can be chosen according to requirements.
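The delay-and-sum search can be sketched as follows. This is a minimal far-field, azimuth-only interpretation and not the patent's exact procedure: for every frequency bin of one frame it finds the candidate direction with the largest delay-and-sum power, then takes the direction with the most votes. The array geometry, the 5 degree search grid, the speed of sound, and all names are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_vectors(mic_xy, azimuths, freqs_hz):
    """Far-field steering vectors, shape (num_azimuths, num_freqs, num_mics)."""
    directions = np.column_stack((np.cos(azimuths), np.sin(azimuths)))
    # A microphone nearer to the source hears the wavefront earlier (negative delay).
    delays = -(directions @ mic_xy.T) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freqs_hz[None, :, None] * delays[:, None, :])


def locate_source(frame_spectrum, mic_xy, freqs_hz, azimuths):
    """frame_spectrum: (num_mics, num_freqs) complex spectrum of one frame."""
    steer = steering_vectors(mic_xy, azimuths, freqs_hz)
    # Delay-and-sum output for every candidate direction and frequency bin.
    summed = np.einsum("ckm,mk->ck", steer.conj(), frame_spectrum)
    best_per_bin = np.argmax(np.abs(summed) ** 2, axis=0)
    # Spatial map: how many time-frequency points voted for each direction.
    votes = np.bincount(best_per_bin, minlength=len(azimuths))
    return azimuths[int(np.argmax(votes))]


# Synthetic check: a plane wave arriving from 60 degrees on a 6-element circle.
element_angles = np.arange(6) * np.pi / 3
mics = 0.05 * np.column_stack((np.cos(element_angles), np.sin(element_angles)))
freqs = np.linspace(500.0, 3000.0, 26)
grid = np.deg2rad(np.arange(0, 360, 5))
spectrum = steering_vectors(mics, np.array([np.deg2rad(60.0)]), freqs)[0].T
estimate = locate_source(spectrum, mics, freqs, grid)
```

Voting per time-frequency point, rather than summing power over all bins, is one way to read the "spatial mapping" wording; summing the power across bins would be an equally plausible variant.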
The target speech filtering device 3 is configured to use the speech signal as the target speech signal if the sound source position is within a target range.
The target range is the spatial range in which the target sound source is located, and can be set by the user as required.
In one example, the device further includes a target range configuration device, configured to acquire the target range and save the range information in memory.
Specifically, the target speech filtering device 3 can determine from the sound source position information whether the source of the speech signal lies within the target range. If it does, the current voice signal is retained; if not, the current voice signal is shielded.
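The retain-or-shield decision can be sketched as below. The source only says that the target range is a user-configured spatial range, so the representation used here (an azimuth sector plus a maximum distance) and the names `TargetRange` and `filter_target` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TargetRange:
    """Hypothetical target range: an azimuth sector and a maximum distance."""
    start_deg: float
    end_deg: float
    max_distance_m: float

    def contains(self, azimuth_deg: float, distance_m: float) -> bool:
        # Work with offsets from the sector start so that sectors crossing
        # 0 degrees (for example 330 to 30) are handled correctly.
        width = (self.end_deg - self.start_deg) % 360.0
        offset = (azimuth_deg - self.start_deg) % 360.0
        return offset <= width and distance_m <= self.max_distance_m


def filter_target(frame: Any, azimuth_deg: float, distance_m: float,
                  target: TargetRange) -> Optional[Any]:
    """Return the frame if its source is in range, otherwise None (shielded)."""
    return frame if target.contains(azimuth_deg, distance_m) else None


seat = TargetRange(start_deg=330.0, end_deg=30.0, max_distance_m=1.5)
kept = filter_target("frame-1", azimuth_deg=350.0, distance_m=1.0, target=seat)
dropped = filter_target("frame-2", azimuth_deg=90.0, distance_m=1.0, target=seat)
```

In this sketch a sector whose start and end coincide is treated as a single direction, so a full-circle range would need its own representation.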
The signal sending device 4 is configured to send the target speech signal outward, so that a voice transcription server performs voice transcription on the target speech signal.
In one example, the enhanced target speech signal is sent to a cloud voice transcription server for transcription via a data collection device deployed at the sound source site (such as a meeting or court hearing).
Please refer to FIG. 4, which is a specific schematic diagram of an embodiment of the voice transcription device provided by the present application. In this embodiment, the device further includes a voice noise reduction device 5, configured to perform speech enhancement on the target speech signal according to the sound source position. Correspondingly, the signal sending device 4 is specifically configured to send the enhanced target speech signal to the voice transcription server. With this approach, the current direction vector is computed from the actual sound direction and the enhancement direction of the beam is adjusted in real time, so that the best enhancement effect can be achieved.
Please refer to FIG. 5, which is another specific schematic diagram of an embodiment of the voice transcription device provided by the present application. In this embodiment, the device further includes a noise covariance determination device 6.
The noise covariance determination device 6 is configured to determine the noise covariance of the voice signal if the voice signal includes a noise signal. Correspondingly, the voice noise reduction device 5 is further configured to suppress the noise signal according to the noise covariance.
The noise covariance determination device can compute the covariance between microphones from the noise audio data of non-speech segments. In a specific implementation, the noise covariance can be computed with the following formula:
$$\Phi_n = \sum X(n,k)\, X(n,k)^{H}$$
$$X(n,k) = [x_1(n,k),\ x_2(n,k),\ \ldots,\ x_M(n,k)]^{T}$$
where M is the number of array elements in the microphone array; n is the voice sampling instant; k is a frequency contained in the voice signal; X is the voice signal; $(\cdot)^{T}$ denotes the transpose and $(\cdot)^{H}$ the conjugate transpose. As the formulas show, X(n,k) is the vector formed by the signals of the M microphones at the TF point with time n and frequency bin k, and the noise covariance matrix $\Phi_n$ is obtained by multiplying this vector by its conjugate transpose.
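A minimal sketch of this computation for one frequency bin is shown below. It assumes the sum runs over the non-speech frames, which the formula does not state explicitly; the function name and the random test data are illustrative only.

```python
import numpy as np


def noise_covariance(noise_frames: np.ndarray) -> np.ndarray:
    """Sum of X X^H over noise frames for one frequency bin.

    noise_frames: complex array of shape (num_frames, num_mics), one row per
    non-speech frame (assumption: the sum in the formula runs over frames).
    """
    num_mics = noise_frames.shape[1]
    phi = np.zeros((num_mics, num_mics), dtype=complex)
    for x in noise_frames:
        phi += np.outer(x, x.conj())  # vector times its conjugate transpose
    return phi


rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 4)) + 1j * rng.standard_normal((50, 4))
phi_n = noise_covariance(frames)
```

Dividing the sum by the number of frames would give an average instead; for a beamformer such as MVDR the overall scale of the matrix does not change the resulting weights.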
In one example, the voice noise reduction device 5 uses beamforming to separate the target speech signal from a noisy background and enhance it, obtaining the enhanced target speech signal. For example, performing microphone noise reduction with an algorithm such as MVDR can achieve the best noise suppression for the current noise field and target sound source direction. The spatial filter coefficients can be computed with the following formula:
[Formula image PCTCN2019102482-appb-000001: spatial filter coefficients W]
where V is the sound propagation direction vector computed by sound source localization.
The spatial filtering formula is as follows:
$$Y(n,k) = W \cdot X(n,k)$$
where Y(n,k) is the output at the frequency point after beamforming.
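Because the coefficient formula is only available as an image, the sketch below uses the standard textbook MVDR solution, w = Φ⁻¹V / (Vᴴ Φ⁻¹ V), as an assumption about what the patent intends. The diagonal loading term, the function names, and the convention of applying the weights as wᴴX are also assumptions and may differ from the patent's notation.

```python
import numpy as np


def mvdr_weights(noise_cov: np.ndarray, steering: np.ndarray,
                 loading: float = 1e-6) -> np.ndarray:
    """Textbook MVDR weights for one frequency bin (assumed form)."""
    num_mics = len(steering)
    # Small diagonal loading keeps the matrix well conditioned (assumption).
    scale = np.trace(noise_cov).real / num_mics
    loaded = noise_cov + loading * scale * np.eye(num_mics)
    numerator = np.linalg.solve(loaded, steering)      # Phi^-1 V
    return numerator / (steering.conj() @ numerator)    # divide by V^H Phi^-1 V


def apply_beamformer(weights: np.ndarray, x: np.ndarray) -> complex:
    """Spatial filtering of one TF point: Y = w^H X."""
    return np.vdot(weights, x)


rng = np.random.default_rng(0)
noise = rng.standard_normal((200, 4)) + 1j * rng.standard_normal((200, 4))
phi = noise.T @ noise.conj()                            # noise covariance
v = np.exp(-1j * np.array([0.0, 0.3, 0.6, 0.9]))        # direction vector
w = mvdr_weights(phi, v)
target_gain = apply_beamformer(w, v)
```

The weights pass a signal arriving from the direction V with unit gain while minimizing the output power contributed by the noise field described by Φ, which matches the stated goal of adapting to the current noise field and target direction.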
In one example, the device further includes a speech detection device 7, configured to detect whether the voice signal includes a speech signal and, if so, to activate the sound source localization device 2; and to detect whether the voice signal includes the noise signal and, if so, to activate the noise covariance determination device 6.
Speech detection, also known as voice activity detection (VAD), is the process of identifying whether speech data bits are present. Its purpose is to detect whether the current voice signal contains a speech signal, that is, to examine the input signal, distinguish the speech signal from the various background noise signals, and apply different processing to each of the two kinds of signal.
Through speech detection, the device provided in the embodiments of the present application finds the start point and end point of speech in a segment of signal that contains it, so that voice transcription and speech enhancement can be applied to the speech signal. Effective endpoint detection not only reduces processing time but also excludes noise interference from silent segments.
In a specific implementation, VAD can be performed by computing the energy of each frame of the voice signal.
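A minimal sketch of energy-based VAD follows. The fixed threshold, the 10 ms frame length at an assumed 16 kHz sampling rate, and the function name are illustrative; the source does not specify how the energy is compared or smoothed.

```python
import numpy as np


def frame_energy_vad(signal: np.ndarray, frame_len: int, threshold: float) -> np.ndarray:
    """Return one flag per frame: True where the mean energy exceeds threshold."""
    num_frames = len(signal) // frame_len
    frames = np.reshape(signal[: num_frames * frame_len], (num_frames, frame_len))
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold


# Synthetic check: quiet frame, loud 440 Hz tone, quiet frame (160 samples each).
t = np.arange(160) / 16000.0
quiet = 0.01 * np.ones(160)
loud = np.sin(2 * np.pi * 440.0 * t)
flags = frame_energy_vad(np.concatenate((quiet, loud, quiet)), 160, threshold=0.01)
```

The first and last True flags in such a sequence give the start and end points of the speech segment described above.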
Please refer to FIG. 6, which is a data processing flowchart of an embodiment of the voice transcription device. In this embodiment, sound is first picked up by a microphone array (for example, a circular or square array) to obtain multi-channel microphone array signals. After multi-microphone data acquisition, the multi-channel audio signals are fed to the sound source localization device, the noise covariance determination device, and the voice noise reduction device, and any single channel is fed separately to the speech detection device. The VAD device detects whether a speech signal is currently present. If a speech signal is present (VAD = 1), the sound source localization device localizes the source and obtains its position; if the signal is noise, it is fed to the noise covariance determination device to estimate the noise covariance matrix. The voice noise reduction device uses the obtained sound source position and noise covariance to enhance the speech from the localized source. The sound source position is also passed through the target sound source decision, which determines whether the current source is the target source. Based on this decision, the target sound source filtering device filters the enhanced speech signal to obtain the speech signal of the target area. Finally, the signal sending device sends the enhanced target speech signal, through a data collection device deployed at the sound source site (such as a meeting or court hearing), to the cloud for voice transcription.
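The routing in this flow can be summarized with a short sketch. It only shows the order of the decisions; the callables passed in (`is_speech`, `locate`, `in_target`, `update_noise`, `enhance`) are hypothetical stand-ins for the devices in FIG. 6, and the stub values in the usage lines are made up.

```python
from typing import Any, Callable, Optional, Sequence


def process_frame(
    channels: Sequence[Any],
    is_speech: Callable[[Any], bool],
    locate: Callable[[Sequence[Any]], Any],
    in_target: Callable[[Any], bool],
    update_noise: Callable[[Sequence[Any]], None],
    enhance: Callable[[Sequence[Any], Any], Any],
) -> Optional[Any]:
    """Return the enhanced target speech for one frame, or None if nothing is sent."""
    # VAD runs on any single channel.
    if not is_speech(channels[0]):
        update_noise(channels)   # noise frame: refresh the noise covariance
        return None
    position = locate(channels)
    enhanced = enhance(channels, position)
    # Speech from outside the target range is shielded after enhancement.
    return enhanced if in_target(position) else None


noise_log = []
stubs = dict(
    locate=lambda ch: 10.0,
    in_target=lambda pos: pos < 30.0,
    update_noise=noise_log.append,
    enhance=lambda ch, pos: ("enhanced", pos),
)
sent = process_frame([[0.5], [0.4]], is_speech=lambda c: True, **stubs)
skipped = process_frame([[0.0], [0.0]], is_speech=lambda c: False, **stubs)
```

Enhancement is applied before the target decision here because the flow describes filtering the enhanced signal; checking the target range first would save computation and is an equally reasonable ordering.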
As the above embodiment shows, the voice transcription device provided by the embodiments of this application collects, through a microphone array, the audio signal within the array's receiving range; if the audio signal includes a voice signal, it determines the sound source position of the voice signal; if the sound source position is within the target range, it takes the voice signal as the target voice signal; and it sends the target voice signal out so that a voice transcription server transcribes it. With this approach, the voice signal in the pickup area receives multi-microphone enhancement based on the microphone array, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it does not reach the transcription server. This can effectively ensure that the target voice is picked up, improve resistance to interference from non-target voices, and thereby improve transcription quality.
The above embodiment provides a voice transcription device. Correspondingly, this application also provides a voice transcription system.
Second embodiment
Please refer to FIG. 7, which is a flowchart of an embodiment of the voice transcription system of this application. This application further provides a voice transcription system, including at least one voice transcription device 701 according to the above embodiment, and a voice transcription server 702.
The voice transcription server 702 is configured to transcribe the target voice signal uploaded by the voice transcription device 701.
The voice transcription device 701 is usually deployed at the sound source site, such as a conference or a court hearing. The voice transcription device 701 collects, through a microphone array, the audio signal within the array's receiving range. If the audio signal includes a voice signal, the sound source localization device determines the sound source position of the voice signal. If the sound source position is within the target range, the target voice filtering device takes the voice signal as the target voice signal. Finally, the signal sending device sends the target voice signal out so that the voice transcription server 702 transcribes it.
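The application does not state which localization algorithm the sound source localization device uses. As one common possibility, and purely as an assumption for illustration, a direction of arrival can be estimated from the time difference of arrival between two microphones. The function names, the far-field model, and the default speed of sound below are all hypothetical choices.

```python
import math
from typing import List


def estimate_delay(ref: List[float], other: List[float], max_lag: int) -> int:
    """Lag in samples of `other` relative to `ref` that maximizes cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    n = len(ref)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                score += ref[t] * other[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples: int, sample_rate: float,
                      mic_spacing_m: float,
                      speed_of_sound: float = 343.0) -> float:
    """Far-field angle from broadside, in degrees, for a two-microphone pair."""
    ratio = delay_samples * speed_of_sound / (sample_rate * mic_spacing_m)
    ratio = max(-1.0, min(1.0, ratio))  # clamp so asin stays defined
    return math.degrees(math.asin(ratio))
```

A circular or square array would combine several such pairs, or use a different method altogether, to resolve a full 360-degree position; this two-microphone case only shows the principle.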
Please refer to FIG. 8, which is a schematic diagram of a usage scenario of an embodiment of the voice transcription system of this application. In this embodiment, six microphone arrays and a data collection device are deployed on site. Each microphone array sends its own target sound source signal to the data collection device, which forwards the enhanced target voice signal to the cloud for voice transcription and receives and displays the transcription result.
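The relay role of the data collection device in this scenario can be sketched as below. The class names, the `transcribe` callback standing in for the cloud service, and the per-array result store are hypothetical; the application does not describe the device's interface or the transport it uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TargetSignal:
    array_id: str         # which on-site microphone array produced the signal
    samples: List[float]  # enhanced target voice samples


@dataclass
class DataCollector:
    """Hypothetical on-site device relaying array output to a transcription service."""
    transcribe: Callable[[TargetSignal], str]  # stand-in for the cloud call
    results: Dict[str, List[str]] = field(default_factory=dict)

    def receive(self, signal: TargetSignal) -> str:
        """Forward one target signal and keep the returned text for display."""
        text = self.transcribe(signal)
        self.results.setdefault(signal.array_id, []).append(text)
        return text
```

Keeping results keyed by array makes it straightforward to display each array's transcript separately, for example one per seat or speaker position.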
As the above embodiment shows, the voice transcription system provided by the embodiments of this application collects, through a microphone array, the audio signal within the array's receiving range; if the audio signal includes a voice signal, it determines the sound source position of the voice signal; if the sound source position is within the target range, it takes the voice signal as the target voice signal; and it sends the target voice signal out so that a voice transcription server transcribes it. With this approach, the voice signal in the pickup area receives multi-microphone enhancement based on the microphone array, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it does not reach the transcription server. This can effectively ensure that the target voice is picked up, improve resistance to interference from non-target voices, and thereby improve transcription quality.
The above embodiment provides a voice transcription system. Correspondingly, this application also provides a voice transcription method, which corresponds to the system embodiment described above.
Third embodiment
Please refer to FIG. 9, which is a flowchart of an embodiment of the voice transcription method of this application. Because the method embodiment is basically similar to the system embodiment, it is described relatively briefly; for related details, refer to the corresponding parts of the system embodiment. The method embodiment described below is merely illustrative.
This application further provides a voice transcription method, including:
Step S901: collect, through a microphone array, the audio signal within the receiving range of the array.
Step S903: if the audio signal includes a voice signal, determine the sound source position of the voice signal.
Step S905: if the sound source position is within a target range, take the voice signal as the target voice signal.
Step S907: send the target voice signal out, so that a voice transcription server transcribes the target voice signal.
In one example, the method provided by the embodiments of this application may further include the following step: performing speech enhancement on the target voice signal according to the sound source position. Correspondingly, step S907 is implemented by sending out the enhanced target voice signal.
In one example, the method provided by the embodiments of this application may further include the following steps: 1) if the audio signal includes a noise signal, determining the noise covariance of the audio signal; 2) suppressing the noise signal according to the noise covariance.
In one example, the method provided by the embodiments of this application may further include the following step: acquiring the target range and storing it in association with the microphone array.
In one example, the method provided by the embodiments of this application may further include the following step: if the sound source position is not within the target range, shielding the voice signal.
In one example, the method provided by the embodiments of this application may further include the following steps: detecting whether the audio signal includes a voice signal; and detecting whether the audio signal includes the noise signal.
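Steps S903 to S907, together with the optional target range storage and shielding steps, can be sketched as follows. The representation of the target range as an angular sector per microphone array is an assumption, as are the names `TARGET_RANGES`, `set_target_range`, `in_target_range`, and `process_frame`; the application leaves the form of the target range open.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Target range stored per microphone array as a (start_deg, end_deg) sector.
# Both bounds are expected to lie in [0, 360].
TARGET_RANGES: Dict[str, Tuple[float, float]] = {}


def set_target_range(array_id: str, start_deg: float, end_deg: float) -> None:
    """Acquire a target range and store it in association with an array."""
    TARGET_RANGES[array_id] = (start_deg, end_deg)


def in_target_range(array_id: str, angle_deg: float) -> bool:
    """True if the source angle falls inside the array's stored sector."""
    start, end = TARGET_RANGES[array_id]
    angle = angle_deg % 360.0
    if start <= end:
        return start <= angle <= end
    return angle >= start or angle <= end  # sector wraps through 0 degrees


def process_frame(array_id: str, samples: List[float], has_voice: bool,
                  source_angle_deg: Optional[float],
                  send: Callable[[List[float]], None]) -> bool:
    """Apply S903 to S907 to one captured frame; True if the frame was sent."""
    if not has_voice or source_angle_deg is None:
        return False  # S903: no voice signal, nothing to localize
    if not in_target_range(array_id, source_angle_deg):
        return False  # source outside the target range: shield the signal
    send(samples)     # S905 and S907: treat as target voice and send it out
    return True
```

For example, after `set_target_range("a", 330.0, 30.0)`, a source at 10 degrees is sent on, while a source at 90 degrees is shielded and never reaches the transcription server.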
As the above embodiment shows, the voice transcription method provided by the embodiments of this application collects, through a microphone array, the audio signal within the array's receiving range; if the audio signal includes a voice signal, it determines the sound source position of the voice signal; if the sound source position is within the target range, it takes the voice signal as the target voice signal; and it sends the target voice signal out so that a voice transcription server transcribes it. With this approach, the voice signal in the pickup area receives multi-microphone enhancement based on the microphone array, the sound source position is used to judge whether the voice is the target voice, and sound from outside the target area is filtered out so that it does not reach the transcription server. This can effectively ensure that the target voice is picked up, improve resistance to interference from non-target voices, and thereby improve transcription quality.
The above embodiment provides a voice transcription method. Correspondingly, this application also provides a voice transcription apparatus, which corresponds to the method embodiment described above.
Fourth embodiment
Please refer to FIG. 10, which is a schematic diagram of an embodiment of the electronic device of this application. Because the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding parts of the method embodiment. The device embodiment described below is merely illustrative.
The electronic device of this embodiment includes a processor 1001 and a memory 1002. The memory is configured to store a program implementing the voice transcription method. After the device is powered on and runs the program through the processor, it performs the following steps: collecting, through the microphone array, the audio signal within the receiving range of the array; if the audio signal includes a voice signal, determining the sound source position of the voice signal; if the sound source position is within the target range, taking the voice signal as the target voice signal; and sending the target voice signal out, so that a voice transcription server transcribes the target voice signal.
Although this application is disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of this application; the scope of protection of this application shall therefore be the scope defined by its claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.

Claims (16)

  1. A voice transcription device, comprising:
    a voice acquisition device, configured to collect, through a microphone array, an audio signal within the receiving range of the array;
    a sound source localization device, configured to determine a sound source position of a voice signal if the audio signal includes the voice signal;
    a target voice filtering device, configured to take the voice signal as a target voice signal if the sound source position is within a target range; and
    a signal sending device, configured to send the target voice signal out, so that a voice transcription server performs voice transcription on the target voice signal.
  2. The device according to claim 1, further comprising:
    a voice noise reduction device, configured to perform speech enhancement on the target voice signal according to the sound source position;
    wherein the signal sending device is specifically configured to send out the enhanced target voice signal.
  3. The device according to claim 2, further comprising:
    a noise covariance determination device, configured to determine a noise covariance of the audio signal if the audio signal includes a noise signal;
    wherein the voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.
  4. The device according to claim 1, further comprising:
    a target range configuration device, configured to acquire the target range and store the target range.
  5. The device according to claim 1, wherein:
    the target voice filtering device is further configured to shield the voice signal if the sound source position is not within the target range.
  6. The device according to claim 1, wherein:
    the arrangement of the microphone array includes a square array or a circular array.
  7. The device according to claim 1, further comprising:
    a voice detection device, configured to detect whether the audio signal includes a voice signal and, if so, to activate the sound source localization device.
  8. The device according to claim 2, further comprising:
    a voice detection device, configured to detect whether the audio signal includes a noise signal and, if so, to activate a noise covariance determination device.
  9. A voice transcription system, comprising:
    the voice transcription device according to any one of claims 1-8, and a voice transcription server, wherein the server is configured to perform voice transcription on the target voice signal uploaded by the voice transcription device.
  10. A voice transcription method, comprising:
    collecting, through a microphone array, an audio signal within the receiving range of the array;
    if the audio signal includes a voice signal, determining a sound source position of the voice signal;
    if the sound source position is within a target range, taking the voice signal as a target voice signal; and
    sending the target voice signal out, so that a voice transcription server performs voice transcription on the target voice signal.
  11. The method according to claim 10, further comprising:
    performing speech enhancement on the target voice signal according to the sound source position;
    wherein sending the target voice signal out comprises:
    sending out the enhanced target voice signal.
  12. The method according to claim 11, further comprising:
    if the audio signal includes a noise signal, determining a noise covariance of the audio signal; and
    suppressing the noise signal according to the noise covariance.
  13. The method according to claim 11, further comprising:
    acquiring the target range, and storing the target range in association with the microphone array.
  14. The method according to claim 11, further comprising:
    if the sound source position is not within the target range, shielding the voice signal.
  15. The method according to claim 12, further comprising:
    detecting whether the audio signal includes a voice signal; and detecting whether the audio signal includes the noise signal.
  16. An electronic device, comprising:
    a microphone array;
    a processor; and
    a memory, configured to store a program implementing a voice transcription method, wherein after the device is powered on and runs the program through the processor, the following steps are performed: collecting, through the microphone array, an audio signal within the receiving range of the array; if the audio signal includes a voice signal, determining a sound source position of the voice signal; if the sound source position is within a target range, taking the voice signal as a target voice signal; and sending the target voice signal out, so that a voice transcription server performs voice transcription on the target voice signal.
PCT/CN2019/102482 2018-08-30 2019-08-26 Voice transcription device, system and method, and electronic device WO2020043037A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811004661.6A CN110875056B (en) 2018-08-30 2018-08-30 Speech transcription device, system, method and electronic device
CN201811004661.6 2018-08-30

Publications (1)

Publication Number Publication Date
WO2020043037A1 (en) 2020-03-05

Family

ID=69643925

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102482 WO2020043037A1 (en) 2018-08-30 2019-08-26 Voice transcription device, system and method, and electronic device

Country Status (2)

Country Link
CN (1) CN110875056B (en)
WO (1) WO2020043037A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4178224A4 (en) * 2020-07-16 2024-01-10 Huawei Tech Co Ltd Conference voice enhancement method, apparatus and system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516989A (en) * 2020-03-27 2021-10-19 浙江宇视科技有限公司 Sound source audio management method, device, equipment and storage medium
CN112750455A (en) * 2020-12-29 2021-05-04 苏州思必驰信息科技有限公司 Audio processing method and device
CN113345462B (en) * 2021-05-17 2023-12-29 浪潮金融信息技术有限公司 Pickup denoising method, system and medium
CN115482828A (en) * 2021-06-15 2022-12-16 华为技术有限公司 Sound signal processing method and device, and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150379990A1 (en) * 2014-06-30 2015-12-31 Rajeev Conrad Nongpiur Detection and enhancement of multiple speech sources
CN105335336A (en) * 2015-10-12 2016-02-17 中国人民解放军国防科学技术大学 Sensor array steady adaptive beamforming method
CN107316649A (en) * 2017-05-15 2017-11-03 百度在线网络技术(北京)有限公司 Audio recognition method and device based on artificial intelligence
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA
CN108269582A (en) * 2018-01-24 2018-07-10 厦门美图之家科技有限公司 A kind of orientation sound pick-up method and computing device based on two-microphone array

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442833B2 (en) * 2009-02-17 2013-05-14 Sony Computer Entertainment Inc. Speech processing with source location estimation using signals from two or more microphones
EP2498250B1 (en) * 2011-03-07 2021-05-05 Accenture Global Services Limited Client and server system for natural language-based control of a digital network of devices
TW201316328A (en) * 2011-10-14 2013-04-16 Hon Hai Prec Ind Co Ltd Sound feedback device and work method thereof
US9689959B2 (en) * 2011-10-17 2017-06-27 Foundation de l'Institut de Recherche Idiap Method, apparatus and computer program product for determining the location of a plurality of speech sources
JP2014240940A (en) * 2013-06-12 2014-12-25 株式会社東芝 Dictation support device, method and program
US20160071526A1 (en) * 2014-09-09 2016-03-10 Analog Devices, Inc. Acoustic source tracking and selection
US11076052B2 (en) * 2015-02-03 2021-07-27 Dolby Laboratories Licensing Corporation Selective conference digest
CN106297794A (en) * 2015-05-22 2017-01-04 西安中兴新软件有限责任公司 The conversion method of a kind of language and characters and equipment
US20170243582A1 (en) * 2016-02-19 2017-08-24 Microsoft Technology Licensing, Llc Hearing assistance with automated speech transcription
CN106782596A (en) * 2016-11-18 2017-05-31 深圳市行者机器人技术有限公司 A kind of auditory localization system for tracking and method based on microphone array
CN107172018A (en) * 2017-04-27 2017-09-15 华南理工大学 The vocal print cryptosecurity control method and system of activation type under common background noise
CN107527626A (en) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 Audio identification system
CN107742522B (en) * 2017-10-23 2022-01-14 科大讯飞股份有限公司 Target voice obtaining method and device based on microphone array

Also Published As

Publication number Publication date
CN110875056A (en) 2020-03-10
CN110875056B (en) 2024-04-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19854563

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19854563

Country of ref document: EP

Kind code of ref document: A1