WO2020043037A1 - Voice transcription device, system and method, and electronic device - Google Patents

Voice transcription device, system and method, and electronic device

Info

Publication number
WO2020043037A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voice signal
target
signal
transcription
Prior art date
Application number
PCT/CN2019/102482
Other languages
English (en)
Chinese (zh)
Inventor
余涛
许云峰
刘章
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020043037A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording

Definitions

  • The present application relates to the technical field of voice signal processing and, in particular, to a voice transcription device, system, and method, and an electronic device.
  • Speech transcription technology has been a hot research topic in the field of speech signal processing in recent years. With the continuous deepening of research, this technology has been widely used in trial venues and multi-person conferences.
  • Figure 1 shows a common speech transcription scene.
  • In this solution, a gooseneck microphone is placed in front of each person.
  • Each gooseneck microphone collects its speaker's audio and transmits it to an audio processing device, which amplifies the collected original audio.
  • The amplified audio is then sent to a transcription cloud service, which performs speech transcription on it.
  • The present application provides a voice transcription device to solve the problems in the prior art that the target voice cannot be picked up and that external crosstalk interference exists.
  • The application additionally provides a voice transcription system and method, and an electronic device.
  • This application provides a voice transcription device, including:
  • a voice acquisition device configured to acquire, through a microphone array, a voice signal within the receiving range of the array;
  • a sound source positioning device configured to determine a sound source position of the voice signal if the voice signal contains speech (rather than only noise);
  • a target voice filtering device configured to use the voice signal as a target voice signal if the sound source position is within a target range; and
  • a signal sending device configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • Optionally, the device further includes:
  • a voice noise reduction device configured to perform voice enhancement on the target voice signal according to the position of the sound source;
  • in which case the signal sending device is specifically configured to send the enhanced target voice signal outward.
  • Optionally, the device further includes:
  • a noise covariance determining device configured to determine a noise covariance of the voice signal if the voice signal includes a noise signal;
  • in which case the voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.
  • Optionally, the device further includes:
  • a target range configuration device configured to acquire the target range and store it.
  • Optionally, the target voice filtering device is further configured to shield the voice signal if the sound source position is not within the target range.
  • Optionally, the arrangement of the microphone array includes a square array or a circular array.
  • Optionally, the device further includes:
  • a voice detection device configured to detect whether the voice signal contains speech and, if so, to activate the sound source positioning device.
  • Optionally, the voice detection device is further configured to detect whether the voice signal includes the noise signal and, if so, to activate the noise covariance determining device.
  • This application also provides a voice transcription system, including:
  • the above-mentioned voice transcription device and a voice transcription server, wherein the server is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device.
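  • This application also provides a voice transcription method, including:
  • acquiring a voice signal within the receiving range of the array through a microphone array;
  • if the voice signal contains speech, determining a sound source position of the voice signal;
  • if the sound source position is within a target range, using the voice signal as a target voice signal; and
  • sending the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • Optionally, the method further includes performing voice enhancement on the target voice signal according to the sound source position, in which case sending the target voice signal outward includes sending the enhanced target voice signal outward.
  • Optionally, the method further includes: if the voice signal includes a noise signal, determining a noise covariance of the voice signal, and suppressing the noise signal according to the noise covariance.
  • Optionally, the method further includes acquiring the target range and storing it.
  • Optionally, the method further includes: if the sound source position is not within the target range, the voice signal is shielded.
  • Optionally, the method further includes detecting whether the voice signal contains speech, and detecting whether the voice signal includes the noise signal.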
  • This application also provides an electronic device, including:
  • a processor; and
  • a memory for storing a program that implements the voice transcription method. After the device is powered on and runs the program through the processor, the following steps are performed: a voice signal within the receiving range of the array is collected through the microphone array; if the voice signal contains speech, a sound source position of the voice signal is determined; if the sound source position is within a target range, the voice signal is used as a target voice signal; and the target voice signal is sent outward, so that the voice transcription server performs voice transcription on the target voice signal.
  • The present application also provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the methods described above.
  • The present application also provides a computer program product including instructions that, when run on a computer, cause the computer to perform the methods described above.
  • The voice transcription device collects a voice signal within the receiving range of the array through a microphone array; if the voice signal contains speech, it determines the sound source position of the voice signal; if the sound source position is within the target range, the voice signal is used as the target voice signal; and the target voice signal is sent to a voice transcription server, which transcribes it. This processing applies microphone-array-based multi-microphone enhancement to voice signals in the pickup area, decides from the sound source position whether a signal is the target voice, and filters out sound from outside the target area so that it does not reach the transcription server. It can therefore effectively ensure that the target speech is picked up, improve resistance to interference from non-target speech, and improve the quality of speech transcription.
  • FIG. 1 is a diagram of a voice transcription scene in the prior art.
  • FIG. 2 is a schematic structural diagram of an embodiment of a voice transcription device provided by the present application.
  • FIG. 3a is a schematic diagram of a (square) microphone array of an embodiment of a voice transcription device provided by the present application.
  • FIG. 3b is a schematic diagram of a (circular) microphone array of an embodiment of a voice transcription device provided by the present application.
  • FIG. 4 is a specific structural schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • FIG. 5 is a schematic diagram of another specific structure of an embodiment of a voice transcription device provided by the present application.
  • FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device provided by the present application.
  • FIG. 7 is a system schematic diagram of an embodiment of a voice transcription system provided by the present application.
  • FIG. 8 is a schematic scenario diagram of an embodiment of a speech transcription system provided by the present application.
  • FIG. 9 is a specific flowchart of an embodiment of a voice transcription method provided by the present application.
  • FIG. 10 is a schematic diagram of an embodiment of an electronic device provided by the present application.
  • FIG. 2 is a schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • The device includes a voice acquisition device 1, a sound source localization device 2, a target voice filtering device 3, and a signal transmission device 4.
  • The voice acquisition device 1 is configured to acquire a voice signal within the receiving range of the array through a microphone array.
  • The microphone array includes a plurality of microphones, each of which is an array element.
  • A microphone is a transducer that converts a sound signal into an electrical signal.
  • Sound vibrations are transmitted to the microphone's diaphragm, which pushes the magnet inside to produce a varying current; this current is sent to the subsequent sound processing circuit for amplification.
  • The microphone array can pick up voice signals within its receiving range.
  • This range, called the array receiving range, is the range within which the microphone array can receive voice signals.
  • The receiving range of the array depends on the arrangement and the number of the array elements.
  • The size of the microphone array is closely related to the collection of speech and noise signals and also affects the accuracy of sound source localization.
  • A microphone is a sound sensor that converts sound signals into voltage signals. When the sound source is far from the microphone, the microphone either cannot collect the sound signal or produces only a very small voltage signal, so the signal-to-noise ratio is too low to estimate the orientation of the sound source well.
  • The larger the spacing between microphones, the larger the phase difference between microphones for a given sound source, and the easier it is to estimate the orientation of the source; however, once the spacing exceeds half the wavelength, the phase difference becomes ambiguous (spatial aliasing). The smaller the spacing, the smaller the phase difference and the lower the angular resolution.
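  • The spacing trade-off can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a speed of sound of 343 m/s and the usual half-wavelength rule for avoiding spatial aliasing, neither of which is stated in the source, and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 °C


def max_spacing_without_aliasing(max_freq_hz: float,
                                 c: float = SPEED_OF_SOUND_M_S) -> float:
    """Largest element spacing (metres) that keeps the inter-microphone
    phase difference unambiguous up to max_freq_hz (half-wavelength rule)."""
    if max_freq_hz <= 0:
        raise ValueError("max_freq_hz must be positive")
    return c / (2.0 * max_freq_hz)


def max_phase_difference(spacing_m: float, freq_hz: float,
                         c: float = SPEED_OF_SOUND_M_S) -> float:
    """Largest phase difference (radians) between two microphones,
    reached when the sound arrives along the line joining them."""
    return 2.0 * math.pi * freq_hz * spacing_m / c


if __name__ == "__main__":
    d = max_spacing_without_aliasing(4000.0)
    print(f"spacing limit at 4 kHz: {d * 100:.2f} cm")
    print(f"phase difference at that spacing: {max_phase_difference(d, 4000.0):.3f} rad")
```

  • At the limiting spacing the phase difference reaches π radians; any larger spacing lets two different directions produce the same wrapped phase.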
  • The arrangement of the microphone array can be adjusted flexibly according to actual needs.
  • The arrangement of the array elements includes, but is not limited to, circular, square, and linear shapes.
  • FIG. 3a and FIG. 3b are schematic diagrams of the microphone array of an embodiment of the voice transcription device.
  • FIG. 3a shows a square microphone array, in which the array elements are identical and equally spaced.
  • FIG. 3b shows a circular microphone array, in which the array elements are identical and arranged at equal intervals on the circumference.
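  • As an illustration of the two layouts, the following sketch generates element coordinates for a circular array and for a square array. The source does not give dimensions or say whether the square array is a filled grid or only a perimeter; a filled grid is assumed here, and all names and values are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def circular_array(num_mics: int, radius_m: float) -> List[Point]:
    """Identical elements at equal angular intervals on a circle."""
    step = 2.0 * math.pi / num_mics
    return [(radius_m * math.cos(i * step), radius_m * math.sin(i * step))
            for i in range(num_mics)]


def square_grid_array(mics_per_side: int, spacing_m: float) -> List[Point]:
    """Equally spaced elements on a square grid centred on the origin
    (assumption: the grid is filled, not just its perimeter)."""
    offset = (mics_per_side - 1) * spacing_m / 2.0
    return [(col * spacing_m - offset, row * spacing_m - offset)
            for row in range(mics_per_side)
            for col in range(mics_per_side)]


if __name__ == "__main__":
    print(circular_array(6, 0.05))
    print(square_grid_array(2, 0.04))
```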
  • The voice acquisition device 1 can use the microphone array to perform space-time sampling of voice signals within its receiving range against noisy backgrounds, such as conference venues, multimedia classrooms, large stages, video conferences, hands-free car phones, and battlefields.
  • The collected voice signal may contain only speech, only noise, or both speech and noise.
  • The voice acquisition device 1 includes three parts: 1) a microphone array; 2) a front-end amplification unit; 3) a multi-channel synchronous sampling unit.
  • The microphone array collects the voice signals within the receiving range of the array and converts them into analog electrical signals; the front-end amplification unit amplifies these analog signals; and the multi-channel synchronous sampling unit samples them synchronously across channels and converts them into digital electrical signals.
  • The sound source positioning device 2 is configured to determine a sound source position of the voice signal if the voice signal contains speech.
  • Sound localization refers to a listener's determination of the direction and distance of a sound source from sound stimuli in the environment. It depends on the physical characteristics of the sound reaching the ears, including differences in frequency, intensity, and duration.
  • The device provided in the embodiment of the present application locates the sound source from the signals of the multiple microphone channels: position information can be obtained from the differences in the delays with which sound from a source reaches the different microphones.
  • For example, the delay-and-sum response at each time-frequency (TF) point can be searched for its maximum to obtain spatial mapping information and hence the position of the sound source.
  • The sound source localization algorithm is not limited to this one; other algorithms, such as MUSIC, cics, and SRP-PHAT, may also be used.
  • Existing sound source localization algorithms can be roughly divided into three categories: a) algorithms based on time-delay estimation (TDE); b) algorithms based on high-resolution spectral estimation; c) algorithms based on sparse representation.
  • A sound source localization algorithm can be selected according to requirements.
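  • A minimal sketch of a delay-and-sum search follows. It is not the patent's implementation: it assumes a far-field (plane-wave) source, a two-dimensional array, a single STFT frame, and a fixed grid of candidate azimuths, and it sums power over frequency bins rather than treating each TF point separately. The function names are hypothetical.

```python
import numpy as np


def steering_vectors(mic_xy, freqs_hz, azimuths_rad, c=343.0):
    """Far-field steering vectors, shape (num_angles, num_freqs, num_mics)."""
    mic_xy = np.asarray(mic_xy, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    azimuths_rad = np.asarray(azimuths_rad, dtype=float)
    # Unit vectors pointing from the array centre towards each candidate source.
    directions = np.stack([np.cos(azimuths_rad), np.sin(azimuths_rad)], axis=1)
    # How much earlier (seconds) each microphone hears the wavefront.
    advance_s = directions @ mic_xy.T / c
    return np.exp(2j * np.pi * freqs_hz[None, :, None] * advance_s[:, None, :])


def localize_azimuth(frame_spectrum, mic_xy, freqs_hz, azimuths_rad, c=343.0):
    """Return the candidate azimuth whose delay-and-sum output power is largest.

    frame_spectrum: complex array of shape (num_freqs, num_mics).
    """
    steer = steering_vectors(mic_xy, freqs_hz, azimuths_rad, c)
    # Align the channels for every candidate direction, then sum over mics.
    beam = np.einsum("afm,fm->af", steer.conj(), np.asarray(frame_spectrum))
    power = np.sum(np.abs(beam) ** 2, axis=1)
    return float(np.asarray(azimuths_rad)[int(np.argmax(power))])


if __name__ == "__main__":
    mics = [(0.04 * np.cos(2 * np.pi * i / 6), 0.04 * np.sin(2 * np.pi * i / 6))
            for i in range(6)]
    freqs = np.array([500.0, 1000.0, 1500.0, 2000.0])
    grid = np.deg2rad(np.arange(0.0, 360.0, 5.0))
    # Synthetic noiseless frame from a source at 60 degrees.
    frame = steering_vectors(mics, freqs, [np.deg2rad(60.0)])[0]
    print(np.rad2deg(localize_azimuth(frame, mics, freqs, grid)))
```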
  • The target voice filtering device 3 is configured to use the voice signal as a target voice signal if the sound source position is within a target range.
  • The target range is the spatial range in which a target sound source is located, and can be set by a user as required.
  • The device further includes a target range configuration device, configured to acquire the target range and save the range information in a memory.
  • Specifically, the target voice filtering device 3 determines from the sound source position information whether the sound source of the voice signal is within the target range; if so, the current voice signal is retained, and if not, it is shielded.
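  • The retain-or-shield decision can be sketched as follows. The source does not say how the target range is represented; this example assumes an azimuth sector in degrees that may wrap through 0°, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AzimuthSector:
    """Target range modelled as an azimuth sector, measured counter-clockwise
    from start_deg to end_deg (assumption: the range is purely angular)."""
    start_deg: float
    end_deg: float

    def contains(self, azimuth_deg: float) -> bool:
        span = (self.end_deg - self.start_deg) % 360.0
        offset = (azimuth_deg - self.start_deg) % 360.0
        return offset <= span


def filter_target(frame: Sequence[float], azimuth_deg: float,
                  sector: AzimuthSector) -> Optional[Sequence[float]]:
    """Retain the frame if its source lies in the sector; otherwise shield it."""
    return frame if sector.contains(azimuth_deg) else None


if __name__ == "__main__":
    front = AzimuthSector(330.0, 30.0)  # 60-degree sector centred on 0 degrees
    print(filter_target([0.1, 0.2], 10.0, front))   # retained
    print(filter_target([0.1, 0.2], 180.0, front))  # shielded
```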
  • The signal transmitting device 4 is configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
  • The enhanced target voice signal is sent, via a data collection device deployed at the sound source site (such as a conference or trial site), to a cloud voice transcription server for voice transcription.
  • FIG. 4 is a specific schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • The device further includes a voice noise reduction device 5 configured to perform voice enhancement on the target voice signal according to the position of the sound source; correspondingly, the signal transmission device 4 is specifically configured to send the enhanced target voice signal to the voice transcription server.
  • The current direction vector is calculated from the actual sound direction, and the direction of beam enhancement is adjusted in real time to achieve the best enhancement effect.
  • FIG. 5 is another specific schematic diagram of an embodiment of a voice transcription device provided by the present application.
  • The device further includes a noise covariance determination device 6.
  • The noise covariance determination device 6 is configured to determine the noise covariance of the voice signal if the voice signal includes a noise signal; correspondingly, the voice noise reduction device 5 is further configured to suppress the noise signal according to the noise covariance.
  • The noise covariance determination device may compute the covariance between microphones from the noise audio data in non-speech segments.
  • The following formula can be used to calculate the noise covariance:
  • $\Phi_n = X(n,k)\,X(n,k)^{H}$
  • Here $X(n,k)$ is the vector composed of the signals of the multiple microphones at the time-frequency (TF) point given by frequency bin $k$ at time $n$, and $\Phi_n$ is the noise covariance. The covariance matrix is obtained by multiplying this vector by its conjugate transpose, denoted $^{H}$.
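  • The covariance calculation can be sketched with NumPy. The text describes multiplying the microphone vector by its conjugate transpose; averaging the result over several non-speech frames, as done below, is a common practice assumed here rather than something the source states. The function name is hypothetical.

```python
import numpy as np


def noise_covariance(noise_frames):
    """Estimate the inter-microphone noise covariance at one frequency bin.

    noise_frames: complex array of shape (num_frames, num_mics) holding the
    STFT values X(n, k) of non-speech frames for a fixed bin k.
    Returns the (num_mics, num_mics) average of X X^H over the frames.
    """
    frames = np.asarray(noise_frames, dtype=complex)
    # For each frame n, accumulate the outer product X[n] X[n]^H.
    outer_sum = np.einsum("nm,nk->mk", frames, frames.conj())
    return outer_sum / frames.shape[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 4)) + 1j * rng.standard_normal((200, 4))
    cov = noise_covariance(x)
    print(cov.shape, np.allclose(cov, cov.conj().T))
```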
  • The voice noise reduction device 5 uses beamforming to separate the target voice signal from the noisy background and enhance it, yielding an enhanced target voice signal.
  • Multi-microphone noise reduction is performed by an algorithm such as MVDR, which can obtain the best noise suppression for the current noise field and target sound source direction.
  • The spatial filter coefficient can be calculated using the following formula:
  • V is the sound propagation direction vector calculated from the sound source localization.
  • The spatial filtering formula is as follows:
  • Y(n, k) is the frequency-domain output after beamforming.
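  • The two formulas referred to above are not present in this text. Since the passage names MVDR, the sketch below assumes the textbook MVDR form, w = Φ⁻¹V / (VᴴΦ⁻¹V) for the filter and Y(n, k) = wᴴX(n, k) for the output, which may differ in detail from the patent's own expressions. The diagonal loading term and the function names are additional assumptions.

```python
import numpy as np


def mvdr_weights(noise_cov, steering, diagonal_loading=1e-6):
    """Textbook MVDR weights for one frequency bin.

    noise_cov: (num_mics, num_mics) Hermitian noise covariance.
    steering:  (num_mics,) direction vector V towards the target source.
    diagonal_loading: small regulariser (assumption, for numerical stability).
    """
    noise_cov = np.asarray(noise_cov, dtype=complex)
    steering = np.asarray(steering, dtype=complex)
    num_mics = noise_cov.shape[0]
    loading = diagonal_loading * np.trace(noise_cov).real / num_mics
    loaded = noise_cov + loading * np.eye(num_mics)
    cov_inv_v = np.linalg.solve(loaded, steering)       # Phi^{-1} V
    return cov_inv_v / (steering.conj() @ cov_inv_v)    # normalise by V^H Phi^{-1} V


def beamform(weights, x):
    """Spatial filtering of one microphone vector: Y = w^H X."""
    return np.asarray(weights).conj() @ np.asarray(x)


if __name__ == "__main__":
    v = np.exp(1j * np.array([0.0, 0.7, 1.4]))
    w = mvdr_weights(np.eye(3), v)
    print(abs(beamform(w, v)))  # the target direction passes with unit gain
```

  • By construction the weights satisfy wᴴV = 1, so a signal arriving from the localized direction is passed undistorted while noise described by the covariance is minimized.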
  • The device further includes a voice detection device 7 configured to detect whether the voice signal contains speech and, if so, to activate the sound source localization device 2, and to detect whether the voice signal includes a noise signal and, if so, to activate the noise covariance determination device 6.
  • Voice detection, also known as voice activity detection (VAD), is the process of identifying whether speech is present in the audio data. Its purpose is to detect whether the current voice signal contains speech, that is, to examine the input signal, distinguish speech from the various background noise signals, and process the two kinds of signal differently.
  • Through voice detection, the device provided in the embodiment of the present application finds the start point and end point of speech in a section of signal that contains it, so that the speech can be passed on to voice enhancement and voice transcription. Effective endpoint detection not only reduces processing time but also eliminates noise interference from silent sections.
  • VAD can be performed by calculating the energy of each frame of the voice signal.
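  • A minimal sketch of frame-energy VAD follows. The frame length and the decibel threshold are illustrative assumptions, not values from the source, and a practical detector would typically adapt the threshold to the noise floor.

```python
import math
from typing import List, Sequence


def frame_energy_db(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame in dB (a small floor avoids log(0))."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)


def detect_voice(samples: Sequence[float], frame_len: int = 160,
                 threshold_db: float = -40.0) -> List[bool]:
    """Flag each full frame whose energy exceeds the threshold as speech."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags


if __name__ == "__main__":
    silence = [0.0] * 160
    tone = [0.5 * math.sin(2 * math.pi * 200 * i / 16000) for i in range(160)]
    print(detect_voice(silence + tone + silence))
```

  • The first and last True flags mark the start and end points of the detected segment.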
  • FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device.
  • After a microphone array (such as a circular array or a square array) collects the voice signals, the multi-channel audio signals are sent to the sound source localization device, the noise covariance determination device, and the voice noise reduction device, and any one channel is sent separately to the voice detection device.
  • The position of the sound source is obtained by sound source localization.
  • The voice noise reduction device uses the sound source position information and the noise covariance information to perform directional speech enhancement on the sound source; at the same time, the sound source position information goes through target sound source judgment to determine whether the current sound source is the target sound source.
  • The target voice filtering device filters the enhanced voice signal to obtain the voice signal of the target area; finally, the signal sending device sends the enhanced target voice signal, via a data collection device deployed at the sound source site (such as a conference or court site), to the cloud for voice transcription.
  • The voice transcription device provided in the embodiment of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal contains speech, it determines the sound source position of the voice signal; if the sound source position is within the target range, the voice signal is used as the target voice signal; and the target voice signal is sent outward so that the voice transcription server transcribes it.
  • This processing applies microphone-array-based multi-microphone enhancement to voice signals in the pickup area, decides from the sound source position whether a signal is the target voice, and filters out sound from outside the target area so that it is not transmitted to the transcription server. It can therefore effectively ensure that the target speech is picked up, improve resistance to interference from non-target speech, and thereby improve the quality of speech transcription.
  • In the foregoing embodiment, a voice transcription device is provided; correspondingly, this application also provides a voice transcription system.
  • FIG. 7 is a schematic diagram of an embodiment of a voice transcription system of the present application.
  • The present application further provides a voice transcription system, including at least one voice transcription device 701 according to the above embodiment and a voice transcription server 702.
  • The voice transcription server 702 is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device 701.
  • The voice transcription device 701 is usually deployed at a sound source site, such as a conference or trial site.
  • The voice transcription device 701 collects a voice signal within the receiving range of the array through a microphone array; then, if the voice signal contains speech, it determines the sound source position through the sound source positioning device; if the sound source position is within the target range, the target voice filtering device uses the voice signal as the target voice signal; finally, the signal transmitting device sends the target voice signal outward, so that the voice transcription server 702 transcribes it.
  • FIG. 8 is a schematic diagram of a usage scenario of an embodiment of a voice transcription system of the present application.
  • In this scenario, six microphone arrays and a data collection device are deployed on site. Each microphone array sends its own enhanced target voice signal to the data collection device, which forwards it to the cloud for voice transcription and receives and displays the transcription result.
  • The voice transcription system provided in the embodiment of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal contains speech, it determines the sound source position of the voice signal; if the sound source position is within the target range, the voice signal is used as the target voice signal; and the target voice signal is sent outward so that the voice transcription server transcribes it.
  • This processing applies microphone-array-based multi-microphone enhancement to voice signals in the pickup area, decides from the sound source position whether a signal is the target voice, and filters out sound from outside the target area so that it is not transmitted to the transcription server. It can therefore effectively ensure that the target speech is picked up, improve resistance to interference from non-target speech, and thereby improve the quality of speech transcription.
  • In the foregoing embodiment, a voice transcription system is provided; correspondingly, this application also provides a voice transcription method, which corresponds to the system embodiment described above.
  • FIG. 9 is a flowchart of an embodiment of a voice transcription method of the present application. Since the method embodiment is basically similar to the system embodiment, it is described relatively simply. For the relevant part, refer to the description of the system embodiment. The method embodiments described below are merely exemplary.
  • The present application further provides a voice transcription method, including:
  • Step S901: Acquire a voice signal within the receiving range of the array through the microphone array.
  • Step S903: If the voice signal contains speech, determine a sound source position of the voice signal.
  • Step S905: If the sound source position is within a target range, use the voice signal as a target voice signal.
  • Step S907: Send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
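  • The control flow of steps S901 to S907 can be sketched as a single function whose stages are passed in as callables. Everything here is hypothetical scaffolding: the stage functions stand in for the voice detection, sound source localization, target filtering, and sending components, whose real interfaces the source does not specify.

```python
from typing import Any, Callable


def transcription_step(frame: Any,
                       contains_speech: Callable[[Any], bool],
                       locate_source: Callable[[Any], float],
                       in_target_range: Callable[[float], bool],
                       send: Callable[[Any], None]) -> bool:
    """Process one frame already collected by the microphone array (S901).

    Returns True if the frame was sent for transcription, False if dropped.
    """
    if not contains_speech(frame):       # S903 applies only when speech is present
        return False
    position = locate_source(frame)      # S903: determine the sound source position
    if not in_target_range(position):    # S905: keep only sources in the target range
        return False
    send(frame)                          # S907: send the target voice signal outward
    return True


if __name__ == "__main__":
    outbox = []
    sent = transcription_step(
        frame=[[0.1, 0.2], [0.1, 0.2]],
        contains_speech=lambda f: True,
        locate_source=lambda f: 10.0,             # pretend azimuth in degrees
        in_target_range=lambda az: -30.0 <= az <= 30.0,
        send=outbox.append,
    )
    print(sent, len(outbox))
```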
  • The method provided in the embodiment of the present application may further include the following step: performing speech enhancement on the target voice signal according to the sound source position; correspondingly, step S907 is implemented by sending the enhanced target voice signal outward.
  • The method provided in the embodiment of the present application may further include the following steps: 1) if the voice signal includes a noise signal, determining a noise covariance of the voice signal; 2) suppressing the noise signal according to the noise covariance.
  • The method provided in the embodiment of the present application may further include the step of acquiring the target range and storing the target range corresponding to the microphone array.
  • The method provided in the embodiment of the present application may further include the step of shielding the voice signal if the sound source position is not within the target range.
  • The method provided in the embodiment of the present application may further include the steps of detecting whether the voice signal contains speech, and detecting whether the voice signal includes the noise signal.
  • The voice transcription method provided in the embodiment of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal contains speech, it determines the sound source position of the voice signal; if the sound source position is within the target range, the voice signal is used as the target voice signal; and the target voice signal is sent outward so that the voice transcription server transcribes it.
  • This processing applies microphone-array-based multi-microphone enhancement to voice signals in the pickup area and decides from the sound source position whether a signal is the target voice. It can therefore effectively ensure that the target speech is picked up, improve resistance to interference from non-target speech, and thereby improve the quality of speech transcription.
  • In the foregoing embodiment, a voice transcription method is provided; correspondingly, the present application also provides an electronic device, which corresponds to the method embodiment described above.
  • FIG. 10 is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment. The device embodiments described below are only schematic.
  • An electronic device in this embodiment includes a processor 1001 and a memory 1002. The memory is configured to store a program implementing the voice transcription method. After the device is powered on and runs the program through the processor, the following steps are performed: a voice signal within the receiving range of the array is collected through the microphone array; if the voice signal contains speech, the sound source position of the voice signal is determined; if the sound source position is within the target range, the voice signal is used as the target voice signal; and the target voice signal is sent outward, so that the voice transcription server performs voice transcription on the target voice signal.
  • a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory.
  • processors CPUs
  • input / output interfaces output interfaces
  • network interfaces network interfaces
  • memory volatile and non-volatile memory
  • Memory may include non-persistent memory, random access memory (RAM), and / or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • RAM random access memory
  • ROM read-only memory
  • flash RAM flash memory
  • Computer-readable media include permanent and non-permanent, removable and non-removable media. Information can be stored by any method or technology. Information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), and read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, read-only disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, Magnetic tape cartridges, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media may be used to store information that can be accessed by computing devices. As defined herein, computer-readable media does not include non-transitory computer-readable media, such as modulated data signals and carrier waves.
  • This application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
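
The processing steps listed at the start of this list (collect, check for voice, locate the source, keep only in-range sources, send) can be illustrated with a minimal Python sketch. It is not the implementation described in the application. It assumes a two-microphone array, an energy threshold as a stand-in for voice detection, a far-field arrival angle as the "sound source position", an angular sector as the "target range", and a plain delay-and-sum average as the multi-microphone enhancement. All names (`has_voice`, `estimate_lag`, `process_frame`, `send`) are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 °C


def has_voice(frame: Sequence[float], energy_threshold: float = 1e-3) -> bool:
    """Hypothetical stand-in for voice detection: mean energy above a threshold."""
    return sum(x * x for x in frame) / len(frame) > energy_threshold


def estimate_lag(ref: Sequence[float], other: Sequence[float], max_lag: int) -> int:
    """Lag in samples by which `other` trails `ref`, taken at the cross-correlation peak."""
    n = len(ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(ref[i] * other[i + lag] for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def process_frame(
    ref: Sequence[float],
    other: Sequence[float],
    sample_rate: int,
    mic_spacing: float,
    target_sector: Tuple[float, float],
    send: Callable[[List[float]], None],
) -> Optional[float]:
    """Forward one frame if it holds voice arriving from inside the target sector.

    Returns the estimated arrival angle in degrees (0 = broadside) when the
    frame is sent, or None when it is dropped.
    """
    if not ref or len(ref) != len(other):
        raise ValueError("both channels must be non-empty and of equal length")
    if not has_voice(ref):
        return None  # no voice detected: nothing to localize or send

    # The physical delay between the microphones bounds the search range.
    max_lag = math.ceil(mic_spacing * sample_rate / SPEED_OF_SOUND)
    lag = estimate_lag(ref, other, max_lag)

    # Far-field model: delay = spacing * sin(angle) / speed of sound.
    sin_theta = lag * SPEED_OF_SOUND / (sample_rate * mic_spacing)
    angle = math.degrees(math.asin(max(-1.0, min(1.0, sin_theta))))

    low, high = target_sector
    if not low <= angle <= high:
        return None  # source outside the target range: do not transmit

    # Delay-and-sum: align the second channel to the first, then average.
    n = len(ref)
    enhanced = [
        (ref[i] + (other[i + lag] if 0 <= i + lag < n else 0.0)) / 2
        for i in range(n)
    ]
    send(enhanced)
    return angle
```

In this sketch `send` would wrap whatever transport carries audio to the transcription server. Dropping out-of-range frames on the device, before `send` is called, mirrors the stated goal that sound from outside the target range never reaches the server.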

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention relates to a voice transcription device, system and method, and an electronic device. A voice transcription device (701) collects a sound signal within the receiving range of a microphone array by means of the array (S901); if the sound signal includes a human voice signal, determines the sound source position of the voice signal (S903); if the sound source position is within the target range, takes the voice signal as the target voice signal (S905); and sends the target voice signal to a voice transcription server (702) so that the server performs voice transcription on the target voice signal (S907). With this processing, multi-microphone enhancement is applied to voice signals within the pickup range of the microphone array, the sound source position is used to decide whether a voice is a target voice, and sound from outside the target region is filtered out so that it is not transmitted to the transcription server. This effectively ensures that the target voice is picked up and improves immunity to interference from non-target voices, thereby improving the quality of voice transcription.
PCT/CN2019/102482 2018-08-30 2019-08-26 Dispositif, système et procédé de transcription vocale, et dispositif électronique WO2020043037A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811004661.6 2018-08-30
CN201811004661.6A CN110875056B (zh) 2018-08-30 2018-08-30 语音转录设备、系统、方法、及电子设备

Publications (1)

Publication Number Publication Date
WO2020043037A1 true WO2020043037A1 (fr) 2020-03-05

Family

ID=69643925

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102482 WO2020043037A1 (fr) 2018-08-30 2019-08-26 Dispositif, système et procédé de transcription vocale, et dispositif électronique

Country Status (2)

Country Link
CN (1) CN110875056B (fr)
WO (1) WO2020043037A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4178224A4 (fr) * 2020-07-16 2024-01-10 Huawei Tech Co Ltd Procédé, appareil et système d'amélioration vocale pour conférences

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516989A (zh) * 2020-03-27 2021-10-19 浙江宇视科技有限公司 声源音频的管理方法、装置、设备和存储介质
CN112750455A (zh) * 2020-12-29 2021-05-04 苏州思必驰信息科技有限公司 音频处理方法及装置
CN113345462B (zh) * 2021-05-17 2023-12-29 浪潮金融信息技术有限公司 一种拾音去噪方法、系统及介质
CN115482828A (zh) * 2021-06-15 2022-12-16 华为技术有限公司 声音信号处理方法及装置、计算机可读存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150379990A1 (en) * 2014-06-30 2015-12-31 Rajeev Conrad Nongpiur Detection and enhancement of multiple speech sources
CN105335336A (zh) * 2015-10-12 2016-02-17 中国人民解放军国防科学技术大学 一种传感器阵列的稳健自适应波束形成方法
CN107316649A (zh) * 2017-05-15 2017-11-03 百度在线网络技术(北京)有限公司 基于人工智能的语音识别方法及装置
CN108122563A (zh) * 2017-12-19 2018-06-05 北京声智科技有限公司 提高语音唤醒率及修正doa的方法
CN108269582A (zh) * 2018-01-24 2018-07-10 厦门美图之家科技有限公司 一种基于双麦克风阵列的定向拾音方法及计算设备

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442833B2 (en) * 2009-02-17 2013-05-14 Sony Computer Entertainment Inc. Speech processing with source location estimation using signals from two or more microphones
EP2498250B1 (fr) * 2011-03-07 2021-05-05 Accenture Global Services Limited Système client et serveur pour le contrôle en langage naturel d'un réseau numérique d'appareils
TW201316328A (zh) * 2011-10-14 2013-04-16 Hon Hai Prec Ind Co Ltd 聲音反饋裝置及其工作方法
US9689959B2 (en) * 2011-10-17 2017-06-27 Foundation de l'Institut de Recherche Idiap Method, apparatus and computer program product for determining the location of a plurality of speech sources
JP2014240940A (ja) * 2013-06-12 2014-12-25 株式会社東芝 書き起こし支援装置、方法、及びプログラム
US20160071526A1 (en) * 2014-09-09 2016-03-10 Analog Devices, Inc. Acoustic source tracking and selection
WO2016126770A2 (fr) * 2015-02-03 2016-08-11 Dolby Laboratories Licensing Corporation Résumé sélectif de conférence
CN106297794A (zh) * 2015-05-22 2017-01-04 西安中兴新软件有限责任公司 一种语音文字的转换方法及设备
US20170243582A1 (en) * 2016-02-19 2017-08-24 Microsoft Technology Licensing, Llc Hearing assistance with automated speech transcription
CN106782596A (zh) * 2016-11-18 2017-05-31 深圳市行者机器人技术有限公司 一种基于麦克风阵列的声源定位跟随系统及方法
CN107172018A (zh) * 2017-04-27 2017-09-15 华南理工大学 公共背景噪声下激活式的声纹密码安全控制方法及系统
CN107527626A (zh) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 一种音频识别系统
CN107742522B (zh) * 2017-10-23 2022-01-14 科大讯飞股份有限公司 基于麦克风阵列的目标语音获取方法及装置

Also Published As

Publication number Publication date
CN110875056A (zh) 2020-03-10
CN110875056B (zh) 2024-04-02

Similar Documents

Publication Publication Date Title
WO2020043037A1 (fr) Dispositif, système et procédé de transcription vocale, et dispositif électronique
CN106782584B (zh) 音频信号处理设备、方法和电子设备
WO2020108614A1 (fr) Procédé de reconnaissance audio et procédé, appareil et dispositif de positionnement audio cible
CN110970057B (zh) 一种声音处理方法、装置与设备
TWI543149B (zh) 雜訊消除方法
CN108109617B (zh) 一种远距离拾音方法
WO2014161309A1 (fr) Procédé et appareil pour qu'un terminal mobile mette en œuvre un suivi de source vocale
JP2019191558A (ja) 音声を増幅する方法及び装置
WO2020029882A1 (fr) Procédé d'estimation d'azimut, dispositif et support de stockage
US11869481B2 (en) Speech signal recognition method and device
US20100098266A1 (en) Multi-channel audio device
US10154345B2 (en) Surround sound recording for mobile devices
CN107124647A (zh) 一种全景视频录制时自动生成字幕文件的方法及装置
CN107452398B (zh) 回声获取方法、电子设备及计算机可读存储介质
KR20090037845A (ko) 혼합 신호로부터 목표 음원 신호를 추출하는 방법 및 장치
Ganguly et al. Real-time smartphone application for improving spatial awareness of hearing assistive devices
CN113409800A (zh) 一种监控音频的处理方法、装置、存储介质及电子设备
WO2023056905A1 (fr) Procédé et appareil de localisation de source sonore et dispositif
Pasha et al. A survey on ad hoc signal processing: Applications, challenges and state-of-the-art techniques
US20190306618A1 (en) Methods circuits devices systems and associated computer executable code for acquiring acoustic signals
US11961501B2 (en) Noise reduction method and device
WO2023065317A1 (fr) Terminal de conférence et procédé d'annulation d'écho
JP2009025025A (ja) 音源方向推定装置およびこれを用いた音源分離装置、ならびに音源方向推定方法およびこれを用いた音源分離方法
WO2023088156A1 (fr) Procédé et appareil de correction de la vitesse du son
JP5022459B2 (ja) 収音装置、収音方法及び収音プログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19854563

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19854563

Country of ref document: EP

Kind code of ref document: A1