WO2020043037A1

WO2020043037A1 - Voice transcription device, system and method, and electronic device

Info

Publication number: WO2020043037A1
Application number: PCT/CN2019/102482
Authority: WO
Inventors: 余涛; 许云峰; 刘章
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2018-08-30
Filing date: 2019-08-26
Publication date: 2020-03-05
Also published as: CN110875056A; CN110875056B

Abstract

A voice transcription device, system and method, and an electronic device. A voice transcription device (701) acquires a voice signal within an array receiving range by means of a microphone array (S901); if the voice signal comprises a tone signal, determines a sound source position of the tone signal (S903); if the sound source position is within a target range, uses the tone signal as a target tone signal (S905); and sends the target tone signal to a voice transcription server (702) so that the server performs voice transcription on the target tone signal (S907). By means of the processing, multi-microphone enhancement is performed on a tone signal in a pickup range on the basis of a microphone array, whether the tone is a target tone is determined according to the sound source position, and the sound beyond a target region is filtered so as to ensure that the sound beyond the region is not transmitted to a transcription server. Therefore, it can be effectively ensured that a target tone is picked up, so as to increase the anti-interference capability for a non-target tone, thereby improving voice transcription quality.

Description

Voice transcription equipment, system, method and electronic equipment

This application claims priority to Chinese patent application No. 201811004661.6, filed on August 30, 2018 and entitled "Voice Transcription Device, System, Method, and Electronic Device", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of voice signal processing, and in particular, to a voice transcription device, system, method, and electronic device.

Background technique

Speech transcription technology has been a hot research topic in the field of speech signal processing in recent years. With the continuous deepening of research, this technology has been widely used in trial venues and multi-person conferences.

Figure 1 shows a common speech transcription scene. In this solution, a gooseneck microphone device is placed in front of each person. The gooseneck microphone device collects each person's audio and transmits it to the audio processing device, which amplifies the collected original audio. The amplified audio is then sent to the transcription cloud service, which performs speech transcription on it.

However, in the process of implementing the present invention, the inventor found that the technical solution has at least the following problems:

1) Due to the limitations of the gooseneck microphone itself, its effective pickup area is very small. When the user moves out of the effective area or is too far away, the user's voice is suppressed, so the picked-up volume fluctuates, which degrades the transcription;

2) Because the gooseneck microphone's ability to suppress surrounding sound is limited, the voices of nearby people are also easily captured. Its anti-interference ability is therefore poor when there is noise or audio playback during a multi-person conference or trial, resulting in crosstalk in the transcription. To sum up, the prior art has the problems that the target voice cannot be picked up and that external crosstalk interferes.

Summary of the Invention

The present application provides a voice transcription device to solve the problems that the target voice cannot be picked up and external crosstalk interference exists in the prior art. The application additionally provides a speech transcription system and method, and an electronic device.

This application provides a voice transcription device, including:

A voice acquisition device, configured to acquire a voice signal within the receiving range of a microphone array through the microphone array;

A sound source positioning device, configured to determine a sound source position of the human voice signal if the voice signal includes a human voice signal;

A target voice filtering device, configured to use the human voice signal as a target voice signal if the sound source position is within a target range;

A signal sending device, configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.

Optionally, the device further includes:

A voice noise reduction device, configured to perform voice enhancement on the target voice signal according to the position of the sound source;

The signal sending device is specifically configured to send the enhanced target voice signal outward.

Optionally, the device further includes:

A noise covariance determining device, configured to determine a noise covariance of the voice signal if the voice signal includes a noise signal;

The voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.

Optionally, the device further includes:

A target range configuration device, configured to acquire the target range and store the target range.

Optionally, the target voice filtering device is further configured to shield the voice signal if the sound source position is not within the target range.

Optionally, the arrangement of the microphone array includes a square array or a circular array.

Optionally, the device further includes:

A voice detection device, configured to detect whether the voice signal includes a human voice signal; if so, the sound source positioning device is activated.

Optionally, the device further includes:

A voice detection device, configured to detect whether the voice signal includes the noise signal; if so, the noise covariance determining device is activated.

This application also provides a speech transcription system, including:

The above-mentioned voice transcription device, and a voice transcription server; wherein the server is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device.

This application also provides a voice transcription method, including:

Acquiring a voice signal within the receiving range of a microphone array through the microphone array;

If the voice signal includes a human voice signal, determining a sound source position of the human voice signal;

If the sound source position is within a target range, using the human voice signal as a target voice signal;

Sending the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.

Optionally, the method further includes:

Performing voice enhancement on the target voice signal according to the sound source position;

Sending the target voice signal outward then includes:

Sending the enhanced target voice signal outward.

Optionally, the method further includes:

If the voice signal includes a noise signal, determining a noise covariance of the voice signal;

Suppressing the noise signal according to the noise covariance.

Optionally, the method further includes:

Acquiring the target range, and storing the target range corresponding to the microphone array.

Optionally, the method further includes:

If the sound source position is not within the target range, shielding the voice signal.

Optionally, the method further includes:

Detecting whether the voice signal includes a human voice signal; and detecting whether the voice signal includes the noise signal.

This application also provides an electronic device, including:

A microphone array;

A processor; and

A memory for storing a program that implements the voice transcription method. After the device is powered on and runs the program through the processor, the following steps are performed: collecting a voice signal within the array receiving range through the microphone array; if the voice signal includes a human voice signal, determining a sound source position of the human voice signal; if the sound source position is within a target range, using the human voice signal as a target voice signal; and sending the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.

The present application also provides a computer-readable storage medium having instructions stored in the computer-readable storage medium that, when run on a computer, causes the computer to execute the various methods described above.

The present application also provides a computer program product including instructions that, when run on a computer, causes the computer to perform the various methods described above.

Compared with the prior art, this application has the following advantages:

The voice transcription device provided in the embodiment of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, determines a sound source position of the human voice signal; if the sound source position is within a target range, uses the human voice signal as the target voice signal; and sends the target voice signal to a voice transcription server, so that the server performs voice transcription on the target voice signal. This processing method performs multi-microphone enhancement, based on the microphone array, on the voice signal in the pickup area, determines from the sound source position whether the voice is the target voice, and filters out sound from outside the target area so that it does not reach the transcription server. It can therefore effectively ensure that the target voice is picked up and improve the anti-interference ability against non-target voice, thereby improving the quality of voice transcription.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of a voice transcription scene in the prior art;

FIG. 2 is a schematic structural diagram of an embodiment of a voice transcription device provided by the present application;

FIG. 3a is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application;

FIG. 3b is a schematic diagram of a microphone array of an embodiment of a voice transcription device provided by the present application;

FIG. 4 is a specific structural schematic diagram of an embodiment of a voice transcription device provided by the present application;

FIG. 5 is a schematic diagram of another specific structure of an embodiment of a voice transcription device provided by the present application;

FIG. 6 is a data processing flowchart of an embodiment of a voice transcription device provided by the present application;

FIG. 7 is a system schematic diagram of an embodiment of a voice transcription system provided by the present application;

FIG. 8 is a schematic scenario diagram of an embodiment of a speech transcription system provided by the present application;

FIG. 9 is a specific flowchart of an embodiment of a voice transcription method provided by the present application;

FIG. 10 is a schematic diagram of an embodiment of an electronic device provided by the present application.

Detailed Description

Numerous specific details are set forth in the following description to facilitate a full understanding of the application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of this application, so this application is not limited by the specific implementations disclosed below.

In this application, a speech transcription device, system, method, and electronic device are provided. Various schemes are described in detail in the following embodiments.

First embodiment

Please refer to FIG. 2, which is a schematic diagram of an embodiment of a voice transcription device provided by the present application. The device includes a voice acquisition device 1, a sound source localization device 2, a target voice filtering device 3, and a signal transmission device 4.

The voice acquisition device 1 is configured to acquire a voice signal within a receiving range of the array through a microphone array.

The microphone array includes a plurality of microphones, and each microphone is an array element in the array.

A microphone is an energy conversion device that converts a sound signal into an electrical signal. Sound vibrations are transmitted to the diaphragm of the microphone, which drives the magnet inside to produce a varying current; this current is then sent to the subsequent sound processing circuit for amplification.

The microphone array can pick up a voice signal within its receiving range. The receiving range is called the array receiving range and refers to the range of the voice signal that the microphone array can receive. The receiving range of the array depends on the arrangement of the array elements and the number of array elements.

The size of the microphone array is not only closely related to the collection of speech and noise signals, but also affects the accuracy of sound source localization. A microphone is a sound sensor that converts sound signals into voltage signals. When the sound source is far from the microphone, the microphone cannot collect the sound signal, or the collected voltage signal is very small, so the signal-to-noise ratio becomes too low, which is unfavorable for estimating the direction of the sound source. In addition, the larger the spacing between the microphones, the larger the phase difference between the signals they receive from a sound source, and the easier it is to determine the direction of the sound source. The smaller the spacing, the more the resolution decreases due to the spatial aliasing of the phase difference.

The arrangement of the microphone array can be flexibly adjusted according to actual needs. The arrangement of the array elements includes, but is not limited to, a circle, a square, and a linearly arranged shape.

Please refer to FIG. 3a and FIG. 3b, which are schematic diagrams of microphone arrays of an embodiment of the voice transcription device. FIG. 3a shows a square microphone array, in which the array elements are identical and equally spaced; FIG. 3b shows a circular microphone array, in which the array elements are identical and arranged at equal intervals on the circumference.

The voice acquisition device 1 can use a microphone array to perform space-time sampling of voice signals within its receiving range under noisy backgrounds, such as conference venues, multimedia classrooms, large-scale stages, video conferences, car hands-free phones, and battlefields.

The voice signal may include only a human voice signal or only a noise signal, or may include both a human voice signal and a noise signal.

In one example, the voice acquisition device 1 includes three parts: 1) a microphone array; 2) a front-end amplification unit; 3) a multi-channel synchronous sampling unit. The device works as follows. First, the microphone array collects the voice signals within the array receiving range and converts them into analog electrical signals; then the front-end amplification unit amplifies the analog electrical signals; finally, the multi-channel synchronous sampling unit samples the amplified analog signals on all channels simultaneously and converts them into digital electrical signals.

The sound source positioning device 2 is configured to determine a sound source position of the human voice signal if the voice signal includes a human voice signal.

Sound localization refers to a listener's determining the direction and distance of a sound source from sound stimuli in the environment. It depends on the physical characteristics of the sound reaching the two ears, including differences in frequency, intensity, and duration.

The device provided in the embodiment of the present application locates the sound source from the signals of the multiple microphone channels: the position information of the sound source can be obtained from the differences in the delays with which the sound from a source reaches the different microphones.
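The delay-difference idea can be illustrated with a minimal sketch for just two microphones. It assumes a far-field source, a speed of sound of 343 m/s, and a plain cross-correlation search; the function names `estimate_delay` and `delay_to_angle` are hypothetical and are not taken from the application.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means the sound reached `other` later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of the two channels at this candidate lag.
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag, sample_rate, spacing):
    """Convert a delay in samples to an arrival angle in degrees.

    The angle is measured from the broadside of the microphone pair and
    relies on the far-field (plane wave) assumption.
    """
    ratio = lag / sample_rate * SPEED_OF_SOUND / spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.degrees(math.asin(ratio))
```

With a 16 kHz sampling rate and a 0.1 m spacing, a delay of 3 samples corresponds to an arrival angle of roughly 40 degrees from broadside under these assumptions.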

In specific implementation, the sound source position can be obtained by searching, for each time-frequency (TF) point, for the maximum delay-and-sum output, which yields spatial mapping information.
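A minimal sketch of such a search for a single time-frequency point is given below. It assumes a circular array, a far-field source in the array plane, and a fixed grid of candidate azimuths; `circular_array`, `steering_vector`, and `delay_and_sum_azimuth` are hypothetical names chosen for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def circular_array(num_mics, radius):
    """Microphone (x, y) coordinates spaced evenly on a circle."""
    return [
        (radius * math.cos(2 * math.pi * m / num_mics),
         radius * math.sin(2 * math.pi * m / num_mics))
        for m in range(num_mics)
    ]


def steering_vector(mics, azimuth_deg, freq_hz):
    """Relative phase of a far-field plane wave at each microphone."""
    az = math.radians(azimuth_deg)
    ux, uy = math.cos(az), math.sin(az)
    # A microphone with a larger projection onto the arrival direction
    # receives the wavefront earlier, hence the positive phase.
    return [
        cmath.exp(2j * math.pi * freq_hz * (x * ux + y * uy) / SPEED_OF_SOUND)
        for x, y in mics
    ]


def delay_and_sum_azimuth(mics, bin_values, freq_hz, step_deg=5):
    """Return the candidate azimuth whose delay-and-sum power is largest."""
    best_az, best_power = 0, -1.0
    for az in range(0, 360, step_deg):
        sv = steering_vector(mics, az, freq_hz)
        # Undo the assumed per-microphone phase shift, then sum the channels.
        out = sum(v.conjugate() * x for v, x in zip(sv, bin_values))
        power = abs(out) ** 2
        if power > best_power:
            best_az, best_power = az, power
    return best_az
```

In practice the per-bin results would be combined over many time-frequency points; the sketch deliberately covers only the search at one point.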

It should be noted that the sound source localization algorithm is not limited to this one; other algorithms, such as MUSIC, cics, and SRP-PHAT, may also be used. Existing sound source localization algorithms can be roughly divided into three categories: a) algorithms based on time-delay estimation (TDE); b) algorithms based on high-resolution spectral estimation; c) algorithms based on sparse representation. In specific implementation, a sound source localization algorithm can be selected according to requirements.

The target voice filtering device 3 is configured to use the human voice signal as a target voice signal if the sound source position is within a target range.

The target range refers to a spatial range in which a target sound source is located, and can be set by a user according to requirements.

In one example, the device further includes: target range configuration means, configured to acquire the target range and save the range information in a memory.

Specifically, the target voice filtering device 3 can determine, according to the sound source position information, whether the sound source position of the human voice signal is within the target range; if so, the current voice signal is retained, and if not, the current voice signal is shielded.
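The retain-or-shield decision can be sketched as follows. The sketch assumes, purely for illustration, that the target range is an azimuth sector in degrees (the application only speaks of a spatial range); `TargetRange` and `filter_target_voice` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetRange:
    """An azimuth sector in degrees; it may wrap through 0 (e.g. 330 to 30)."""
    start_deg: float
    end_deg: float

    def contains(self, azimuth_deg):
        az = azimuth_deg % 360
        start = self.start_deg % 360
        end = self.end_deg % 360
        if start <= end:
            return start <= az <= end
        # The sector crosses 0 degrees.
        return az >= start or az <= end


def filter_target_voice(frame, azimuth_deg, target):
    """Retain the frame if its source lies in the target range, else shield it."""
    return frame if target.contains(azimuth_deg) else None
```

A stored configuration such as `TargetRange(330, 30)` would then retain a voice localized at 350 degrees and shield one localized at 90 degrees.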

The signal transmitting device 4 is configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.

In one example, the enhanced target voice signal is sent to a cloud voice transcription server for voice transcription via a data collection device deployed at a sound source site (such as a conference or trial site).

Please refer to FIG. 4, which is a specific schematic diagram of an embodiment of a voice transcription device provided by the present application. In this embodiment, the device further includes: a voice noise reduction device 5 configured to perform voice enhancement on the target voice signal according to the position of the sound source; correspondingly, the signal transmission device 4 is specifically configured to send the enhanced target voice signal to the voice transcription server. With this processing method, the current direction vector is calculated from the actual sound direction, and the enhancement direction of the beam is adjusted in real time to achieve the best enhancement effect.

Please refer to FIG. 5, which is another specific schematic diagram of an embodiment of a voice transcription device provided by the present application. In this embodiment, the device further includes: a noise covariance determination device 6.

The noise covariance determination device 6 is configured to determine the noise covariance of the voice signal if the voice signal includes a noise signal; correspondingly, the voice noise reduction device 5 is further configured to suppress the noise signal according to the noise covariance.

The noise covariance determining device may perform covariance calculation between microphones according to noise audio data in a non-speech segment. In specific implementation, the following formula can be used to calculate the noise covariance:

$$\Phi_n = \sum X(n,k)\,X(n,k)^{H}$$

$$X(n,k) = \left(x_1(n,k),\ x_2(n,k),\ \ldots,\ x_M(n,k)\right)^{T}$$

Here, M is the number of array elements of the microphone array; n is the sampling instant; k is the frequency included in the voice signal; and X is the voice signal. As the formulas show, X(n,k) is the vector composed of the M microphone signals at the TF point with frequency k at time n, and the noise covariance matrix Φ_n is obtained by multiplying this vector by its conjugate transpose (denoted by the superscript H).
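The accumulation can be sketched for a single frequency bin as below. The sketch assumes that the sum runs over the noise-only frames supplied by the caller, which the text does not state explicitly; `noise_covariance` is a hypothetical name.

```python
def noise_covariance(noise_frames):
    """Accumulate X * X^H over noise-only frames for one frequency bin.

    `noise_frames` is a list of length-M lists of complex values, one list
    per frame, holding the M microphone signals at that bin.
    """
    m = len(noise_frames[0])
    phi = [[0j] * m for _ in range(m)]
    for x in noise_frames:
        for i in range(m):
            for j in range(m):
                # Entry (i, j) of the outer product with the conjugate transpose.
                phi[i][j] += x[i] * x[j].conjugate()
    return phi
```

The resulting matrix is Hermitian, with the per-microphone noise power on its diagonal.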

In one example, the voice noise reduction device 5 uses a beamforming technique to separate the target voice signal under a noisy background and enhance it to obtain an enhanced target voice signal. For example, the microphone noise reduction processing is performed by an algorithm such as MVDR, and the optimal noise suppression effect can be obtained according to the current noise field and the target sound source direction. Among them, the spatial filter coefficient can be calculated using the following formula:

Among them, V is the sound propagation direction vector calculated from the sound source localization.

The spatial filtering formula is as follows:

Y (n, k) = W · X (n, k)

Y (n, k) is the frequency-domain output after beamforming.
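The expression for the spatial filter coefficient is not reproduced in this text. As a hedged illustration only, the sketch below assumes the textbook MVDR solution, w = Φ_n⁻¹V / (Vᴴ Φ_n⁻¹ V), applied as Y = wᴴX, so that the W of the spatial filtering formula corresponds to wᴴ. This may differ from the exact expression in the application, and `solve`, `mvdr_weights`, and `beamform` are hypothetical names.

```python
def solve(a, b):
    """Solve a x = b for a small complex matrix (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def mvdr_weights(phi_n, v):
    """Textbook MVDR weights for noise covariance `phi_n` and direction vector `v`."""
    num = solve(phi_n, v)                                        # Phi^-1 V
    denom = sum(vi.conjugate() * ni for vi, ni in zip(v, num))   # V^H Phi^-1 V
    return [ni / denom for ni in num]


def beamform(w, x):
    """Spatial filtering of one time-frequency point: Y = w^H X."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))
```

Under this assumption a signal arriving exactly from the direction described by V passes with unit gain, while the output noise power is minimized for the estimated noise field.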

In one example, the device further includes: a voice detection device 7, configured to detect whether the voice signal includes a human voice signal and, if so, to activate the sound source localization device 2; and to detect whether the voice signal includes the noise signal and, if so, to activate the noise covariance determination device 6.

Voice detection, also known as voice activity detection (VAD), is a process for identifying whether voice is present in the data. Its purpose is to detect whether the current voice signal contains a human voice signal, that is, to examine the input signal, distinguish the human voice signal from the various background noise signals, and process the two kinds of signal differently.

The device provided in the embodiment of the present application finds the start point and the end point of a voice from a section of a signal including a voice through voice detection, so that the voice signal can be processed for voice transcription and voice enhancement. Effective endpoint detection not only reduces processing time, but also eliminates noise interference from silent sections.

In specific implementation, VAD detection can be performed by calculating the energy of each frame of the speech signal.
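A minimal sketch of energy-based detection follows. It assumes non-overlapping frames and a fixed, caller-supplied threshold, whereas a practical detector would usually adapt the threshold to the noise floor; `frame_energy`, `energy_vad`, and `speech_endpoints` are hypothetical names.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(samples, frame_len, threshold):
    """Return one flag per frame: 1 if the frame energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(1 if frame_energy(frame) > threshold else 0)
    return flags


def speech_endpoints(flags):
    """Return (first, last) indices of frames flagged as voice, or None."""
    active = [i for i, flag in enumerate(flags) if flag]
    return (active[0], active[-1]) if active else None
```

Frames flagged 1 would be routed to sound source localization, and frames flagged 0 could be used to update the noise covariance.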

Please refer to FIG. 6, which is a data processing flowchart of an embodiment of a voice transcription device. In this embodiment, a microphone array (such as a circular array or a square array) picks up sound to obtain multiple microphone signals. After multi-microphone data acquisition and processing, the multi-channel audio signals are sent to the sound source localization device, the noise covariance determination device, and the voice noise reduction device, and any one channel is sent separately to the voice detection device. The VAD device detects whether a human voice signal is currently present. If so (VAD = 1), the sound source localization device locates the sound source and obtains its position; if the signal is noise, it is sent to the noise covariance determination device to estimate the noise covariance matrix. The voice noise reduction device performs voice enhancement on the located sound source using the obtained sound source position information and noise covariance information. At the same time, the sound source position information undergoes target sound source judgment to determine whether the current sound source is the target sound source. According to the judgment result, the target voice filtering device filters the enhanced voice signal to obtain the voice signal of the target area. Finally, the signal sending device sends the enhanced target voice signal, via a data collection device deployed at the sound source site (such as a conference or court site), to the cloud for voice transcription.

It can be seen from the foregoing embodiments that the voice transcription device provided in the embodiments of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, determines a sound source position of the human voice signal; if the sound source position is within the target range, uses the human voice signal as the target voice signal; and sends the target voice signal outward so that the voice transcription server performs voice transcription on it. This processing method performs multi-microphone enhancement, based on the microphone array, on the voice signal in the pickup area, determines from the sound source position whether the voice is the target voice, and filters out sound from outside the target area so that it is not transmitted to the transcription server. It can therefore effectively ensure that the target voice is picked up and improve the anti-interference ability against non-target voice, thereby improving the quality of voice transcription.

In the above embodiments, a voice transcription device is provided. Correspondingly, this application also provides a voice transcription system.

Second embodiment

Please refer to FIG. 7, which is a schematic diagram of an embodiment of a voice transcription system of the present application. The present application further provides a voice transcription system, including: at least one voice transcription device 701 according to the above embodiment, and a voice transcription server 702.

The voice transcription server 702 is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device 701.

The voice transcription device 701 is usually deployed at a sound source site, such as a conference or a trial site. The voice transcription device 701 can collect a voice signal within the receiving range of the array through a microphone array; then, if the voice signal includes a human voice signal, determine a sound source position of the human voice signal through the sound source positioning device; if the sound source position is within the target range, use the human voice signal as the target voice signal through the target voice filtering device; and finally send the target voice signal outward through the signal transmitting device, so that the voice transcription server 702 performs voice transcription on the target voice signal.

Please refer to FIG. 8, which is a schematic diagram of a usage scenario of an embodiment of a voice transcription system of the present application. In this embodiment, six microphone arrays and a data collection device are deployed on site. Each microphone array sends its own target sound source signal to the data collection device, which sends the enhanced target voice signals to the cloud for voice transcription and receives and displays the transcription result.

It can be seen from the foregoing embodiments that the voice transcription system provided by the embodiment of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, determines a sound source position of the human voice signal; if the sound source position is within the target range, uses the human voice signal as the target voice signal; and sends the target voice signal outward so that the voice transcription server performs voice transcription on it. This processing method performs multi-microphone enhancement, based on the microphone array, on the voice signal in the pickup area, determines from the sound source position whether the voice is the target voice, and filters out sound from outside the target area so that it is not transmitted to the transcription server. It can therefore effectively ensure that the target voice is picked up and improve the anti-interference ability against non-target voice, thereby improving the quality of voice transcription.

In the above embodiments, a speech transcription system is provided. Correspondingly, this application also provides a speech transcription method. This method corresponds to the embodiment of the system described above.

Third embodiment

Please refer to FIG. 9, which is a flowchart of an embodiment of a voice transcription method of the present application. Since the method embodiment is basically similar to the system embodiment, it is described relatively simply. For the relevant part, refer to the description of the system embodiment. The method embodiments described below are merely exemplary.

The present application further provides a voice transcription method, including:

Step S901: Acquire a voice signal in the receiving range of the array through the microphone array.

Step S903: If the voice signal includes a human voice signal, determine a sound source position of the human voice signal.

Step S905: If the sound source position is within a target range, use the human voice signal as a target voice signal.

Step S907: Send the target voice signal outward, so that a voice transcription server performs voice transcription of the target voice signal.
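The control flow of steps S903 to S907 for one captured frame can be sketched as follows. The detection, localization, range check, and sending operations are passed in as callables, all of which are hypothetical placeholders rather than interfaces defined by the application.

```python
def transcribe_frame(frame, contains_voice, locate_source, in_target_range, send):
    """Run one frame captured in step S901 through steps S903 to S907.

    Returns True if the frame was sent for transcription, False if it was
    dropped because it held no human voice or came from outside the target range.
    """
    if not contains_voice(frame):        # S903 applies only to human voice
        return False
    position = locate_source(frame)      # S903: determine the source position
    if not in_target_range(position):    # S905: keep only in-range sources
        return False                     # out-of-range voice is shielded
    send(frame)                          # S907: send to the transcription server
    return True
```

Keeping the four operations as parameters lets the same flow be exercised with any localization algorithm or transport, which matches the statement that the method embodiments are merely exemplary.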

In an example, the method provided in the embodiment of the present application may further include the following step: performing voice enhancement on the target voice signal according to the sound source position; correspondingly, step S907 is implemented by sending the enhanced target voice signal outward.

In one example, the method provided in the embodiment of the present application may further include the following steps: 1) if the voice signal includes a noise signal, determining a noise covariance of the voice signal; 2) suppressing the noise signal according to the noise covariance.

In an example, the method provided in the embodiment of the present application may further include the steps of: acquiring the target range, and storing the target range corresponding to the microphone array.

In an example, the method provided in the embodiment of the present application may further include the step of: if the sound source position is not within the target range, shielding the voice signal.

In an example, the method provided in the embodiment of the present application may further include the steps of: detecting whether the voice signal includes a human voice signal; and detecting whether the voice signal includes the noise signal.

It can be seen from the foregoing embodiments that the voice transcription method provided in the embodiments of the present application collects a voice signal within the receiving range of the array through a microphone array; if the voice signal includes a human voice signal, determines a sound source position of the human voice signal; if the sound source position is within the target range, uses the human voice signal as the target voice signal; and sends the target voice signal outward so that the voice transcription server performs voice transcription on it. This processing method performs multi-microphone enhancement, based on the microphone array, on the voice signal in the pickup area, determines from the sound source position whether the voice is the target voice, and filters out sound from outside the target area so that it is not transmitted to the transcription server. It can therefore effectively ensure that the target voice is picked up and improve the anti-interference ability against non-target voice, thereby improving the quality of voice transcription.

In the above embodiments, a voice transcription method is provided. Correspondingly, the present application also provides an electronic device. This device corresponds to the embodiment of the method described above.

Fourth embodiment

Please refer to FIG. 10, which is a schematic diagram of an embodiment of an electronic device of the present application. Since the device embodiment is basically similar to the method embodiment, it is described relatively simply. For the relevant part, refer to the description of the method embodiment. The device embodiments described below are only schematic.

An electronic device in this embodiment includes: a processor 1001 and a memory 1002. The memory is configured to store a program for implementing a voice transcription method; after the device is powered on and the processor runs the program of the voice transcription method, the following steps are performed: collecting a voice signal within the receiving range of the array through the microphone array; if the voice signal includes a voice signal, determining the sound source position of the voice signal; if the sound source position is within the target range, using the voice signal as the target voice signal; and sending the target voice signal outward, so that the voice transcription server performs voice transcription on the target voice signal.

Although the present application is disclosed above with the preferred embodiments, it is not intended to limit the present application. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application. The scope of protection shall be subject to the scope defined by the claims of this application.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media include permanent and non-permanent, removable and non-removable media. Information can be stored by any method or technology. Information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

2. Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims

A voice transcription device, comprising:

A voice acquisition device for acquiring a voice signal in the receiving range of the array through a microphone array;

A sound source positioning device, configured to determine a sound source position of the voice signal if the voice signal includes a voice signal;

A target voice filtering device, configured to use the voice signal as a target voice signal if the sound source position is within a target range;

A signal sending device, configured to send the target voice signal outward, so that a voice transcription server performs voice transcription on the target voice signal.
The device according to claim 1, further comprising:

A voice noise reduction device, configured to perform voice enhancement on the target voice signal according to the position of the sound source;

The signal sending device is specifically configured to send the enhanced target voice signal outward.
The device according to claim 2, further comprising:

A noise covariance determining device, configured to determine a noise covariance of the voice signal if the voice signal includes a noise signal;

The voice noise reduction device is further configured to suppress the noise signal according to the noise covariance.
The device according to claim 1, further comprising:

A target range configuration device, configured to acquire the target range and store the target range.
The device according to claim 1, further comprising:

The target voice filtering device is further configured to shield the voice signal if the sound source position is not within the target range.
The device according to claim 1, characterized in that:

The arrangement of the microphone array includes a square array or a circular array.
The device according to claim 1, further comprising:

A voice detection device, configured to detect whether the voice signal includes a voice signal and, if so, to activate the sound source positioning device.
The device according to claim 2, further comprising:

A voice detection device, configured to detect whether the voice signal includes a noise signal and, if so, to activate a noise covariance determining device.
A speech transcription system, comprising:

The voice transcription device according to any one of the preceding claims 1-8, and a voice transcription server; wherein the server is configured to perform voice transcription on a target voice signal uploaded by the voice transcription device.
A speech transcription method, comprising:

Acquiring a voice signal within the receiving range of the array through a microphone array;

If the voice signal includes a voice signal, determining a sound source position of the voice signal;

If the sound source position is within a target range, using the voice signal as a target voice signal;

Sending the target voice signal outward, so that a voice transcription server performs voice transcription of the target voice signal.
The method according to claim 10, further comprising:

Performing speech enhancement on the target voice signal according to the sound source position;

Wherein sending the target voice signal outward comprises:

Sending the enhanced target voice signal outward.
The method according to claim 11, further comprising:

If the voice signal includes a noise signal, determining a noise covariance of the voice signal;

Suppressing the noise signal according to the noise covariance.
The method according to claim 11, further comprising:

Acquiring the target range, and storing the target range corresponding to the microphone array.
The method according to claim 11, further comprising:

If the sound source position is not within the target range, shielding the voice signal.
The method according to claim 12, further comprising:

Detecting whether the voice signal includes a voice signal; and detecting whether the voice signal includes the noise signal.
An electronic device, comprising:

A microphone array;

A processor; and

A memory for storing a program that implements the voice transcription method, wherein after the device is powered on and the processor runs the program of the voice transcription method, the following steps are performed: collecting a voice signal within the receiving range of the array through the microphone array; if the voice signal includes a voice signal, determining a sound source position of the voice signal; if the sound source position is within a target range, using the voice signal as a target voice signal; and sending the target voice signal outward, so that the voice transcription server performs voice transcription on the target voice signal.