CN115334413A - Voice signal processing method, system and device and electronic equipment - Google Patents

Publication number
CN115334413A
Authority
CN
China
Prior art keywords
microphone
signal
voice
microphones
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210836554.XA
Other languages
Chinese (zh)
Inventor
韩润强
吕新亮
赵昊然
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210836554.XA priority Critical patent/CN115334413A/en
Publication of CN115334413A publication Critical patent/CN115334413A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/02 Circuits for transducers, loudspeakers or microphones for preventing acoustic reaction, i.e. acoustic oscillatory feedback
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones

Abstract

The method can be applied to a control device of a microphone system. The control device performs echo cancellation on the first voice signal collected by each microphone in the microphone system to obtain a second voice signal corresponding to each microphone, then determines a target microphone according to the signal-to-noise ratios of the second voice signals, and outputs a voice signal based on the second voice signal corresponding to the target microphone. In this process, the control device automatically selects which microphone's voice signal to output according to the signal-to-noise ratios of the second voice signals, which effectively improves the processing efficiency of the voice signal. Because the second voice signal is the voice signal after echo cancellation, the far-end device can be ensured to receive a clear voice signal.

Description

Voice signal processing method, system and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a method, a system, and an apparatus for processing a speech signal, and an electronic device.
Background
With the rapid development of voice call technology, demand for online conferences is growing. For example, when several people meet in a conference room, others who cannot attend in person can join the conference online over a network.
In the related art, a plurality of microphones are arranged in a conference room to collect voice signals in the conference room. For example, a microphone is placed at a fixed position on the conference table to collect voice signals from a fixed direction, and another microphone (e.g., a wireless microphone) is provided to collect the voice signal of a designated speaker. The conference host then manually selects which microphone's voice signal is transmitted to the remote device of the online conference.
However, in the above scenario some microphones have no echo cancellation function, so manually selecting the microphone cannot ensure that the far-end device receives a clear voice signal.
Disclosure of Invention
The present disclosure provides a method, a system, a device and an electronic device for processing a voice signal, which can ensure that a far-end device receives a clear voice signal. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a speech signal processing method applied to a control device of a microphone system, the microphone system including a plurality of microphones, the method including:
based on a reference signal sent by a remote device, performing echo cancellation on first voice signals collected by each microphone in the microphone system to obtain second voice signals corresponding to each microphone, wherein the remote device and the microphone system are in different position spaces;
determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone;
and sending the voice signal to the far-end equipment based on the second voice signal corresponding to the target microphone.
In the method, the control device performs echo cancellation on the first voice signal collected by each microphone in the microphone system to obtain a second voice signal corresponding to each microphone, then determines a target microphone according to the signal-to-noise ratios of the second voice signals, and outputs a voice signal based on the second voice signal corresponding to the target microphone. In this process, the control device automatically selects which microphone's voice signal to output according to the signal-to-noise ratios of the second voice signals, which effectively improves the processing efficiency of the voice signal. Because the second voice signal is the voice signal after echo cancellation, the far-end device can be ensured to receive a clear voice signal.
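The three steps of the first aspect can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names (`process_frame`, `toy_cancel_echo`, `toy_snr_db`), the fixed echo gain, and the fixed noise power are all assumptions introduced here so that the flow can be run end to end.

```python
import math
from typing import Callable, Dict, List, Tuple

Signal = List[float]


def toy_cancel_echo(first: Signal, reference: Signal, echo_gain: float = 0.5) -> Signal:
    """Hypothetical stand-in for echo cancellation: subtract a scaled copy of the reference."""
    return [s - echo_gain * r for s, r in zip(first, reference)]


def toy_snr_db(signal: Signal, noise_power: float = 0.01) -> float:
    """Hypothetical SNR estimate in dB against an assumed, fixed noise power."""
    power = sum(s * s for s in signal) / len(signal)
    return 10.0 * math.log10(max(power, 1e-12) / noise_power)


def process_frame(
    first_signals: Dict[str, Signal],
    reference: Signal,
    cancel_echo: Callable[[Signal, Signal], Signal],
    estimate_snr: Callable[[Signal], float],
) -> Tuple[str, Signal]:
    """Return (target microphone id, voice signal to send) for one frame."""
    # Step 1: echo-cancel each microphone's first signal using the far-end reference.
    second_signals = {mic: cancel_echo(sig, reference) for mic, sig in first_signals.items()}
    # Step 2: determine the target microphone from the SNR of each second signal.
    snrs = {mic: estimate_snr(sig) for mic, sig in second_signals.items()}
    target = max(snrs, key=snrs.get)
    # Step 3: the target microphone's second signal is the basis of what is sent.
    return target, second_signals[target]
```

Passing the echo canceller and the SNR estimator as parameters keeps the selection logic independent of how either is actually implemented.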
In some embodiments, the determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone includes:
acquiring a first reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of first weights, wherein the first weights indicate the influence degree of the microphone type to which the microphone belongs on the signal-to-noise ratio of the voice signals;
and determining the microphone with the highest first reference value in the plurality of microphones as the target microphone.
In this way, the degree to which the microphone type influences the signal-to-noise ratio of the voice signal is fully considered, so that the target microphone can provide a clearer voice signal.
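A minimal sketch of selection with the first weights follows. The text above does not state how the signal-to-noise ratio and the weight are combined, so the product used here is an assumption, as are the function name, the type labels, and the weight values.

```python
from typing import Dict

# Hypothetical weights per microphone type; the values are illustrative only.
EXAMPLE_TYPE_WEIGHTS: Dict[str, float] = {
    "wired_main": 1.0,
    "wired_extension": 0.9,
    "wireless": 1.2,
}


def select_by_type_weight(
    snr_db: Dict[str, float],
    mic_types: Dict[str, str],
    type_weights: Dict[str, float],
) -> str:
    """Return the microphone whose first reference value is highest."""
    # Assumption: first reference value = SNR * weight of the microphone's type.
    reference_values = {
        mic: snr * type_weights[mic_types[mic]] for mic, snr in snr_db.items()
    }
    return max(reference_values, key=reference_values.get)
```

With a product, the weighting only behaves as intended for positive SNR values; a real system might instead add an offset in dB or weight a linear-scale ratio.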
In some embodiments, the determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second speech signal corresponding to each microphone comprises:
acquiring a second reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of second weights, wherein the second weights indicate the influence degree of the distance between the speaking object and the microphone on the signal-to-noise ratio of the voice signals;
and determining the microphone with the highest second reference value in the plurality of microphones as the target microphone.
In this way, the degree to which the distance between the microphone and the speaking object influences the signal-to-noise ratio of the voice signal is fully considered, so that the target microphone can provide a clearer voice signal.
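The second weights can be illustrated in the same spirit. The sketch below assumes that the distance from the speaking object to each microphone is already known, that closer microphones receive larger weights, and that the second reference value is the product of SNR and weight. None of these choices, nor the distance bands, come from the text above.

```python
from typing import Dict, List, Tuple

# Hypothetical (maximum distance in metres, weight) bands, nearest band first.
EXAMPLE_DISTANCE_BANDS: List[Tuple[float, float]] = [(0.5, 1.0), (2.0, 0.8)]
EXAMPLE_FAR_WEIGHT = 0.6  # hypothetical weight beyond the last band


def distance_weight(distance_m: float) -> float:
    """Map a speaker-to-microphone distance to a second weight."""
    for max_distance, weight in EXAMPLE_DISTANCE_BANDS:
        if distance_m <= max_distance:
            return weight
    return EXAMPLE_FAR_WEIGHT


def select_by_distance_weight(snr_db: Dict[str, float], distance_m: Dict[str, float]) -> str:
    """Return the microphone whose second reference value is highest."""
    # Assumption: second reference value = SNR * distance-derived weight.
    reference_values = {
        mic: snr * distance_weight(distance_m[mic]) for mic, snr in snr_db.items()
    }
    return max(reference_values, key=reference_values.get)
```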
In some embodiments, the sending the voice signal to the remote device based on the second voice signal corresponding to the target microphone includes:
and under the condition that the signal-to-noise ratios of a plurality of historical voice signals meet a target condition, sending the voice signals to the remote equipment based on a second voice signal corresponding to the target microphone, wherein the plurality of historical voice signals are the historical voice signals sent to the remote equipment by the control equipment in a target time period.
In this way, in scenarios where the plurality of historical voice signals were collected by a microphone other than the target microphone (referred to here as the first microphone), the control device does not switch the output microphone from the first microphone to the target microphone immediately after determining the target microphone. Instead, it decides whether to switch according to the signal-to-noise ratios of the plurality of historical voice signals, which avoids abrupt changes in the transmitted voice and preserves the conference experience.
In some embodiments, the transmitting the speech signal to the remote device based on the second speech signal corresponding to the target microphone in the case that the signal-to-noise ratios of the plurality of historical speech signals meet the target condition includes any one of:
under the condition that the average signal-to-noise ratio of the plurality of historical voice signals is smaller than or equal to a first threshold value, transmitting a voice signal to the far-end equipment based on a second voice signal corresponding to the target microphone;
and under the condition that the signal-to-noise ratio of a target number of historical voice signals in the plurality of historical voice signals is smaller than or equal to a second threshold value, transmitting the voice signals to the far-end equipment based on a second voice signal corresponding to the target microphone.
In this way, low signal-to-noise ratios across the plurality of historical voice signals indicate that the voice quality of the microphone currently providing the output is poor or unstable. Switching the output to the target microphone in that case avoids abrupt voice changes while ensuring that the far-end device receives a clear voice signal.
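The two alternative target conditions can be sketched as simple predicates over the SNRs of recently sent signals. The function names are hypothetical. Reading "a target number of historical voice signals" as "at least that many" is an assumption, as is treating an empty history as permission to switch.

```python
from typing import Sequence


def average_condition_met(history_snr_db: Sequence[float], first_threshold: float) -> bool:
    """First alternative: the average SNR of the history is <= the first threshold."""
    if not history_snr_db:
        return True  # Assumption: nothing has been sent yet, so no abrupt change can occur.
    return sum(history_snr_db) / len(history_snr_db) <= first_threshold


def count_condition_met(
    history_snr_db: Sequence[float], second_threshold: float, target_number: int
) -> bool:
    """Second alternative: at least target_number history entries have SNR <= second threshold."""
    low_count = sum(1 for snr in history_snr_db if snr <= second_threshold)
    return low_count >= target_number


def next_output_mic(
    current_mic: str,
    target_mic: str,
    history_snr_db: Sequence[float],
    first_threshold: float,
) -> str:
    """Switch to the target microphone only when the recent output quality has been poor."""
    if target_mic == current_mic:
        return current_mic
    if average_condition_met(history_snr_db, first_threshold):
        return target_mic
    return current_mic
```

For example, with a first threshold of 10 dB, a history of 8, 9, and 7 dB permits the switch, while a history of 20, 22, and 21 dB keeps the current microphone even though another microphone was chosen as the target.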
In some embodiments, the determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second speech signal corresponding to each microphone comprises:
and determining the target microphone from the plurality of microphones based on the second voice signal with the highest signal-to-noise ratio in the second voice signals corresponding to the microphones.
In some embodiments, the transmitting the voice signal to the far-end device based on the second voice signal corresponding to the target microphone includes: and performing signal gain on the second voice signal corresponding to the target microphone, and sending the second voice signal corresponding to the target microphone after signal gain to the remote equipment.
For a voice signal, a higher signal-to-noise ratio indicates higher quality, and therefore indicates that the microphone that collected it is providing a higher-quality voice signal. The voice signal collected by the target microphone can thus be taken to be of higher quality than those collected by the other microphones, which ensures that the far-end device receives a clear voice signal and improves the experience of the online conference participants.
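The signal gain applied to the target microphone's second voice signal is not specified above. One plausible form, sketched here purely as an assumption, scales the signal toward a target RMS level, caps the gain, and clips samples to the range [-1, 1]. The function name and default values are hypothetical.

```python
import math
from typing import List


def apply_gain(signal: List[float], target_rms: float = 0.1, max_gain: float = 10.0) -> List[float]:
    """Scale a signal toward target_rms, limiting the gain and clipping to [-1, 1]."""
    if not signal:
        return []
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    if rms == 0.0:
        return list(signal)  # Silence stays silent; avoids division by zero.
    gain = min(target_rms / rms, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in signal]
```

Capping the gain keeps a nearly silent frame from being amplified into audible noise.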
In some embodiments, the performing echo cancellation on the first voice signal collected by each microphone in the microphone system based on the reference signal sent by the remote device to obtain the second voice signal corresponding to each microphone includes:
performing linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones;
and performing nonlinear echo cancellation on the intermediate voice signals corresponding to the microphones based on the reference signal, the first voice signals corresponding to the microphones and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
In this way, the control device performs linear echo cancellation and then nonlinear echo cancellation on the first voice signals collected by the microphones to remove the echo from them, which yields a plurality of second voice signals of high voice quality and provides the basis for subsequently sending a clear voice signal to the far-end device.
According to a second aspect of embodiments of the present disclosure, there is provided a microphone system comprising a plurality of microphones and a control device for:
performing echo cancellation on first voice signals collected by each microphone in the microphone system based on a reference signal sent by a remote device to obtain second voice signals corresponding to each microphone, wherein the remote device and the microphone system are in different position spaces;
determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone;
and sending the voice signal to the far-end equipment based on the second voice signal corresponding to the target microphone.
In some embodiments, the control device is configured to: acquiring a first reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of first weights, wherein the first weights indicate the influence degree of the type of the microphone to which the microphone belongs on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest first reference value in the plurality of microphones as the target microphone.
In some embodiments, the control device is configured to: acquiring a second reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of second weights, wherein the second weights indicate the influence degree of the distance between the speaking object and the microphone on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest second reference value in the plurality of microphones as the target microphone.
In some embodiments, the control device is configured to: and under the condition that the signal-to-noise ratios of a plurality of historical voice signals meet a target condition, sending the voice signals to the remote equipment based on a second voice signal corresponding to the target microphone, wherein the plurality of historical voice signals are the historical voice signals sent to the remote equipment by the control equipment in a target time period.
In some embodiments, the control device is for any one of:
under the condition that the average signal-to-noise ratio of the plurality of historical voice signals is smaller than or equal to a first threshold value, transmitting a voice signal to the far-end equipment based on a second voice signal corresponding to the target microphone;
and under the condition that the signal-to-noise ratio of a target number of historical voice signals in the plurality of historical voice signals is less than or equal to a second threshold value, transmitting the voice signals to the far-end equipment based on a second voice signal corresponding to the target microphone.
In some embodiments, the control device is configured to: and determining the target microphone from the plurality of microphones based on the second voice signal with the highest signal-to-noise ratio in the second voice signals corresponding to the microphones.
In some embodiments, the control device is configured to: and performing signal gain on the second voice signal corresponding to the target microphone, and sending the second voice signal corresponding to the target microphone after the signal gain to the far-end equipment.
In some embodiments, the control device is configured to: performing linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones; and performing nonlinear echo cancellation on the intermediate voice signals corresponding to the microphones based on the reference signal, the first voice signals corresponding to the microphones and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech signal processing apparatus applied to a control device of a microphone system including a plurality of microphones, the apparatus including:
the echo cancellation unit is configured to perform echo cancellation on first voice signals acquired by each microphone in the microphone system based on a reference signal sent by a far-end device, so as to obtain second voice signals corresponding to each microphone, wherein the far-end device and the microphone system are located in different position spaces;
a determination unit configured to perform determination of a target microphone from the plurality of microphones based on signal-to-noise ratios of the second voice signals corresponding to the respective microphones;
a transmitting unit configured to perform transmitting the voice signal to the far-end device based on a second voice signal corresponding to the target microphone.
In some embodiments, the determining unit is configured to perform: acquiring a first reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of first weights, wherein the first weights indicate the influence degree of the type of the microphone to which the microphone belongs on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest first reference value in the plurality of microphones as the target microphone.
In some embodiments, the determining unit is configured to perform: acquiring a second reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of second weights, wherein the second weights indicate the influence degree of the distance between the speaking object and the microphone on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest second reference value in the plurality of microphones as the target microphone.
In some embodiments, the transmitting unit is configured to perform: and under the condition that the signal-to-noise ratios of a plurality of historical voice signals meet a target condition, sending the voice signals to the remote equipment based on a second voice signal corresponding to the target microphone, wherein the plurality of historical voice signals are the historical voice signals sent to the remote equipment by the control equipment in a target time period.
In some embodiments, the transmitting unit is configured to perform any one of:
under the condition that the average signal-to-noise ratio of the plurality of historical voice signals is smaller than or equal to a first threshold value, transmitting a voice signal to the far-end equipment based on a second voice signal corresponding to the target microphone;
and under the condition that the signal-to-noise ratio of a target number of historical voice signals in the plurality of historical voice signals is smaller than or equal to a second threshold value, transmitting the voice signals to the far-end equipment based on a second voice signal corresponding to the target microphone.
In some embodiments, the determining unit is configured to determine the target microphone from the plurality of microphones based on a second speech signal with a highest signal-to-noise ratio in the second speech signals corresponding to the respective microphones.
In some embodiments, the sending unit is configured to perform signal gain on the second speech signal corresponding to the target microphone, and send the signal-gain second speech signal corresponding to the target microphone to the remote device.
In some embodiments, the echo cancellation unit is configured to perform:
performing linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones;
and performing nonlinear echo cancellation on the intermediate voice signals corresponding to the microphones based on the reference signal, the first voice signals corresponding to the microphones and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
According to a fourth aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the above-mentioned speech signal processing method.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein program code in the computer-readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the above-described voice signal processing method.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described speech signal processing method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of an implementation environment in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of speech signal processing according to an exemplary embodiment;
FIG. 3 is a flow chart illustrating another method of speech signal processing according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a method of speech signal processing according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a speech signal processing apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating another electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this disclosure are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data requires compliance with relevant laws and regulations and standards in relevant countries and regions. For example, the voice signals and the like involved in the embodiments of the present disclosure are acquired with sufficient authorization. In some embodiments, the present disclosure provides an authority inquiry page, where the authority inquiry page is used to inquire whether to grant the acquisition authority of the information, and in the authority inquiry page, an authorization granting control and an authorization denying control are displayed, and when a trigger operation on the authorization granting control is detected, the voice signal processing method provided in the present disclosure is used to acquire the information, thereby implementing processing of a voice signal in an online conference.
FIG. 1 is a schematic diagram of an implementation environment according to an exemplary embodiment. Referring to FIG. 1, the implementation environment includes a microphone system 101, a first terminal 102, a second terminal 103, and a server 104. The microphone system 101 and the first terminal 102 are deployed in a first space; the microphone system 101 and the first terminal 102 are directly or indirectly connected through wired or wireless communication, and the first terminal 102 and the server 104 are directly or indirectly connected through wired or wireless communication. The second terminal 103 is deployed in a second space and is directly or indirectly connected with the server 104 through wired or wireless communication; that is, the second terminal 103 is located in a different location space from the microphone system 101.
The microphone system 101 includes a control device 1011 and a plurality of microphones 1012 (e.g., microphone 1, microphone 2, ..., microphone n, where n is a positive integer). The control device 1011 is configured to control the plurality of microphones 1012 to collect voice signals in the first space, process the voice signals collected by the plurality of microphones 1012, and send the processed voice signals to the second terminal 103 through the first terminal 102, that is, to implement the voice signal processing method provided by the embodiments of the present disclosure.
In some embodiments, the plurality of microphones 1012 includes at least one type of microphone, so that voice signals in the first space are collected from all directions and different conference requirements are met. For example, the plurality of microphones includes a wired microphone and a wireless microphone. The wired microphone includes a main microphone and at least one secondary microphone (also referred to as an extension microphone) and is used for collecting voice signals from a fixed direction in the first space; for example, the first space is a conference room and the wired microphone is placed on a table or in a corner of the conference room, which is not limited. The main microphone may be a microphone array, and may be integrated with the control device 1011 in the same electronic device or provided separately from the control device 1011, which is not limited. The wireless microphone is used for collecting the voice signal of a speaking object at close range; for example, the speaking object can carry the wireless microphone while moving freely in the conference room, so that a clear voice signal of the speaking object can be provided even when the speaking object is far from the main microphone. Illustratively, the wireless microphone is connected to the control device 1011 via Bluetooth or a 2.4 GHz wireless link, but is not limited thereto.
It should be noted that the microphone system 101 shown in the drawings is only an illustrative example, and in some embodiments, the microphone system 101 includes a plurality of microphones 1012, and the microphone system 101 is directly or indirectly connected with a control device through wired or wireless communication, that is, the control device does not belong to the microphone system 101, and the control device is used for controlling the microphone system 101 to implement the voice signal processing method provided by the embodiments of the present disclosure. That is, the control device may be a device existing in the microphone system, or may be a device independent from the microphone system, for example, the control device is an electronic device for providing a cloud service, and the like, and the form of the control device is not limited in the embodiments of the present disclosure.
The first terminal 102 and the second terminal 103 are each at least one of a smart phone, a smart watch, a desktop computer, a laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, and the like. Illustratively, the first terminal 102 and the second terminal 103 are each capable of installing and running an application that provides an online conference function for an object. Illustratively, taking the example that objects in different spaces participate in the same target conference, an object in the first space participates in the target conference through an application running on the first terminal 102, and an object in the second space participates in the target conference through an application running on the second terminal 103. During the conference, the microphone system 101 in the first space collects a voice signal, processes the collected voice signal, and sends the processed voice signal to the second terminal 103 through the first terminal 102, thereby implementing the online conference. In this process, the second terminal 103 is the far-end device with respect to the first terminal 102. It should be understood that the number of first terminals 102 and second terminals 103 may be more or fewer, which is not limited. In addition, a microphone system similar to the microphone system 101 can be deployed in the second space, and details are not repeated herein.
It should be noted that, in some embodiments, the first terminal 102 can implement the function of the control device 1011 in the microphone system 101. That is, in addition to providing the online conference function, the first terminal 102 serves as the control device for the multiple microphones in the first space and implements the voice signal processing method provided in the embodiments of the present disclosure, and the embodiments of the present disclosure are not limited thereto. In addition, the embodiments of the present disclosure are introduced by taking an online conference as an example. In some call scenarios (for example, objects in different spaces talking to each other through telephone terminals), the voice signal processing method provided by the embodiments of the present disclosure can also be applied to implement the corresponding functions, which is not limited herein.
The server 104 is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. The server 104 is illustratively configured to provide background services for applications running on the first terminal 102 and the second terminal 103. It should be noted that the number of servers 104 can be larger or smaller, and is not limited herein. Of course, the server 104 can also include other functional servers to provide more comprehensive and diverse services.
In the embodiment of the present disclosure, a speech signal processing method is provided, which can be applied to the microphone system 101 in the implementation environment shown in fig. 1. The basic flow of the speech signal processing method is described below with reference to fig. 2.
Fig. 2 is a flowchart of a speech signal processing method according to an exemplary embodiment. As shown in fig. 2, the method is performed by a control device of a microphone system, the microphone system comprising a plurality of microphones. Illustratively, the method includes the following steps 201 to 203.
In step 201, the control device performs echo cancellation on the first voice signals collected by each microphone in the microphone system based on the reference signal sent by the remote device, so as to obtain second voice signals corresponding to each microphone.
In embodiments of the present disclosure, the remote device is in a different location space than the microphone system. For any microphone in the plurality of microphones, the control device performs echo cancellation on the first voice signal collected by the microphone based on the reference signal sent by the far-end device, so as to cancel an echo corresponding to the reference signal in the first voice signal, thereby obtaining a second voice signal corresponding to the microphone. That is, for the first voice signal collected by each microphone, the control device performs echo cancellation on the first voice signal to obtain a corresponding second voice signal.
It should be noted that the embodiments of the present disclosure describe the echo cancellation function as being implemented by the control device. In some embodiments, the microphones themselves have an echo cancellation function. Illustratively, for any one of the plurality of microphones, the control device sends the reference signal to the microphone, and the microphone performs echo cancellation on the first voice signal it collected based on the reference signal to obtain the corresponding second voice signal, and sends the second voice signal to the control device. In this way, the echo cancellation steps are distributed to the microphones for execution, which saves computing resources of the control device. Of course, in other embodiments, the plurality of microphones include both microphones with an echo cancellation function and microphones without one, and the control device can perform the echo cancellation step selectively, for example only for the microphones that lack the function.
In step 202, the control apparatus determines a target microphone from the plurality of microphones based on the signal-to-noise ratios of the second voice signals corresponding to the respective microphones.
In the embodiment of the present disclosure, the signal-to-noise ratio (SNR) refers to the ratio of signal to noise. For a voice signal, the higher the signal-to-noise ratio, the higher the quality of the voice signal collected by the corresponding microphone. Illustratively, the control device determines, as the target microphone, the microphone whose second voice signal has the highest signal-to-noise ratio among the second voice signals corresponding to the respective microphones. That is, the voice signal collected by the target microphone is of higher quality than the voice signals collected by the other microphones.
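The selection rule in step 202 can be illustrated with a minimal sketch. The function names, the microphone identifiers, and the SNR values below are hypothetical, and the sketch assumes the signal and noise powers of each second voice signal have already been estimated by some means that the disclosure does not specify.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)

def select_target_microphone(snr_by_mic):
    """Return the id of the microphone whose second voice signal has the highest SNR."""
    if not snr_by_mic:
        raise ValueError("at least one microphone is required")
    return max(snr_by_mic, key=snr_by_mic.get)

# Hypothetical SNR values (dB) of the echo-cancelled signals.
snrs_db = {"wired_master": 18.5, "wired_slave": 12.0, "wireless": 21.3}
target = select_target_microphone(snrs_db)
```

Here `target` identifies the microphone whose echo-cancelled signal is forwarded in step 203.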
It should be noted that, in other embodiments, the control device can also determine the target microphone from the plurality of microphones according to the microphone type to which each microphone belongs, the distance between the speaking object and each microphone, and the like. This is described later in the embodiment shown in fig. 3 and is not repeated here.
In step 203, the control device sends a voice signal to the remote device based on the second voice signal corresponding to the target microphone.
In this disclosure, the control device applies signal gain to the second speech signal corresponding to the target microphone, and sends the gained second speech signal to the remote device. Illustratively, the control device applies an Automatic Gain Control (AGC) algorithm to the second speech signal corresponding to the target microphone to obtain the speech signal after signal gain. The embodiment of the present disclosure does not limit the specific implementation of the AGC algorithm.
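Because the disclosure leaves the AGC algorithm open, the following is only one possible sketch: a frame-based gain that drifts toward a target RMS level. The function names and the parameters `target_rms`, `max_gain`, and `smoothing` are assumptions made for illustration and do not come from the disclosure.

```python
import math

def rms(frame):
    """Root-mean-square level of one non-empty frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def agc(frames, target_rms=0.1, max_gain=10.0, smoothing=0.9):
    """Scale each frame by a slowly varying gain that moves toward target_rms."""
    gain = 1.0
    out = []
    for frame in frames:
        level = rms(frame)
        if level > 1e-9:  # leave the gain unchanged on silent frames
            desired = min(target_rms / level, max_gain)
            gain = smoothing * gain + (1.0 - smoothing) * desired
        out.append([x * gain for x in frame])
    return out

# A quiet, constant-level input: the output level rises toward target_rms.
frames = [[0.01] * 160 for _ in range(200)]
leveled = agc(frames)
```

Smoothing the gain across frames, rather than applying the desired gain immediately, is one common way to avoid audible level jumps.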
In summary, the embodiments of the present disclosure provide a speech signal processing method, which can be applied to a control device of a microphone system, wherein the control device performs echo cancellation on first speech signals collected by microphones in the microphone system respectively to obtain second speech signals corresponding to the microphones, and then determines a target microphone according to a signal-to-noise ratio of the second speech signals, so as to output a speech signal based on the second speech signal corresponding to the target microphone. In the process, the control device can automatically select and output the voice signal corresponding to the microphone according to the signal-to-noise ratio of the second voice signal, so that the processing efficiency of the voice signal is effectively improved, and the second voice signal is the voice signal after echo cancellation, so that the far-end device can be ensured to receive the clear voice signal.
Based on the foregoing embodiment shown in fig. 2, a basic flow of a speech signal processing method provided by the embodiment of the present disclosure is described, and another speech signal processing method provided by the embodiment of the present disclosure is described below with an embodiment shown in fig. 3.
FIG. 3 is a flow chart illustrating another method of speech signal processing according to an example embodiment. As shown in fig. 3, the method is performed by a control device of a microphone system, which is described by way of example as comprising a plurality of microphones and a control device. Illustratively, the method includes the following steps 301 through 305.
In step 301, the control device receives a reference signal transmitted by a remote device and first voice signals collected by respective microphones in a microphone system.
In an embodiment of the disclosure, the control device and the plurality of microphones are disposed in a first space, and the remote device is disposed in a second space. Illustratively, taking an example that objects in a first space and an object in a second space participate in the same target conference, the object in the first space participates in the target conference through an application program running on a first terminal, the control device can receive a reference signal through the first terminal, the object in the second space participates in the target conference through an application program running on a second terminal, and the second terminal serves as a far-end device and sends a voice signal acquired in the second space, that is, the reference signal, to the first terminal.
In some embodiments, a speaker is further disposed in the first space for playing the reference signal transmitted by the remote device, that is, playing out the received voice signal. For example, the speaker and the control device are integrated on the same electronic device; as another example, the speaker is a speaker of the first terminal in the first space; as yet another example, the speaker is a conference sound box. This is not limited in this disclosure.
In some embodiments, each microphone in the microphone system collects a voice signal in the first space, and sends the collected first voice signal corresponding to each microphone to the control device. In some embodiments, an object in the first space is able to select multiple microphones for acquisition of speech signals by the control device. For example, a microphone selection control is deployed on the control device, and the control device determines a plurality of microphones corresponding to an operation in response to the operation on the microphone selection control, and controls the plurality of microphones to collect voice signals, so as to meet personalized requirements and improve conference experience, which is not limited by the embodiments of the present disclosure.
In step 302, the control device performs linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones.
In this disclosure, for any microphone in a plurality of microphones, the control device performs linear echo cancellation on the first voice signal based on the reference signal and the first voice signal collected by the microphone, so as to cancel a linear echo corresponding to the reference signal in the first voice signal, and obtain an intermediate voice signal corresponding to the microphone. That is, for the first voice signal collected by each microphone, the control device performs linear echo cancellation on the first voice signal to obtain a corresponding intermediate voice signal.
Illustratively, the control device applies a Least Mean Square (LMS) algorithm based on an adaptive filter (a digital filter that automatically adjusts its parameters based on the input signal) to perform linear echo cancellation on the first speech signal to obtain a corresponding intermediate speech signal. It should be noted that, other methods for canceling linear echo in a speech signal may be applied to the embodiments of the present disclosure, and are not limited thereto.
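As a minimal sketch of LMS-based linear echo cancellation for a single microphone, the following adapts an FIR filter so that its output tracks the echo of the reference signal and subtracts that estimate from the microphone signal. The filter length `taps`, the step size `mu`, and the simulated echo path are hypothetical values chosen for illustration, and the example contains no near-end speech.

```python
import random

def lms_echo_cancel(mic, ref, taps=4, mu=0.05):
    """Subtract an adaptive FIR estimate of the echo of `ref` from `mic`."""
    w = [0.0] * taps    # adaptive filter coefficients
    buf = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))        # estimated echo
        e = d - y                                         # residual (intermediate signal)
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]  # LMS coefficient update
        out.append(e)
    return out

# Simulated scenario: the microphone picks up only the echo of the reference.
random.seed(0)
ref = [random.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.5, -0.3, 0.1]  # hypothetical impulse response of the room
mic = [sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(ref))]
residual = lms_echo_cancel(mic, ref)
```

In this noise-free simulation the residual decays toward zero as the coefficients converge to the echo path. In a real room, near-end speech and noise remain in the residual, which is the intermediate voice signal passed on to step 303.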
In step 303, the control device performs nonlinear echo cancellation on the multiple intermediate voice signals based on the reference signal, the first voice signals corresponding to the microphones, and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
In this disclosure, for any microphone of the plurality of microphones, the control device performs nonlinear echo cancellation on the intermediate voice signal based on the reference signal, the first voice signal collected by the microphone, and the intermediate voice signal obtained after linear echo cancellation, so as to cancel a nonlinear echo corresponding to the reference signal in the intermediate voice signal, and obtain a second voice signal corresponding to the microphone. That is, for the intermediate voice signal corresponding to each microphone, the control device performs nonlinear echo cancellation on the intermediate voice signal to obtain a corresponding second voice signal.
Illustratively, for any one of the microphones, the control device invokes a nonlinear echo cancellation model, inputs the reference signal, the first speech signal collected by the microphone, and an intermediate speech signal corresponding to the microphone into the nonlinear echo cancellation model, and cancels nonlinear echoes in the intermediate speech signal through the nonlinear echo cancellation model to obtain a corresponding second speech signal. The nonlinear echo cancellation model is a Neural Network model obtained based on deep learning training, for example, the nonlinear echo cancellation model is a Convolutional Neural Network (CNN) model or a Recurrent Neural Network (RNN) model, and the specific type of the nonlinear echo cancellation model is not limited in the embodiments of the present disclosure. It should be noted that, any other neural network model based on deep learning and for eliminating the non-linear echo in the speech signal may be applied to the embodiment of the present disclosure, and is not limited thereto.
In some embodiments, the control device invokes different nonlinear echo cancellation models, based on the microphone types to which the plurality of microphones belong, to perform nonlinear echo cancellation on the intermediate voice signals corresponding to the respective microphones. Illustratively, the plurality of microphones include a wired microphone and a wireless microphone, and the control device invokes a first nonlinear echo cancellation model to perform nonlinear echo cancellation on the intermediate voice signal corresponding to the wired microphone and invokes a second nonlinear echo cancellation model to perform nonlinear echo cancellation on the intermediate voice signal corresponding to the wireless microphone. Of course, in some embodiments, in a case where the wired microphones include a master microphone and a slave microphone, the first nonlinear echo cancellation model includes a plurality of submodels, which are respectively used for performing nonlinear echo cancellation on the intermediate voice signals corresponding to different wired microphones, and this is not limited by the embodiments of the present disclosure. By using different nonlinear echo cancellation models for different types of microphones, the influence of the microphone type on the nonlinear echo is taken into account, which improves the accuracy of nonlinear echo cancellation and the quality of the voice signal.
Through the above steps 302 and 303, the control device performs linear echo cancellation and nonlinear echo cancellation on the first voice signal acquired by each microphone, so as to cancel an echo in the first voice signal, obtain a second voice signal corresponding to each microphone with higher voice signal quality, and provide a basis for subsequently sending a clear voice signal to the far-end device.
Illustratively, take the first space and the second space as a conference room A and a conference room B, respectively. The object in conference room A uses terminal 1 to participate in the target conference, and the object in conference room B uses terminal 2 to participate in the target conference. After receiving the voice signal sent by terminal 2 in conference room B, terminal 1 in conference room A plays it through a loudspeaker. In this case, when the speaking object in conference room A speaks, the first voice signal collected by each microphone includes both the voice signal of the speaking object and an echo generated by the loudspeaker playback. The control device therefore performs the above steps 302 and 303 to cancel the echo in the voice signal collected by each microphone, so that the control device can subsequently determine the target microphone based on the signal-to-noise ratios of the echo-cancelled voice signals and provide a clear voice signal for the target conference.
In step 304, the control apparatus determines a target microphone from the plurality of microphones based on the signal-to-noise ratios of the second voice signals corresponding to the respective microphones.
In the embodiment of the present disclosure, the control device calculates the signal-to-noise ratio of the second voice signal corresponding to each microphone based on the second voice signal corresponding to each microphone, and determines the target microphone from the plurality of microphones based on the second voice signal with the highest signal-to-noise ratio among the second voice signals corresponding to each microphone. In other embodiments, the control device may be further capable of determining the target microphone according to the type of the microphone to which the microphone belongs and the distance between the microphone and the speaking object, and other alternative embodiments of this step 304 are described below, including the following ways:
In the first mode, the control device determines the target microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and the microphone type to which each microphone belongs.
Illustratively, the control device obtains a first reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of first weights, where a first weight indicates the degree of influence of the microphone type to which the microphone belongs on the signal-to-noise ratio of the voice signal, and determines the microphone with the highest first reference value among the plurality of microphones as the target microphone. For example, the plurality of microphones include wired microphones and a wireless microphone, where the first weight corresponding to the master microphone among the wired microphones is 90%, the first weight corresponding to the slave microphone is 85%, and the first weight corresponding to the wireless microphone is 80%. It should be noted that these first weights are only examples; the first weights can be set as required and are not limited thereto.
Through the first mode, the influence degree of the microphone type on the signal-to-noise ratio of the voice signal is fully considered, and therefore the target microphone can provide a clearer voice signal.
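A minimal sketch of the first mode follows. The disclosure does not state how the signal-to-noise ratio and the first weight are combined, so the sketch assumes the first reference value is their product and that the SNR values are positive decibel figures. The dictionary keys and function name are hypothetical; the weights are the example values given above.

```python
# Example first weights from the description, keyed by hypothetical type names.
TYPE_WEIGHTS = {"wired_master": 0.90, "wired_slave": 0.85, "wireless": 0.80}

def first_reference_values(snr_db_by_mic, type_by_mic, weights=TYPE_WEIGHTS):
    """Assumed combination: first reference value = SNR * weight of the microphone type."""
    return {mic: snr * weights[type_by_mic[mic]]
            for mic, snr in snr_db_by_mic.items()}

snrs_db = {"m1": 20.0, "m2": 21.0}                  # hypothetical SNRs in dB
mic_types = {"m1": "wired_master", "m2": "wireless"}
scores = first_reference_values(snrs_db, mic_types)
target = max(scores, key=scores.get)
```

In this example the wireless microphone has the higher raw SNR, but the wired master microphone has the higher first reference value and is therefore selected.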
In the second mode, the control device determines the target microphone based on the signal-to-noise ratio of the second speech signal corresponding to each microphone and the distance between the speaking object and the plurality of microphones.
Illustratively, the control device obtains a second reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of second weights, where a second weight indicates the degree of influence of the distance between the speaking object and the microphone on the signal-to-noise ratio of the voice signal, and determines the microphone with the highest second reference value among the plurality of microphones as the target microphone. The speaking object is the speaking object corresponding to the second voice signal. The control device acquires the distances between the speaking object and the microphones and determines the second weights according to these distances. Illustratively, the farther a microphone is from the speaking object, the lower the second weight corresponding to that microphone. For example, when the speaking object speaks while carrying the wireless microphone, the speaking object is close to the wireless microphone and the wireless microphone can collect a high-quality voice signal; accordingly, the second weight corresponding to the wireless microphone is higher and is set to 100%. The master microphone among the wired microphones is farther from the speaking object, so the second weight corresponding to the master microphone is set to 80%, and so on, which is not limited herein.
Illustratively, the first terminal is capable of recognizing a speaking object currently speaking from a video picture of the first space, calculating distances between the speaking object and the plurality of microphones, and transmitting the distances between the speaking object and the plurality of microphones to the control apparatus for the control apparatus to determine a target microphone based on the distances. Of course, in some embodiments, the plurality of second weights are determined by the first terminal and then sent to the control device, which is not limited in this disclosure.
Through the second mode, the influence degree of the distance between the microphone and the speaking object on the signal to noise ratio of the voice signal is fully considered, so that the target microphone can provide a clearer voice signal.
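The second mode can be sketched in the same spirit. The disclosure only states that a larger distance yields a lower second weight, so the linear mapping below, its `slope_per_m` and `min_weight` parameters, and the product used for the second reference value are all assumptions made for illustration.

```python
def distance_weight(distance_m, slope_per_m=0.1, min_weight=0.5):
    """Hypothetical mapping: weight falls linearly with distance, floored at min_weight."""
    return max(min_weight, 1.0 - slope_per_m * distance_m)

def second_reference_values(snr_db_by_mic, distance_m_by_mic):
    """Assumed combination: second reference value = SNR * distance weight."""
    return {mic: snr * distance_weight(distance_m_by_mic[mic])
            for mic, snr in snr_db_by_mic.items()}

snrs_db = {"wired_master": 20.0, "wireless": 19.0}  # hypothetical SNRs in dB
distances_m = {"wired_master": 3.0, "wireless": 0.3}
scores = second_reference_values(snrs_db, distances_m)
target = max(scores, key=scores.get)
```

With these hypothetical distances, the wireless microphone carried by the speaking object is selected even though its raw SNR is slightly lower.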
It should be noted that, in some embodiments, the control device can also determine the target microphone by combining the above two modes. That is, the control device determines the target microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone, the microphone type to which each microphone belongs, and the distance between the speaking object and each microphone. This further helps ensure that the voice signal provided by the target microphone is the clearest among the microphones, which effectively improves the conference experience. The embodiment of the present disclosure does not limit the specific manner in which the control device determines the target microphone.
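One way the two modes might be combined is to multiply the signal-to-noise ratio by both weights. This multiplicative rule, the `MicObservation` type, and the numbers below are assumptions for illustration only, since the disclosure does not specify the combination.

```python
from dataclasses import dataclass

@dataclass
class MicObservation:
    mic_id: str
    snr_db: float           # SNR of the second voice signal, assumed positive
    type_weight: float      # first weight (microphone type)
    distance_weight: float  # second weight (distance to the speaking object)

def combined_score(obs):
    """Assumed rule: SNR scaled by both the type weight and the distance weight."""
    return obs.snr_db * obs.type_weight * obs.distance_weight

def pick_target(observations):
    return max(observations, key=combined_score).mic_id

observations = [
    MicObservation("wired_master", 20.0, 0.90, 0.80),
    MicObservation("wireless", 19.0, 0.80, 1.00),
]
target = pick_target(observations)
```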
In step 305, the control device sends a voice signal to the remote device based on a second voice signal corresponding to the target microphone.
In this embodiment, the control device performs signal gain on the second voice signal corresponding to the target microphone, and sends the second voice signal corresponding to the target microphone after signal gain to the remote device. This process is the same as step 203 in the embodiment shown in FIG. 2, and therefore is not described again.
In some embodiments, the control device determines whether to transmit a voice signal to a remote device based on a second voice signal corresponding to a target microphone based on signal-to-noise ratios of a plurality of historical voice signals transmitted to the remote device by the control device within a target time period. Illustratively, in the case where the signal-to-noise ratios of the plurality of historical speech signals meet a target condition, the control device transmits a speech signal to the remote device based on a second speech signal corresponding to the target microphone.
The target time period is a preset time period, for example, the target time period is 100 milliseconds, which is not limited herein. That is, in some scenarios, in a case where a plurality of historical speech signals are collected by a first microphone (a microphone of the plurality of microphones other than a target microphone), after the control device determines the target microphone, the control device does not immediately switch the microphone outputting the speech signal from the first microphone to the target microphone, but judges whether the microphone needs to be switched according to the signal-to-noise ratios of the plurality of historical speech signals, so as to avoid sudden speech changes and influence on conference experience.
Several alternatives to the above target conditions are described below:
In the first condition, the control device sends a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the average signal-to-noise ratio of the plurality of historical voice signals is less than or equal to a first threshold. The first threshold is a preset threshold. The control device calculates the average signal-to-noise ratio of the plurality of historical voice signals based on their signal-to-noise ratios and the target time period, and determines, based on the average signal-to-noise ratio, whether to switch the microphone outputting the voice signal to the target microphone. A low average signal-to-noise ratio of the historical voice signals indicates that the voice quality of the microphone currently outputting the voice signal is poor. Switching the output to the target microphone in this case ensures that the far-end device receives a clear voice signal while avoiding abrupt changes in the voice.
In the second condition, the control device sends a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the signal-to-noise ratios of a target number of historical voice signals among the plurality of historical voice signals are less than or equal to a second threshold. The target number is a preset number (for example, 3, which is not limited), and the second threshold is a preset threshold. The presence of individual historical voice signals with low signal-to-noise ratios indicates that the voice quality of the microphone currently outputting the voice signal is unstable. Switching the output to the target microphone in this case ensures that the far-end device receives a clear voice signal while avoiding abrupt changes in the voice.
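The two target conditions can be sketched as simple predicates over the signal-to-noise ratios recorded during the target time period. The function names and threshold values are hypothetical. The sketch also assumes that the average is taken directly over the decibel values and that "a target number" means "at least that many", neither of which is stated in the disclosure.

```python
def should_switch_by_average(history_snr_db, first_threshold_db):
    """First condition: the average SNR of the history is at or below the first threshold."""
    return sum(history_snr_db) / len(history_snr_db) <= first_threshold_db

def should_switch_by_count(history_snr_db, second_threshold_db, target_number=3):
    """Second condition: at least target_number entries are at or below the second threshold."""
    low = sum(1 for s in history_snr_db if s <= second_threshold_db)
    return low >= target_number

# Hypothetical SNRs (dB) of signals sent during the last target time period.
history = [14.0, 9.0, 8.5, 15.0, 7.0]
switch_avg = should_switch_by_average(history, first_threshold_db=12.0)
switch_cnt = should_switch_by_count(history, second_threshold_db=10.0)
```

If neither predicate holds, the control device would keep sending the signal of the microphone that is currently in use, even though a different target microphone has been determined.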
The above steps 301 to 305 are explained below with reference to fig. 4, which is a schematic diagram of a speech signal processing method according to an exemplary embodiment.
As shown in fig. 4, the voice signal processing method is performed by a control apparatus of a microphone system, and the method includes: for the first voice signals collected by each microphone in the microphone system, linear echo cancellation is performed on the first voice signals corresponding to each microphone in combination with the reference signal sent by the far-end device, so as to cancel linear echoes corresponding to the reference signal in the first voice signals, and obtain intermediate voice signals corresponding to each microphone. Furthermore, based on the microphone types to which the microphones belong, calling a plurality of nonlinear echo cancellation models, and based on the reference signal, the first voice signal corresponding to each microphone, and the intermediate voice signal corresponding to each microphone, obtaining a second voice signal corresponding to each microphone, where the second voice signal is a voice signal from which echoes and noise are removed. Then, the control device determines the microphone corresponding to the second speech signal with the highest signal-to-noise ratio as a target microphone based on the signal-to-noise ratio of the second speech signal corresponding to each microphone, performs signal gain on the second speech signal corresponding to the target microphone, and sends the second speech signal to the remote device.
Therefore, the voice signal processing method provided by the embodiment of the disclosure is compatible with a plurality of microphones and automatically determines the target microphone according to the voice signals after echo cancellation, thereby ensuring that the far-end device can receive a clear voice signal. In particular, some microphones, such as wireless microphones, have no echo cancellation function. Although such microphones can move freely with the speaking object and collect a high-quality voice signal, the collected voice signal often contains an echo. With this method, different types of microphones can be combined well, and a clear voice signal can be provided to the far-end device even if the current microphone system contains a microphone without an echo cancellation function.
In summary, the embodiments of the present disclosure provide a speech signal processing method, which can be applied to a control device of a microphone system, where the control device performs echo cancellation on first speech signals collected by microphones in the microphone system respectively to obtain second speech signals corresponding to the microphones, and then determines a target microphone according to a signal-to-noise ratio of the second speech signals, so as to output a speech signal based on the second speech signals corresponding to the target microphone. In the process, the control device can automatically select and output the voice signal corresponding to the microphone according to the signal-to-noise ratio of the second voice signal, so that the processing efficiency of the voice signal is effectively improved, and the second voice signal is the voice signal after echo cancellation, so that the far-end device can be ensured to receive the clear voice signal.
Fig. 5 is a block diagram illustrating a speech signal processing apparatus according to an example embodiment. Referring to fig. 5, the apparatus is applied to a control device of a microphone system including a plurality of microphones, and includes an echo cancellation unit 501, a determination unit 502, and a transmission unit 503.
An echo cancellation unit 501 configured to perform echo cancellation on first voice signals collected by each microphone in the microphone system based on a reference signal sent by a far-end device, so as to obtain second voice signals corresponding to each microphone, where the far-end device and the microphone system are located in different position spaces;
a determining unit 502 configured to perform determining a target microphone from the plurality of microphones based on signal-to-noise ratios of the second voice signals corresponding to the respective microphones;
a transmitting unit 503 configured to transmit the voice signal to the remote device based on the second voice signal corresponding to the target microphone.
In some embodiments, the determining unit 502 is configured to perform: acquiring a first reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of first weights, wherein the first weights indicate the influence degree of the type of the microphone to which the microphone belongs on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest first reference value in the plurality of microphones as the target microphone.
In some embodiments, the determining unit 502 is configured to perform: acquiring a second reference value of each microphone based on the signal-to-noise ratio of the second voice signal corresponding to each microphone and a plurality of second weights, wherein the second weights indicate the influence degree of the distance between the speaking object and the microphone on the signal-to-noise ratio of the voice signals; and determining the microphone with the highest second reference value in the plurality of microphones as the target microphone.
In some embodiments, the transmitting unit 503 is configured to: transmit a voice signal to the far-end device based on the second voice signal corresponding to the target microphone when the signal-to-noise ratios of a plurality of historical voice signals meet a target condition, where the plurality of historical voice signals are the voice signals that the control device transmitted to the far-end device within a target time period.
In some embodiments, the transmitting unit 503 is configured to perform any one of the following:
when the average signal-to-noise ratio of the plurality of historical voice signals is less than or equal to a first threshold, transmitting a voice signal to the far-end device based on the second voice signal corresponding to the target microphone;
when the signal-to-noise ratios of a target number of historical voice signals among the plurality of historical voice signals are less than or equal to a second threshold, transmitting a voice signal to the far-end device based on the second voice signal corresponding to the target microphone.
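The two alternative target conditions above can be sketched as simple predicates over the signal-to-noise ratios of the recently sent signals. The function names are hypothetical, and reading "a target number" as "at least the target number" is an assumption of this sketch.

```python
def average_condition(history_snr_db, first_threshold_db):
    """True when the mean SNR of the historical signals is at or below the threshold."""
    if not history_snr_db:
        return False  # assumption: with no history there is nothing to judge
    return sum(history_snr_db) / len(history_snr_db) <= first_threshold_db


def count_condition(history_snr_db, second_threshold_db, target_number):
    """True when enough historical signals have an SNR at or below the threshold."""
    low = sum(1 for snr in history_snr_db if snr <= second_threshold_db)
    # Assumption: "a target number" is read as "at least target_number".
    return low >= target_number
```

Either predicate can gate the switch to the newly selected target microphone, so that the output only changes when the recently sent signals were poor.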
In some embodiments, the determining unit 502 is configured to determine the target microphone from the plurality of microphones based on the second speech signal with the highest signal-to-noise ratio in the second speech signals corresponding to the respective microphones.
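A minimal sketch of the highest-SNR selection follows. This passage does not specify how the signal-to-noise ratio is estimated, so the sketch assumes that a noise-power estimate per microphone is already available; `snr_db` and `highest_snr_index` are hypothetical names.

```python
import math


def snr_db(samples, noise_power):
    """Estimate SNR in dB from the frame power and an assumed noise-power estimate."""
    power = sum(s * s for s in samples) / len(samples)
    # Guard against taking the log of zero; the floor is an arbitrary small constant.
    speech_power = max(power - noise_power, 1e-12)
    return 10.0 * math.log10(speech_power / max(noise_power, 1e-12))


def highest_snr_index(frames, noise_powers):
    """Return the index of the microphone whose frame has the highest SNR."""
    snrs = [snr_db(f, n) for f, n in zip(frames, noise_powers)]
    return max(range(len(snrs)), key=snrs.__getitem__)
```

For example, given one quiet frame and one loud frame with the same noise power, `highest_snr_index` returns the index of the loud frame.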
In some embodiments, the transmitting unit 503 is configured to apply signal gain to the second voice signal corresponding to the target microphone, and to transmit the gain-adjusted second voice signal to the far-end device.
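The gain step can be sketched as below. Whether the gain is fixed or adapted automatically is not stated in this passage; the sketch assumes a fixed gain in decibels with hard clipping, and `apply_gain` is a hypothetical name.

```python
def apply_gain(samples, gain_db, limit=1.0):
    """Scale samples by gain_db decibels and clip the result to [-limit, limit]."""
    factor = 10.0 ** (gain_db / 20.0)  # convert decibels to an amplitude factor
    return [max(-limit, min(limit, s * factor)) for s in samples]
```

Clipping keeps the amplified samples within the representable range, at the cost of distortion when the gain is set too high.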
In some embodiments, the echo cancellation unit 501 is configured to perform:
performing linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones;
and performing nonlinear echo cancellation on the intermediate voice signals corresponding to the microphones based on the reference signal, the first voice signals corresponding to the microphones and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
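The two-stage echo cancellation above can be sketched as follows. The passage does not name the algorithms used, so this sketch substitutes a normalized least-mean-squares (NLMS) adaptive filter for the linear stage and a simple energy-ratio attenuation rule for the nonlinear stage; both choices, and all function and parameter names, are assumptions made for illustration.

```python
def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Linear stage: subtract an adaptively estimated echo of `ref` from `mic`."""
    w = [0.0] * taps    # estimated echo-path coefficients
    buf = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        echo_est = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_est  # sample of the intermediate voice signal
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out


def suppress_residual(mic, ref, intermediate, frame=64, floor=0.1, ref_gate=1e-6):
    """Nonlinear stage (hypothetical rule): attenuate frames dominated by echo."""
    out = []
    for start in range(0, len(intermediate), frame):
        m = mic[start:start + frame]
        r = ref[start:start + frame]
        e = intermediate[start:start + frame]
        ref_energy = sum(x * x for x in r)
        mic_energy = sum(x * x for x in m) + 1e-12
        res_energy = sum(x * x for x in e)
        if ref_energy > ref_gate:
            # Far end is active: the more the linear stage removed, the more
            # likely the remainder is residual echo, so attenuate it further.
            gain = max(floor, min(1.0, res_energy / mic_energy))
        else:
            gain = 1.0  # far end is silent, so pass near-end speech unchanged
        out.extend(gain * x for x in e)
    return out
```

As in the description, the second stage takes the reference signal, the first voice signal, and the intermediate voice signal as inputs. Calling `suppress_residual(mic, ref, nlms_cancel(mic, ref))` yields the second voice signal for one microphone.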
With this apparatus, the control device performs echo cancellation separately on the first voice signal collected by each microphone in the microphone system to obtain the second voice signal corresponding to each microphone, then determines a target microphone according to the signal-to-noise ratios of the second voice signals, and outputs a voice signal based on the second voice signal corresponding to the target microphone. In this process, the control device automatically selects which microphone's voice signal to output according to the signal-to-noise ratios of the second voice signals, which improves the efficiency of voice signal processing; and because the second voice signals have already undergone echo cancellation, the far-end device receives a clear voice signal.
It should be noted that: in the speech signal processing apparatus provided in the foregoing embodiment, when processing a speech signal, only the division of the above functional modules is used for illustration, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the voice signal processing apparatus and the voice signal processing method provided in the foregoing embodiments belong to the same concept, and detailed implementation processes thereof are described in the method embodiments, and are not described herein again.
In an exemplary embodiment, there is also provided an electronic device including a processor and a memory for storing at least one computer program, the at least one computer program being loaded and executed by the processor to implement the speech signal processing method in the embodiments of the present disclosure.
Taking the case where the electronic device is a terminal as an example, Fig. 6 is a block diagram of an electronic device according to an exemplary embodiment. As shown in Fig. 6, the electronic device is a terminal 600, which may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for handling computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 602 is used to store at least one program code for execution by the processor 601 to implement processes performed by the control device in the speech signal processing method provided by the method embodiments in the present disclosure.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited by the present embodiment.
The radio frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include NFC (Near Field Communication) related circuits, which is not limited in the present disclosure.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, it can also capture touch signals on or above its surface. The touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 605, disposed on the front panel of the terminal 600; in other embodiments, there may be at least two display screens 605, respectively disposed on different surfaces of the terminal 600 or in a foldable design; in still other embodiments, the display screen 605 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 600. The display screen 605 may even be arranged in a non-rectangular irregular shape, that is, as an irregularly shaped screen. The display screen 605 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions, or other fused shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves from the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 601 for processing, or to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as distance measurement. In some embodiments, the audio circuit 607 may also include a headphone jack.
The positioning component 608 is used to determine the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service).
The power supply 609 is used to supply power to the various components in the terminal 600. The power supply 609 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also support fast-charging technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
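As a small illustration of the landscape/portrait decision described above, the following sketch compares the gravity components along two device axes. The axis convention and the function name `orientation` are assumptions; the description does not specify them.

```python
def orientation(gx, gy):
    """Pick a view from the gravity components along the device's x and y axes."""
    # Assumption: y runs along the long edge of the screen, x along the short edge.
    return "portrait" if abs(gy) >= abs(gx) else "landscape"
```

When the terminal is held upright, gravity falls mostly along the long edge, so the sketch reports a portrait view; when the terminal is turned on its side, gravity falls mostly along the short edge and the sketch reports a landscape view.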
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on the side frame of the terminal 600 and/or on the lower layer of the display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, it can detect the user's grip signal on the terminal 600, and the processor 601 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed on the lower layer of the display screen 605, the processor 601 controls the operable controls on the UI according to the user's pressure operation on the display screen 605. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of the user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually decreases, the processor 601 controls the display screen 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually increases, the processor 601 controls the display screen 605 to switch from the screen-off state to the screen-on state.
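The screen switching rule above can be sketched as a small state function over two successive distance readings. The function name, the state labels, and the use of only two readings to detect a gradual change are assumptions made for illustration.

```python
def next_screen_state(state, prev_distance_cm, distance_cm):
    """Return the new screen state given two successive proximity readings."""
    if state == "on" and distance_cm < prev_distance_cm:
        return "off"  # the user is approaching the front panel
    if state == "off" and distance_cm > prev_distance_cm:
        return "on"   # the user is moving away from the front panel
    return state
```

A practical implementation would likely smooth the readings or use thresholds so that sensor jitter does not toggle the screen.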
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
FIG. 7 is a block diagram illustrating another electronic device in accordance with an example embodiment. The electronic device 700 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 701 and one or more memories 702, where at least one program code is stored in the one or more memories 702, and the at least one program code is loaded and executed by the one or more processors 701 to implement the speech signal processing method provided by the above-mentioned method embodiments. The electronic device 700 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium comprising program code is also provided, for example the memory 702 comprising program code, where the program code is executable by the processor 701 of the electronic device 700 to perform the above-described speech signal processing method. Optionally, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the above-described speech signal processing method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A speech signal processing method applied to a control apparatus of a microphone system including a plurality of microphones, the method comprising:
performing echo cancellation on first voice signals collected by each microphone in the microphone system based on a reference signal sent by a remote device to obtain second voice signals corresponding to each microphone, wherein the remote device and the microphone system are in different position spaces;
determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signals corresponding to the microphones;
and sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone.
2. The method of claim 1, wherein determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone comprises:
acquiring a first reference value of each microphone based on the signal-to-noise ratio of a second voice signal corresponding to each microphone and a plurality of first weights, wherein the first weights indicate the influence degree of the type of the microphone to which the microphone belongs on the signal-to-noise ratio of the voice signals;
determining a microphone with a highest first reference value among the plurality of microphones as the target microphone.
3. The method of claim 1, wherein determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone comprises:
acquiring a second reference value of each microphone based on the signal-to-noise ratio of a second voice signal corresponding to each microphone and a plurality of second weights, wherein the second weights indicate the influence degree of the distance between the speaking object and the microphones on the signal-to-noise ratio of the voice signals;
determining a microphone with a highest second reference value among the plurality of microphones as the target microphone.
4. The method of claim 1, wherein the sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone comprises:
sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the signal-to-noise ratios of a plurality of historical voice signals meet a target condition, wherein the plurality of historical voice signals are the voice signals sent to the remote device by the control apparatus within a target time period.
5. The method of claim 4, wherein the sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the signal-to-noise ratios of the plurality of historical voice signals meet the target condition comprises any one of:
sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the average signal-to-noise ratio of the plurality of historical voice signals is less than or equal to a first threshold;
sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone when the signal-to-noise ratios of a target number of historical voice signals among the plurality of historical voice signals are less than or equal to a second threshold.
6. The method of claim 1, wherein determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone comprises:
and determining the target microphone from the plurality of microphones based on the second voice signal with the highest signal-to-noise ratio in the second voice signals corresponding to the microphones.
7. The method of claim 1, wherein the sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone comprises:
applying signal gain to the second voice signal corresponding to the target microphone, and sending the gain-adjusted second voice signal to the remote device.
8. The method according to claim 1, wherein the performing echo cancellation on the first voice signal collected by each microphone in the microphone system based on the reference signal sent by the far-end device to obtain a second voice signal corresponding to each microphone comprises:
performing linear echo cancellation on the first voice signals corresponding to the microphones based on the reference signal and the first voice signals corresponding to the microphones to obtain intermediate voice signals corresponding to the microphones;
and performing nonlinear echo cancellation on the intermediate voice signals corresponding to the microphones based on the reference signal, the first voice signals corresponding to the microphones and the intermediate voice signals corresponding to the microphones to obtain second voice signals corresponding to the microphones.
9. A microphone system, characterized in that the microphone system comprises a plurality of microphones and a control device for:
performing echo cancellation on first voice signals collected by each microphone in the microphone system based on a reference signal sent by a remote device to obtain second voice signals corresponding to each microphone, wherein the remote device and the microphone system are in different position spaces;
determining a target microphone from the plurality of microphones based on the signal-to-noise ratio of the second voice signal corresponding to each microphone;
and sending a voice signal to the remote device based on the second voice signal corresponding to the target microphone.
10. A speech signal processing apparatus, applied to a control device of a microphone system including a plurality of microphones, the apparatus comprising:
an echo cancellation unit configured to perform echo cancellation on first voice signals collected by each microphone in the microphone system based on a reference signal sent by a remote device, so as to obtain second voice signals corresponding to each microphone, wherein the remote device and the microphone system are located in different position spaces;
a determining unit configured to determine a target microphone from the plurality of microphones based on the signal-to-noise ratios of the second voice signals corresponding to the respective microphones;
a transmitting unit configured to transmit a voice signal to the remote device based on the second voice signal corresponding to the target microphone.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the speech signal processing method of any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that, when program code in the computer-readable storage medium is executed by a processor of an electronic device, the electronic device is enabled to execute the speech signal processing method according to any one of claims 1 to 8.
13. A computer program product comprising a computer program, characterized in that the computer program realizes the speech signal processing method of any one of claims 1 to 8 when executed by a processor.
CN202210836554.XA 2022-07-15 2022-07-15 Voice signal processing method, system and device and electronic equipment Pending CN115334413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836554.XA CN115334413A (en) 2022-07-15 2022-07-15 Voice signal processing method, system and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836554.XA CN115334413A (en) 2022-07-15 2022-07-15 Voice signal processing method, system and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115334413A (en) 2022-11-11

Family

ID=83917756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836554.XA Pending CN115334413A (en) 2022-07-15 2022-07-15 Voice signal processing method, system and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115334413A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination