CN117809628A - Far-field voice data expansion method, server and electronic equipment

Info

Publication number: CN117809628A
Application number: CN202311419047.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: Liu Yu (刘宇)
Applicant and current assignee: Hisense Visual Technology Co Ltd
Legal status: Pending

Abstract

Embodiments of the present application disclose a far-field voice data expansion method, a server and an electronic device. The method comprises: receiving voice data uploaded by an electronic device and judging the category of the voice data; if the voice data is of the far-field category, storing it in a first database; if the voice data is of the near-field category, storing it in a second database; screening far-field sample data from the first database, and/or simulating far-field sample data from near-field voice data in the second database or an open-source voice data set, the far-field sample data being used to train a far-field voice processing model; and storing the far-field sample data. The method and device increase the speed at which far-field sample data are accumulated and expanded, avoid problems such as far-field sample data not matching device acquisition channels, and achieve coverage of different scenes and fields through online data expansion; in particular, when the two expansion modes based on the first database and the second database operate together, the expansion speed of far-field sample data is significantly increased.

Description

Far-field voice data expansion method, server and electronic equipment
Technical Field
The application relates to the technical field of voice, in particular to a far-field voice data expansion method, a server and electronic equipment.
Background
A voice interaction scene may involve near-field voice or far-field voice. Near-field voice refers to voice interaction between a user and a sound collector at close range, for example, a user speaking a voice command into a handheld smartphone, or long-pressing the voice key of a remote control to input a voice command to a smart television. Far-field voice is voice interaction over a relatively long distance: for example, a user issues a voice command in a conference room, classroom or smart-home scene, a device such as a microphone array arranged in the scene captures the user's voice signal, and the voice system processes and responds to the signal.
When developing an algorithm model for far-field voice, a large amount of far-field voice data adapted to the electronic device and its microphone array is usually required; this data is used to train or improve the model and to raise its operating accuracy. However, when far-field voice data is actually accumulated, the following problems arise: (1) if far-field voice data is collected with real devices, recording the voice data and annotating its text consumes a great deal of time and labor, the collection speed is low, and the development efficiency and progress of the far-field voice algorithm are affected; (2) if far-field voice data is purchased from suppliers, the purchased data may not match the data acquisition channel of the current device, and the amount of far-field voice data on the market is small, insufficient to cover all application scenarios and fields.
Disclosure of Invention
Some embodiments of the present application provide a far-field voice data expansion method, a server and an electronic device, so as to increase the speed at which far-field sample data are accumulated and expanded, avoid problems such as far-field sample data not matching devices and acquisition channels, and achieve coverage of different scenes and fields by continuously and dynamically expanding data online, thereby improving the training efficiency and accuracy of the model.
In a first aspect, some embodiments of the present application provide a server, including:
a first communicator for communication with the electronic device;
a first controller for performing:
receiving voice data uploaded by electronic equipment, and judging the category of the voice data;
if the voice data is of far-field type, storing the voice data into a first database;
if the voice data is of a near field type, storing the voice data into a second database;
screening far-field sample data according to the first database, and/or simulating the far-field sample data according to near-field voice data in the second database or an open-source voice data set, wherein the far-field sample data is used for training a far-field voice processing model; wherein the open source speech data set comprises near field speech data acquired through other approaches;
and storing the far-field sample data.
In some embodiments, the first controller screens far field sample data from the first database, comprising: acquiring a first far-field voice data set meeting a first screening condition from the first database, wherein the first screening condition comprises equipment information, recording time and region information of target equipment; acquiring target far-field voice data meeting second screening conditions from the first far-field voice data set, wherein the second screening conditions comprise target audio duration and target signal-to-noise ratio; performing voice recognition on the target far-field voice data to obtain target text information; and expanding the target far-field voice data and the target text information into the far-field sample data.
In some embodiments, the first controller screens far field sample data from the first database, comprising: acquiring a first far-field voice data set meeting a first screening condition from the first database, wherein the first screening condition comprises device information, recording time and region information of a target device; acquiring target far-field voice data meeting a second screening condition from the first far-field voice data set, wherein the second screening condition comprises a target audio duration and a target signal-to-noise ratio; calling N different voice recognition interfaces and performing voice recognition on the target far-field voice data respectively to obtain N pieces of target text information, wherein N is the number of called voice recognition interfaces and N is greater than 1; and if the N pieces of target text information are completely consistent, expanding the target far-field voice data and the uniquely identified target text information into the far-field sample data.
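As an illustration of the multi-interface verification logic described above, the following Python sketch runs one piece of far-field audio through N recognizers and keeps the transcript only when all results agree; the recognizer callables and their signatures are hypothetical, since the embodiment does not name concrete interfaces.

```python
from typing import Callable, List, Optional

def verify_transcript(audio: bytes,
                      recognizers: List[Callable[[bytes], str]]) -> Optional[str]:
    """Run the same far-field audio through N > 1 recognition interfaces and
    return the transcript only if all N results are identical."""
    transcripts = [recognize(audio) for recognize in recognizers]
    if len(set(transcripts)) == 1:  # unanimous: uniquely identified text
        return transcripts[0]
    return None                     # ambiguous: do not expand this sample
```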
In some embodiments, the first controller simulates the far-field sample data from near-field voice data in the second database, comprising: creating a far-field simulated room and setting the topology of a microphone array; setting the positions of the microphone array, the sound source, the sound player and the noise in the far-field simulated room; acquiring near-field voice data from the second database or the open-source voice data set, setting it as the sound source signal, and playing it at the sound source position; setting the sound environment in the far-field simulated room and simulating far-field audio signals to obtain a multi-channel audio set; converting the multi-channel audio set into single-channel voice data using a microphone array algorithm; and expanding the single-channel voice data and its text information into the far-field sample data.
In some embodiments, the first controller sets the sound environment in the far-field simulated room and simulates a far-field audio signal, comprising: controlling the sound player to play target audio and simulating a far-field audio signal FS₁ containing echo, FS₁ = y + x*rir; wherein y represents the sound source signal collected by the microphone array, x represents the echo signal played by the sound player, rir represents the impulse response of the far-field simulated room, and * represents the convolution operation.
In some embodiments, the first controller sets the sound environment in the far-field simulated room and simulates a far-field audio signal, comprising: simulating a far-field audio signal FS₂ containing reverberation, FS₂ = y*rir.
In some embodiments, the first controller sets the sound environment in the far-field simulated room and simulates a far-field audio signal, comprising: applying a noise signal at the noise position and simulating a far-field audio signal FS₃ containing noise, FS₃ = y + z×(10^(-SNR/20)); wherein z represents the noise signal, SNR represents the target signal-to-noise ratio, × represents multiplication, and ^ represents the power operation.
In some embodiments, the first controller sets the sound environment in the far-field simulated room and simulates a far-field audio signal, comprising:
simulating a far-field audio signal FS₄ containing both echo and reverberation, FS₄ = x*rir + y*rir;
and/or simulating a far-field audio signal FS₅ containing both echo and noise, FS₅ = y + x*rir + z×(10^(-SNR/20));
and/or simulating a far-field audio signal FS₆ containing both reverberation and noise, FS₆ = y*rir + z×(10^(-SNR/20));
and/or simulating a far-field audio signal FS₇ containing echo, reverberation and noise simultaneously, FS₇ = x*rir + y*rir + z×(10^(-SNR/20)).
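The seven signal combinations above can be summarized in a short sketch. The following Python code is a minimal reading of the formulas, assuming equal-length mono numpy arrays y (sound source signal), x (loudspeaker/echo signal), z (noise) and an impulse response rir; it is illustrative only, not the patent's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(y, x, z, rir, snr_db):
    def conv(sig):
        # convolve with the room impulse response, keep the original length
        return fftconvolve(sig, rir, mode="full")[: len(sig)]
    noise = z * (10.0 ** (-snr_db / 20.0))  # noise scaled to the target SNR
    return {
        "FS1": y + conv(x),                # echo
        "FS2": conv(y),                    # reverberation
        "FS3": y + noise,                  # noise
        "FS4": conv(x) + conv(y),          # echo + reverberation
        "FS5": y + conv(x) + noise,        # echo + noise
        "FS6": conv(y) + noise,            # reverberation + noise
        "FS7": conv(x) + conv(y) + noise,  # echo + reverberation + noise
    }
```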
In some embodiments, the multi-channel audio set includes any one of FS₁′, FS₂′, FS₃′, FS₄′, FS₅′, FS₆′ and FS₇′, where FSᵢ′ = |FSᵢ| × s for i = 1, …, 7; |FSᵢ| represents the amplitude of the far-field audio signal FSᵢ, s is the volume perturbation factor, and × represents multiplication.
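Read as an amplitude scaling, the volume perturbation step can be sketched as follows; the range for s is an assumption, since the patent only names s as the volume perturbation factor.

```python
import numpy as np

def perturb_volume(fs_signal: np.ndarray, rng=None) -> np.ndarray:
    # FS' = FS x s: scale the simulated signal by a random volume factor s
    rng = rng or np.random.default_rng()
    s = rng.uniform(0.5, 1.5)  # assumed range for the perturbation factor
    return fs_signal * s
```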
In a second aspect, some embodiments of the present application further provide an electronic device, including:
a second communicator for communication with the server of the first aspect;
the sound collector is used for collecting voice data input by a user;
a second controller for performing:
acquiring voice data acquired by the sound acquirer, wherein the voice data comprises a category identifier set by the sound acquirer, and the category identifier is used for indicating whether the voice data is of a near-field category or a far-field category;
and uploading the voice data to the server.
In a third aspect, some embodiments of the present application further provide a far-field speech data expansion method, including:
receiving voice data uploaded by electronic equipment, and judging the category of the voice data;
If the voice data is of far-field type, storing the voice data into a first database;
if the voice data is of a near field type, storing the voice data into a second database;
screening far-field sample data according to the first database, and/or simulating the far-field sample data according to near-field voice data in the second database or an open-source voice data set, wherein the far-field sample data is used for training a far-field voice processing model; wherein the open source speech data set comprises near field speech data acquired through other approaches;
the far field sample data is stored.
In a fourth aspect, some embodiments of the present application also provide a computer storage medium having stored therein program instructions which, when run on a computer, cause the computer to perform the methods involved in the above aspects and their respective implementations.
In embodiments of the present application, the server can receive voice data uploaded by a number of different electronic devices, identify the category of each piece of voice data, store far-field voice data in the first database and near-field voice data in the second database, thereby achieving classified storage according to voice data category. The first database accumulates a large amount of far-field voice data uploaded by the electronic devices and can keep growing, so the server can screen matching data from it as far-field sample data for model training, according to the application scenario and the training requirements of the model.
The second database stores a large amount of near-field voice data uploaded by the electronic devices and can likewise keep accumulating. Near-field voice data from the second database or from the open-source voice data set is used to simulate far-field voice data; on the one hand this makes full use of the second database, and on the other hand the server can obtain the near-field voice data needed for simulation through multiple channels. This adds an expansion route for far-field sample data, increases the speed at which far-field sample data are accumulated and expanded, removes the need to purchase data from suppliers, avoids problems such as far-field sample data not matching device acquisition channels, achieves coverage of different scenes and fields through online data expansion, and improves model training efficiency and accuracy. In particular, when the two expansion modes based on the first database and the second database operate together, the expansion speed of far-field sample data can be significantly increased.
Drawings
In order to more clearly illustrate some embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an operational scenario diagram of voice service processing provided by some embodiments of the present application;
FIG. 2 is a block diagram of a hardware configuration of an electronic device provided in some embodiments of the present application;
FIG. 3 is a block diagram of a software architecture configuration of a server and an electronic device provided in some embodiments of the present application;
FIG. 4 is a logic diagram of obtaining far field sample data from an online database according to some embodiments of the present application;
FIG. 5 is a flowchart of a far-field speech data expansion method A1 according to some embodiments of the present application;
FIG. 6 is a flowchart of a far-field speech data expansion method A2 according to some embodiments of the present application;
FIG. 7 is a schematic diagram of simulated far-field speech data according to some embodiments of the present application;
FIG. 8 is a flowchart of a far-field speech data expansion method B provided in some embodiments of the present application;
FIG. 9 is a flow chart illustrating the operation of a microphone array algorithm provided in some embodiments of the present application;
FIG. 10 is a flowchart illustrating the operation of an echo cancellation algorithm according to some embodiments of the present application.
Detailed Description
For purposes of clarity and completeness, exemplary implementations of the present application are described below with reference to the accompanying drawings in which they are illustrated. Obviously, the described exemplary implementations are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar or analogous objects or entities, and are not necessarily intended to describe a particular sequence or chronological order unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
A voice interaction scene may involve near-field voice or far-field voice. Near-field voice refers to voice interaction between a user and a sound collector at close range, for example, a user speaking a voice command into a handheld smartphone, or long-pressing the voice key of a remote control to input a voice command to a smart television. Far-field voice is voice interaction performed over a relatively long distance: for example, a user issues a voice command in a conference room, classroom or smart-home scene, a sound collector (for example, a microphone array) arranged in the scene captures the user's voice signal, and the voice system processes and responds to the signal, finally executing the program or action indicated by it.
To process far-field voice data, developers need to build an algorithm model for far-field voice, which requires a large amount of far-field voice data matching the application scenario, the electronic device and the microphone array; the model is trained with this far-field voice data so as to establish or improve the model and raise its operating accuracy. However, when far-field voice data is actually accumulated, the following problems arise:
Problem 1: if far-field voice data is collected with real devices, a great deal of time and labor is consumed in recording the data and annotating its text. Data collection in this way is slow and can affect the development efficiency and progress of the far-field voice algorithm.
Problem 2: if far-field voice data is purchased from suppliers, the purchased data may not match the application scenario, the device acquisition channel or the microphone array algorithm, and the amount of far-field voice data currently on the market is relatively small, insufficient to cover all application scenarios and fields.
To address these technical problems, embodiments of the present application provide two far-field voice data expansion schemes: first, pulling the server's online log audio and screening from it far-field voice data usable for model training; second, generating far-field voice data by simulation from the comparatively abundant near-field voice data. Either method can rapidly expand the far-field voice data available for model training without purchasing data from a supplier, and the expanded far-field voice data is adapted to the application scenario, the device acquisition channel, the microphone array algorithm and other requirements. The following embodiments describe several far-field voice data expansion methods in detail.
Fig. 1 is an operation scenario diagram of voice service processing provided in some embodiments of the present application. As shown in fig. 1, a server 100 and an electronic device 200 may be included in an operation scenario, where the electronic device 200 illustratively includes a smart television 200a, a mobile terminal 200b, a smart speaker 200c, and the like.
The server 100 and the electronic device 200 in the present application may perform data interaction through various communication manners. The electronic device 200 may establish a communication connection via a local area network (LAN), a wireless local area network (WLAN) or another network. The server 100 may provide the electronic device 200 with semantic parsing and intention recognition results, various business-related data and other content. For example, the electronic device 200 may exchange information and data with the server 100, receive software program updates, and the like.
The server 100 may be a server providing various services, such as a background server providing support for voice data collected by the electronic device 200. The server 100 may perform voice processing such as semantic analysis and intention recognition on the received voice data, and may feed back processing results (e.g., voice text, intention instructions, etc.) to the electronic device 200. The server 100 may also transmit corresponding service data (e.g., application data, media asset data, etc.) to the electronic device 200 in response to the service request of the electronic device 200.
In some embodiments, the server 100 may receive the voice data (including the near-field voice data and/or the far-field voice data) uploaded by the electronic device 200, and perform matching screening on the far-field voice data, or generate the far-field voice data by using near-field voice data simulation, so as to obtain far-field voice data adapted to the relevant scene, device and microphone array, thereby rapidly expanding training data usable by the far-field voice processing model.
The server 100 in the embodiment of the present application may be one server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The electronic device 200 may be a hardware device or a software device. When the electronic device 200 is a hardware device, it may be various electronic devices having a sound collection function, including but not limited to: household appliances such as intelligent televisions, intelligent refrigerators, intelligent air conditioners, intelligent sound boxes, intelligent mobile phones, tablet computers, electronic book readers, intelligent watches, intelligent game machines, computers, AI equipment, robots, intelligent vehicle-mounted terminal equipment and the like.
When the electronic device 200 is a software apparatus, at least one software functional module/service/model (e.g., a sound collection module, a voice service, a voice processing model, etc.) may be included, and the software apparatus may be applied to the above-listed hardware electronic device. The software means may be implemented as a plurality of software or software modules (e.g. for providing sound collection services) or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the far-field voice data expansion method provided in the embodiments of the present application may be executed by the server 100, or may be completed based on communication interaction between the server 100 and the electronic device 200.
Fig. 2 is a block diagram of a hardware configuration of an electronic device 200 according to an embodiment of the present application. As shown in fig. 2, the electronic device 200 may include, but is not limited to, at least one of a communicator 210, a detector 220, an external device interface 230, a controller 240, a display 250, an audio output interface 260, a user interface 270, a memory 280 and a power supply. The controller 240 may include a central processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th interfaces for input/output.
The display 250 includes a display screen component for presenting pictures and a driving component for driving image display, and is used for receiving image signals output from the controller 240 and displaying video content, image content, menu manipulation interfaces, and user manipulation UI interfaces. The display 250 may be a liquid crystal display, an OLED display, a projection device or a projection screen.
The communicator 210 is a component for communicating with the external device or the server 100 according to various communication protocol types. For example, communicator 210 may include: at least one of a Wifi module, a Bluetooth module, a wired Ethernet module and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The electronic device 200 may establish a communication connection with the server 100 through the communicator 210 to enable transmission and reception of control signals and data signals.
The user interface 270 may be used to receive external control signals, such as user operations based on user interface inputs.
The detector 220 may be used to collect signals from the external environment or signals of interaction with the outside. For example, the detector 220 may include a light receiver, a sensor for acquiring the intensity of ambient light; alternatively, the detector 220 may include an image collector, such as a camera, for collecting external environment scenes, user attributes or user interaction gestures; still alternatively, the detector 220 may include a sound collector, which may be an independent microphone for collecting near-field voice data, or a microphone array comprising a plurality of microphone units arranged in a certain layout for collecting far-field voice data from the external environment.
A microphone, also called a "mike" or "sound transducer", may be used to receive a user's voice and convert the sound signal into an electrical signal. The electronic device 200 may be provided with at least one microphone. In some embodiments, the electronic device 200 may be provided with two microphones, which, in addition to collecting sound signals, can implement a noise reduction function. The electronic device 200 may further be provided with three, four or more microphones forming a microphone array, to implement far-field sound signal collection, noise reduction, sound source identification, directional recording and other functions.
Further, the microphone may be built into the electronic device 200, or connected to it by wire or wirelessly (e.g., over a Bluetooth connection). Of course, the mounting position of the microphone on the electronic device 200 is not limited in the embodiments of the present application. Alternatively, the electronic device 200 may not include a microphone at all; it may be coupled to an external microphone via some interface (e.g., a USB interface), with the external microphone secured at any position on the electronic device 200 by an external mount (e.g., a clip-on microphone stand).
The controller 240 may control task execution of the electronic device 200 and respond to user operations or voice instructions through various software programs stored in the memory 280. The controller 240 is used to control the overall operation of the electronic device 200.
The controller 240 may include: at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a graphic processor (Graphics Processing Unit, GPU), a RAM (Random Access Memory, RAM), a ROM (Read-Only Memory), a first interface to an nth interface for input/output, a communication Bus (Bus), and the like.
The electronic device 200 may have different software configurations under different device types and operating systems. Fig. 3 is a software architecture configuration block diagram of a server and an electronic device according to some embodiments of the present application. Taking the electronic device 200 configured with an Android (Android) operating system as an example, as shown in fig. 3, the electronic device 200 may be logically divided into an application (Applications) layer (abbreviated as "application layer 21"), a kernel layer 22, and a hardware layer 23. The server 100 includes, but is not limited to, a communication control module 101, an intent recognition module 102, a data storage module 103, and a far field data augmentation module 104.
In some embodiments, as shown in fig. 3, the hardware layer 23 may include the communicator 210, the detector 220, the controller 240, the display 250, and the like of the example of fig. 2.
In some embodiments, as shown in FIG. 3, the application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 may include a voice application that may provide a voice interaction interface and related voice services, and the voice application may interact with the communication control module 101 to enable connection of the electronic device 200 with the server 100.
The kernel layer 22 acts as a software middleware between the hardware layer 23 and the application layer 21 for managing and controlling hardware resources and software resources.
In some embodiments, the kernel layer 22 includes a detector driver, which may include a microphone driver for sending voice data collected by the microphone to the voice application of the application layer 21. When the voice application in the electronic device 200 has been started and the electronic device 200 has established a communication connection with the server 100, the microphone driver sends the user-input voice data collected by the microphone in the detector 220 to the voice application. Thereafter, the voice application may send the voice data to the intent recognition module 102 in the server; the intent recognition module 102 is configured to input the voice data transmitted by the electronic device 200 into an intent recognition model, which may be a model configured with voice processing, intent recognition and other algorithms, and outputs the user intent indicated by the voice data, for example, "turn on the air conditioner".
In some embodiments, the server 100 may also build a first database and a second database, where the first database is used to store far-field voice data and the second database is used to store near-field voice data. The data in the two databases may be collected and uploaded by the many electronic devices in communication with the server, or may originate from other sources, such as near-field and far-field voice data recorded or purchased by the operator. In some embodiments, after the voice application uploads the voice data received by the electronic device 200 to the server 100, if the server 100 recognizes that the voice data is of the far-field category, the voice data is saved to the first database. In this way, the far-field data expansion module 104 can screen from the first database far-field voice data matching conditions such as the model, the scene and the signal-to-noise ratio, and the screened far-field voice data can serve as training data to train and optimize the model. This achieves rapid accumulation and expansion of the far-field voice data required for model training, improves the acquisition efficiency of far-field voice data, allows far-field voice data usable for model training to be screened by application scenario, device and other conditions, and thus improves the development efficiency and progress of the far-field voice algorithm as well as the accuracy of the model.
In some embodiments, after the voice application uploads the voice data received by the electronic device 200 to the server 100, if the server 100 recognizes that the voice data is of the near-field category, the voice data may be saved to the second database. Thus, when training a far-field voice processing model, the far-field data expansion module 104 may acquire near-field voice data from the second database, simulate far-field voice data from it, and use the simulated far-field voice data as training data to train and optimize the model. This likewise achieves rapid accumulation and expansion of the far-field voice data required for model training, improves acquisition efficiency, allows the simulation to be carried out according to the application scenario, the device, the channel conditions and other factors, and improves the accuracy of the model.
Note that the electronic device structure and voice interaction logic are not limited to the examples of fig. 1-3. The gist of the embodiment of the present application is that far-field speech data required for model training is accumulated and expanded quickly, without limiting what algorithms and training patterns the model uses, etc.
In some embodiments, referring to fig. 1, the sound collector of the smart tv 200a may be configured as a microphone array. If the user interacts with the smart tv 200a by voice, the microphone array collects the voice data A input by the user and sets a category identifier for it, the category identifier indicating the category (near field or far field) to which the voice data belongs. For example, the user says "small X, small X" to the smart tv 200a, where "small X" is the keyword for waking up the voice application of the smart tv 200a; the microphone array sets a far-field identifier for the voice data A in this scenario. After the voice application uploads the voice data A to the server 100, the server 100 reads the category identifier configured for the voice data A and recognizes that it is of the far-field category, and then saves the voice data A to the first database.
In some embodiments, the smart television 200a may also be communicatively coupled to a control device (e.g., a remote control) that may configure a microphone and voice keys. The control device responds to the operation of long-time pressing of voice keys by a user and controls the microphone to collect voice data B input by the user in a short distance; the control device responds to the operation of releasing the voice key by the user, controls the microphone to stop collecting voice data, sets a near field identifier for the collected voice data B, then sends the voice data B to the intelligent television 200a, and the intelligent television 200a forwards the voice data B to the server 100. The server 100 reads the category identifier configured by the voice data B and recognizes that the voice data B is of the near field type, and then may store the voice data B to the second database.
In some embodiments, referring to fig. 1, the sound collector of the mobile terminal apparatus 200b may include a microphone that may collect voice data C input by a user at a close distance and set a near field identifier for the voice data C, and then the mobile terminal apparatus 200b uploads the voice data C to the server 100. The server 100 reads the category identifier configured by the voice data C, recognizes that the voice data C is of a near field type, and stores the voice data C in the second database.
In some embodiments, referring to fig. 1, the sound collector of the smart speaker 200c may be configured as a microphone array for collecting voice data D input to the smart speaker 200c by a user over a relatively long distance, setting a far-field identification for the voice data D, and uploading the voice data D to the server 100. The server 100 reads the category identifier configured by the voice data D, recognizes that the voice data D is of far-field type, and stores the voice data D in the first database.
It follows that different voice interaction scenarios may trigger a near-field voice mode or a far-field voice mode. In the near-field voice mode, a sound collector structure adapted to near-field voice collection (e.g., a microphone) is used, and the sound collector itself sets a near-field identifier for the voice data. In the far-field voice mode, a sound collector structure adapted to far-field voice collection (e.g., a microphone array) is used, and the sound collector itself sets a far-field identifier for the voice data. Thus, when the server receives voice data, it reads the category identifier to distinguish far-field voice data from near-field voice data, and stores the voice data in the corresponding database.
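A minimal server-side sketch of this routing, assuming an upload payload that carries the collector-set category identifier; the field names and database interface are illustrative assumptions, not defined by the patent.

```python
FAR_FIELD = "far_field"
NEAR_FIELD = "near_field"

def store_voice_data(upload: dict, first_db, second_db) -> None:
    # "category" is the identifier set by the sound collector on the device
    category = upload["category"]
    if category == FAR_FIELD:
        first_db.insert(upload)   # microphone-array (far-field) recordings
    elif category == NEAR_FIELD:
        second_db.insert(upload)  # handheld / remote-control recordings
    else:
        raise ValueError(f"unknown voice data category: {category!r}")
```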
In some embodiments, the first database may also record audio information for each piece of far-field voice data, including but not limited to: device information, recording time, region information, scene information, audio duration, signal-to-noise ratio and the like. Thus, the first database aggregates the far-field voice data and its audio information uploaded to the server by many electronic devices, making this far-field voice data the online log audio of the server 100.
The device information is information about the electronic device that uploads the far-field voice data, including but not limited to the device type, device model and the like. The recording time is the time at which the user input the far-field voice data. The region information may include the location of the uploading electronic device; since region and language are correlated (different countries use different languages, and different regions have different dialects, for example, users in the UK commonly speak English, the Guangdong region commonly speaks Cantonese, and the Fujian region commonly speaks Minnan), the region information can also serve as one of the reference factors for training the model. The scene information indicates the scene or environment triggering the far-field voice, such as a conference room, classroom or home. The audio duration is the duration of the voice data. The signal-to-noise ratio (SNR) is the ratio of signal power to noise power; a larger SNR indicates that less noise is mixed into the signal and that the reproduced sound quality is higher.
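For reference, the SNR definition above corresponds, in decibels, to the following helper (an illustrative computation, assuming aligned numpy arrays for the signal and the noise):

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    # SNR (dB) = 10 * log10(signal power / noise power)
    p_signal = np.mean(signal.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)
```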
Fig. 4 is a logic diagram of acquiring far-field sample data from an online database according to some embodiments of the present application. Referring to fig. 4, the first mode of expanding far-field sample data may include several links: acquiring far-field voice data from the first database, audio preprocessing, and multi-interface verification (an optional link). The second mode of expanding far-field sample data includes acquiring near-field voice data from the second database or the open-source voice data set, and simulating far-field voice data from the near-field voice data so as to expand the far-field sample data. The open-source voice data set comprises near-field voice data for simulation acquired through other channels (not uploaded by electronic devices); for example, it may include near-field voice data downloaded by the server from a relevant voice repository, near-field voice data purchased from an operator, and the like.
In some embodiments, the second database and the open source speech data set may be provided as two separate data storage architectures. The second database is a database formed by online collection of near-field voice data uploaded by the electronic device, and the open-source voice data set is a data set formed by acquiring the near-field voice data through other ways (downloading, purchasing and the like).
In some embodiments, the server may integrate and fuse the second database and the open source voice data set, for example, store the open source voice data set in the second database, so that the second database may include near-field voice data obtained through different approaches and capable of being used for far-field audio simulation.
The expandable far-field sample data is obtained through the two expansion modes, and the far-field sample data can be used for training or optimizing a far-field voice algorithm model. Because the online log audio in the first database and the second database is sustainably and dynamically accumulated, the embodiment of the application can realize accumulation and expansion of far-field sample data for training a model.
It should be noted that the multi-interface verification link provided in the embodiments of the present application is an optional link, not a mandatory one. The links involved in the above scheme may be executed uniformly by the far-field data expansion module 104, or separately by sub-modules included in the far-field data expansion module 104 (for example, an audio acquisition sub-module, an audio preprocessing sub-module, an interface verification sub-module, etc.); this is not limited in the embodiments of the present application. Based on the processing logic illustrated in fig. 4, the following far-field voice data expansion methods A1 and A2 are provided.
Fig. 5 is a flowchart of a far-field speech data expansion method A1 according to some embodiments of the present application. Referring to fig. 4 and 5, far-field speech data expansion method A1 is performed by server 100, and in particular may be performed by far-field data expansion module 104, and includes:
step S51 (corresponding to pulling the online log audio link from the first database) acquires a first far-field speech data set meeting the first screening condition from the first database.
In some embodiments, the first filtering condition may include information to adapt a tag of an application scenario, a specific device, a language, etc. For example, the first filtering condition may include at least one of tag information such as device information, recording time, region information, and scene information. By means of the first screening condition, far-field voice data matched with the specific voice interaction environment can be initially acquired from the first database, and the collection of the far-field voice data is simply called a first far-field voice data set.
Step S52 (corresponding to the audio preprocessing link), acquiring target far-field voice data meeting the second screening condition from the first far-field voice data set.
In some embodiments, the second screening condition may include parameters relevant to training the model; for example, the second screening condition includes, but is not limited to, audio duration, signal-to-noise ratio and the like. Through the second screening condition, target far-field voice data Dataᵢ adapted both to the voice interaction environment and to the characteristics of the model to be trained can be accurately screened out, where i denotes the sequence number of the target far-field voice data, 1 ≤ i ≤ M, and M denotes the total number of screened target far-field voice data.
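A compact sketch of the two-stage screening in steps S51-S52, assuming each record in the first database is a dict carrying the audio information fields described earlier; field names and thresholds are illustrative assumptions.

```python
def screen_far_field_samples(records, first_cond, min_snr, max_duration):
    """Two-stage screening: tag match first, then duration/SNR thresholds."""
    # First screening condition: device / recording time / region tags.
    stage1 = [r for r in records
              if all(r.get(key) == value for key, value in first_cond.items())]
    # Second screening condition: target audio duration and SNR.
    return [r for r in stage1
            if r["duration"] <= max_duration and r["snr"] >= min_snr]
```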
And step S53, performing voice recognition on the target far-field voice data to obtain target text information.
In step S54, the target far-field speech data and the associated target text information are used as extensible far-field sample data, and the far-field sample data is saved.
In some embodiments, the server 100 may create a sample database associated with the model to be trained, and save the finally determined extensible far-field sample data to the sample database, so as to implement expansion and update of the sample database. In this way, the server can acquire available far-field sample data from the sample database, input the far-field sample data into the model to be trained, and realize training and algorithm optimization of the far-field speech processing model.
Fig. 5 illustrates an embodiment without the multi-interface verification step. After the target far-field voice data Dataᵢ is screened out, voice recognition is performed on Dataᵢ to convert the audio into text, obtaining the associated target text information Textᵢ; the expandable far-field sample data then comprises Dataᵢ and Textᵢ, 1 ≤ i ≤ M, and is stored in the sample database. This embodiment can expand far-field sample data matching the application scenario, device, region, model training and other conditions, achieving rapid accumulation and expansion of far-field sample data without purchasing it from suppliers, and avoiding problems such as far-field sample data not matching device acquisition channels or the microphone array algorithm.
Fig. 6 is a flowchart of a far-field speech data expansion method A2 according to some embodiments of the present application. Referring to fig. 4 and 6, far-field speech data expansion method A2 is performed by server 100, and in particular may be performed by far-field data expansion module 104, and includes:
step S61 (corresponding to pulling the online log audio link from the first database) acquires a first far-field speech data set meeting the first screening condition from the first database.
In some embodiments, the first filtering condition may include information to adapt a tag of an application scenario, a specific device, a language, etc. For example, the first filtering condition may include at least one of tag information such as device information, recording time, region information, and scene information. By means of the first screening condition, far-field voice data matched with the specific voice interaction environment can be initially acquired from the first database, and the collection of the far-field voice data is simply called a first far-field voice data set.
Step S62 (corresponding to the audio preprocessing link), acquiring target far-field voice data meeting the second screening condition from the first far-field voice data set.
In some embodiments, the second screening condition may include parameters relevant to training the model; for example, the second screening condition includes, but is not limited to, audio duration, signal-to-noise ratio and the like. Through the second screening condition, target far-field voice data Dataᵢ adapted both to the voice interaction environment and to the characteristics of the model to be trained can be accurately screened out, where i denotes the sequence number of the target far-field voice data, 1 ≤ i ≤ M, and M denotes the total number of screened target far-field voice data.
Step S63 (corresponding to the multi-interface verification link): calling N different voice recognition interfaces and performing voice recognition on the target far-field voice data respectively to obtain N pieces of target text information.
Each voice recognition interface has a bound voice recognition algorithm/model that recognizes voice data and converts audio into text. The server can therefore call existing voice recognition interfaces to recognize the target far-field voice data Dataᵢ, obtaining target text information Textᵢⱼ, where j denotes the sequence number of the voice recognition interface, 1 ≤ j ≤ N, N denotes the total number of called voice recognition interfaces, and N ≥ 2; Textᵢⱼ denotes the text information obtained after the j-th voice recognition interface performs voice recognition on the i-th target far-field voice data.
Step S64 (corresponding to the multi-interface verification link): verifying whether the N pieces of target text information are completely consistent.
In some embodiments, the far-field data expansion module 104 may verify whether the target text information Textᵢⱼ under the same value of i is identical, i.e., verify whether Textᵢ₁, Textᵢ₂, …, TextᵢN are completely consistent. Compared with the method illustrated in fig. 5, the purpose of adding the multi-interface verification link is to verify the accuracy of recognition of the target far-field voice data, thereby deciding whether the target far-field voice data can be used as far-field sample data for model training, and to filter out far-field voice data that is difficult to recognize, ambiguously recognized or invalidly recognized, so as to ensure the accuracy and reliability of model training and improve the operating precision of the model.
If the target text information Textᵢⱼ under the same value of i is verified to be completely consistent, step S65 is executed; otherwise, if the Textᵢⱼ are not completely consistent, e.g., all different or only partially identical, step S66 is executed.
Step S65 (corresponding to the multi-interface verification link), taking the target far-field voice data and the uniquely identified target text information as extensible far-field sample data, and storing the far-field sample data.
In some embodiments, the expandable far-field sample data comprises the target far-field voice data Dataᵢ and Textᵢ′, where Textᵢ′ is the uniquely identified text information of the i-th target far-field voice data, i.e., Textᵢ′ = Textᵢ₁ = Textᵢ₂ = … = TextᵢN. The far-field data expansion module 104 may save the expandable far-field sample data to the sample database.
Step S66 (corresponding to the multi-interface verification link) does not expand the target far-field speech data into far-field sample data.
If the target text information Textᵢⱼ under the same value of i is verified to be not completely consistent, the target far-field voice data is unsuitable as far-field sample data; it is ignored and not saved to the sample database.
For example, assume N = 2: the server calls voice recognition interface 1 and voice recognition interface 2 to recognize the target far-field voice data Data₁ respectively. The text Text₁₁ recognized by interface 1 is "play song", and the text Text₁₂ recognized by interface 2 is also "play song", i.e., Text₁₁ = Text₁₂; the target far-field voice data Data₁ and its text information Text₁′ ("play song") are therefore expanded into far-field sample data.
For another example, assume N = 3: the server calls voice recognition interfaces 1, 2 and 3 to recognize the target far-field voice data Data₄ respectively. The text Text₄₁ recognized by interface 1 is "eat tomato sirloin tonight", the text Text₄₂ recognized by interface 2 is "eat tomato milk tonight", and the text Text₄₃ recognized by interface 3 is "have eaten tomato sirloin", i.e., Text₄₁ ≠ Text₄₂ ≠ Text₄₃. Clearly, no unique text recognition result can be obtained for Data₄, so Data₄ is not expanded into far-field sample data. This ensures the data reliability of the sample database and the accuracy of model training.
This embodiment can expand far-field sample data matching the application scenario, device, region, model training and other conditions, achieving rapid accumulation and expansion of far-field sample data without purchasing it from suppliers, and avoiding problems such as far-field sample data not matching device acquisition channels or the microphone array algorithm. In addition, the added multi-interface verification link filters out target far-field voice data that is inaccurately recognized, guaranteeing the validity and reliability of the far-field sample data, which in turn guarantees the accuracy of far-field voice processing model training and improves the operating precision of the model.
Fig. 7 is a schematic diagram of simulating far-field voice data according to some embodiments of the present application. Referring to fig. 7, the far-field data expansion module 104 may configure a microphone array according to the actual device, application scenario and other factors. The microphone array comprises a set of microphone units located at different positions in space and arranged according to a certain topology (number, shape, distribution rule, etc.); it samples sound signals propagating in space, and the collected sound signals contain spatial position information. Microphone arrays include one-dimensional arrays (linear arrays), two-dimensional arrays (planar arrays) and three-dimensional arrays (volumetric arrays).
In some embodiments, the far-field data expansion module 104 may create a far-field simulated room (Room) and set the following simulation elements: the size of the far-field simulated room, and the positions of the microphone array, the sound source, the noise and the loudspeakers within it. The microphone array includes Q microphone units, Q > 1, forming Q sound collection channels (Q channels for short); each channel collects one audio signal in the far-field simulated room, and the audio signal collected by each channel is different. In the example of fig. 7, Q = 3.
In some embodiments, according to the simulation requirements, the far-field data expansion module 104 may acquire near-field voice data from the second database or the open-source voice data set as the sound source signal, i.e., play the near-field voice data at the sound source position, so as to simulate far-field voice data in the room.
In some embodiments, according to the simulation requirements, the far-field data expansion module 104 may apply a noise signal at the noise position according to the target signal-to-noise ratio, so that noise is mixed into the sound source signal and the sound signal collected by any of the Q channels contains the noise signal, thereby simulating a noisy scene environment. The noise signal may include at least one of several noise types, such as white noise, human voices and music noise.
In some embodiments, according to the simulation requirements, the far-field data expansion module 104 may control the loudspeaker to play audio, which may serve as an echo signal. For example, in a home scene where the television speaker is playing program audio while the user issues a voice command to the television, the scene contains both the sound source signal (the user's voice) and the sound played by the speaker; for this or similar scenes, the player can be controlled to play audio in the simulated room.
When sound waves propagate, they encounter obstacles in the far-field simulated room: part of the sound is absorbed by the obstacles, and another part is reflected by them. Depending on the internal structure of the room, sound waves may be reflected and absorbed many times, and reverberation is produced when the reflected sound overlaps the direct sound. The obstacles are, for example, the ceiling, the walls, and the floor. The far-field data expansion module 104 may invoke a toolkit configured with an indoor acoustic simulation algorithm (e.g., pyroomacoustics) to simulate sound propagation in the far-field simulated room and generate an impulse response rir based on the reflections inside the room; the impulse response rir contains the reverberation characteristics of the far-field simulated room.
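As a concrete illustration, the following is a minimal sketch of building such a far-field simulated room and computing its impulse responses with pyroomacoustics; the room size, absorption coefficient, and all positions are illustrative assumptions rather than values fixed by this application.

```python
# Minimal sketch of the far-field room simulation (illustrative values throughout).
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(fs * 2)  # stand-in for near-field speech from the second database

# Create the far-field simulated room as a 5 m x 4 m x 3 m shoebox.
room = pra.ShoeBox(
    [5.0, 4.0, 3.0], fs=fs,
    materials=pra.Material(0.3),  # assumed wall energy absorption
    max_order=10,                 # reflection order of the image-source model
)

# Sound source position: where the near-field speech is "played".
room.add_source([1.0, 2.0, 1.5], signal=speech)

# A Q = 3 linear microphone array; each column is one microphone unit.
mic_locs = np.c_[[3.5, 2.0, 1.0], [3.55, 2.0, 1.0], [3.6, 2.0, 1.0]]
room.add_microphone_array(mic_locs)

room.compute_rir()  # impulse response rir for every (microphone, source) pair
room.simulate()     # propagate the source signal through the room
multi_channel = room.mic_array.signals  # shape (Q, n_samples): the multi-channel audio set
```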
In some embodiments, based on the impulse response rir, the sound source signal y, the playing signal x of the speaker, the noise signal z, and so on, the far-field data expansion module 104 may obtain a multi-channel audio set including the audio signals collected by the Q channels. The far-field data expansion module 104 may adjust the simulation elements in the far-field simulated room according to the simulation requirements, thereby configuring whether the audio signal collected by a channel contains echo, reverberation, and/or noise. The audio signals collected by the Q channels in the multi-channel audio set are then fused into single-channel audio, which is stored in the sample database as far-field sample data. It should be noted that fig. 7 is only a schematic illustration and does not limit real simulation situations.
Fig. 8 is a flowchart of a far-field speech data expansion method B according to some embodiments of the present application. Based on the simulation principle of the example of fig. 7, referring to fig. 8, far-field speech data expansion method B is performed by the server 100, and in particular may be performed by the far-field data expansion module 104, and the method includes:
Step S81, setting the topology of the microphone array.

Step S82, creating a far-field simulated room of the target size.

Step S83, setting the positions of the microphone array, the sound source, the sound player, and the noise in the far-field simulated room. The sound player may be a speaker or the like.
Step S84, acquiring near-field voice data from the second database or the open-source voice data set, setting the near-field voice data as a sound source, and playing a sound source signal at the sound source position.
Step S85, setting the sound environment in the far-field simulated room, and simulating far-field audio signals to obtain a multi-channel audio set.
In some embodiments, the far-field data augmentation module 104 may control the speaker to play target audio and simulate a far-field audio signal FS1 containing echo. The calculation of FS1 is exemplified as follows: FS1 = y + x*rir, where y represents the sound source signal collected by the microphone array (a Q-channel signal), x represents the playing signal of the speaker (which can be regarded as the echo signal), rir represents the impulse response of the far-field simulated room, and * represents the convolution operation.

In some embodiments, the far-field data augmentation module 104 may simulate a far-field audio signal FS2 containing reverberation. The calculation of FS2 is exemplified as follows: FS2 = y*rir. Since the reverberation characteristics of the far-field simulated room are contained in rir, a far-field audio signal containing reverberation can be simulated by convolving the sound source signal with the impulse response.

In some embodiments, the far-field data augmentation module 104 may apply a noise signal at the noise position and simulate a far-field audio signal FS3 containing noise according to a target signal-to-noise ratio. The calculation of FS3 is exemplified as follows: FS3 = y + z*(10^(-SNR/20)), where z represents the noise signal, SNR represents the target signal-to-noise ratio, and ^ represents the power operation. The target signal-to-noise ratio is not limited in the embodiments of the present application; the far-field data expansion module 104 may set or adjust the target SNR according to the simulation requirements.
In some embodiments, the sound source signal may be mixed with any one of echo, reverberation, and noise, and may also be mixed with a combination of any multiple of echo, reverberation, and noise.
In some embodiments, the far-field data augmentation module 104 may simulate a far-field audio signal FS4 containing both echo and reverberation. The calculation of FS4 is exemplified as follows: FS4 = x*rir + y*rir.

In some embodiments, the far-field data augmentation module 104 may simulate a far-field audio signal FS5 containing both echo and noise. The calculation of FS5 is exemplified as follows: FS5 = y + x*rir + z*(10^(-SNR/20)).

In some embodiments, the far-field data augmentation module 104 may simulate a far-field audio signal FS6 containing both reverberation and noise. The calculation of FS6 is exemplified as follows: FS6 = y*rir + z*(10^(-SNR/20)).

In some embodiments, the far-field data augmentation module 104 may simulate a far-field audio signal FS7 containing echo, reverberation, and noise at the same time. The calculation of FS7 is exemplified as follows: FS7 = x*rir + y*rir + z*(10^(-SNR/20)).
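For reference, the FS1-FS7 combinations could be computed along the following lines. This is a sketch under the assumptions that y, x, and z are time-aligned single-channel arrays at the same sampling rate (the same formula applies per channel) and that y and z are pre-normalized to comparable power, which the 10^(-SNR/20) scaling implicitly requires; the helper name is ours.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_far_field(y, rir, x=None, z=None, snr_db=None, reverb_source=True):
    """Sketch of the FS1-FS7 combinations: y = source, x = speaker echo, z = noise."""
    n = len(y)
    out = fftconvolve(y, rir)[:n] if reverb_source else y.astype(float).copy()  # y*rir or y
    if x is not None:
        out = out + fftconvolve(x, rir)[:n]        # echo path x*rir
    if z is not None and snr_db is not None:
        out = out + z[:n] * 10 ** (-snr_db / 20)   # noise scaled by 10^(-SNR/20)
    return out

# Assuming y, x, z, and rir are defined:
# FS1 = mix_far_field(y, rir, x=x, reverb_source=False)   # echo only
# FS2 = mix_far_field(y, rir)                              # reverberation only
# FS7 = mix_far_field(y, rir, x=x, z=z, snr_db=10)         # echo + reverberation + noise
```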
In some embodiments, the far-field data expansion module 104 may set a volume perturbation coefficient s according to the simulation requirements and use it to apply a random volume perturbation to the simulated far-field audio signal, where the simulated far-field audio signal includes at least one of FS1, FS2, FS3, FS4, FS5, FS6, and FS7.
In some embodiments, after the far-field data augmentation module 104 applies a random volume perturbation to FS1, the far-field audio signal FS1′ is obtained: FS1′ = |FS1| × s = (y + x*rir) × s, where |FS1| represents the amplitude of FS1 and × represents multiplication.

In some embodiments, perturbing FS2 yields FS2′ = |FS2| × s = (y*rir) × s, where |FS2| represents the amplitude of FS2.

In some embodiments, perturbing FS3 yields FS3′ = |FS3| × s = [y + z*(10^(-SNR/20))] × s, where |FS3| represents the amplitude of FS3.

In some embodiments, perturbing FS4 yields FS4′ = |FS4| × s = (x*rir + y*rir) × s, where |FS4| represents the amplitude of FS4.

In some embodiments, perturbing FS5 yields FS5′ = |FS5| × s = [y + x*rir + z*(10^(-SNR/20))] × s, where |FS5| represents the amplitude of FS5.

In some embodiments, perturbing FS6 yields FS6′ = |FS6| × s = [y*rir + z*(10^(-SNR/20))] × s, where |FS6| represents the amplitude of FS6.

In some embodiments, perturbing FS7 yields FS7′ = |FS7| × s = [x*rir + y*rir + z*(10^(-SNR/20))] × s, where |FS7| represents the amplitude of FS7.
In some embodiments, the multi-channel audio set includes any of FS1′, FS2′, FS3′, FS4′, FS5′, FS6′, and FS7′, depending on the scenario and simulation requirements.
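A short sketch of the random volume perturbation follows; the sampling range for s is an illustrative assumption. Since FS′ = |FS| × s expands, per the formulas above, to scaling the mixed waveform by s, the sketch scales the waveform directly.

```python
import numpy as np

rng = np.random.default_rng()

def perturb_volume(fs_sig, s_range=(0.3, 1.5)):
    """Apply a random volume perturbation coefficient s to a simulated signal."""
    s = rng.uniform(*s_range)  # volume perturbation coefficient s (assumed range)
    return fs_sig * s          # scale the waveform amplitude by s
```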
It should be noted that the topology of the microphone array, the structure and size of the far-field simulated room, the sound environment in the room, the simulation elements, the noise types, and the like are not limited to the examples of the embodiments of the present application; different settings of these aspects yield different multi-channel audio sets, and they may be chosen as appropriate according to the application scenario, field, device, and other factors.
In step S86, the multi-channel audio set is converted into single-channel speech data using a microphone array algorithm.
In some embodiments, different microphone array topologies correspond to different algorithms, so the microphone array used in the simulation scene may be set based on the far-field speech algorithm to be developed. When step S86 is performed, the multi-channel audio set is fused using the microphone array algorithm matched with the far-field speech algorithm and the microphone array, so that the audio signals collected by the Q channels form a target sound source of one beam (i.e., a single channel).
Fig. 9 is a flowchart illustrating operations of a microphone array algorithm according to some embodiments of the present application. Referring to fig. 9, the microphone array algorithm may include, but is not limited to, the following: echo cancellation, sound source localization, beam forming and noise suppression.
In some embodiments, far field data augmentation module 104 may include an echo cancellation sub-module, a sound source positioning sub-module, a beam forming sub-module, and a noise suppression sub-module. The individual links involved in the microphone array algorithm are performed by these sub-modules.
In some embodiments, in two-way communication, the sound played by the loudspeaker at the near end propagates in the room, is picked up by the near-end microphone, and is transmitted back to the far end, so the far-end speaker hears his own voice, thereby producing an echo. The echo cancellation submodule is used to perform echo cancellation on the audio signals collected by the Q channels respectively. The algorithm used by the echo cancellation submodule is not limited and may refer to related echo cancellation techniques; for example, a linear echo cancellation algorithm may be used.
Fig. 10 is a flowchart illustrating the operation of an echo cancellation algorithm according to some embodiments of the present application. Referring to fig. 10, the echo cancellation submodule may perform several links: delay estimation, linear echo cancellation, double-talk detection, and nonlinear echo suppression. The delay estimation link eliminates audio delay and achieves audio alignment; the linear echo cancellation link cancels the linear echo in the audio signal; nonlinear echo suppression is then performed once, followed by double-talk detection, and finally nonlinear echo suppression is performed again using the double-talk detection result, so that the nonlinear echo is cancelled. Assuming the number of channels collected by the microphone array is c, the playing signal of the speaker is set as the reference signal y(c, n), and the microphone signal is x(c, n); the echo cancellation submodule may perform the following steps:
Step A, frequency-domain transformation: the reference signal y(c, n) and the microphone signal x(c, n) in the time domain are transformed into frequency-domain features Y(c, k) and X(c, k) by Fourier transform.
Step B, time delay estimation: by calculating the frequency-domain correlation of Y(c, k) and X(c, k), the audio delay is eliminated and audio alignment is achieved.
In some embodiments, the starting point of the delay estimation is signal correlation: the inverse Fourier transform of the cross-spectrum is the correlation function R(m) = F^(-1){Y(k)·X*(k)}, where X*(k) denotes the complex conjugate. If the real time delay between the reference signal and the microphone signal is τ, the correlation function takes its maximum value at m = τ. The delay estimation algorithm may refer to the related art and is not described in detail in the embodiments of the present application.
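A sketch of correlation-based delay estimation consistent with the formula above; the FFT length and wrap-around handling are implementation choices, not requirements of this application.

```python
import numpy as np

def estimate_delay(ref, mic):
    """Estimate the lag of mic relative to ref from the cross-correlation peak."""
    n = len(ref) + len(mic)                                 # zero-padded FFT length
    R = np.fft.rfft(mic, n) * np.conj(np.fft.rfft(ref, n))  # cross-spectrum Y(k)·X*(k)
    r = np.fft.irfft(R, n)                                  # correlation function R(m)
    lag = int(np.argmax(np.abs(r)))
    return lag - n if lag > n // 2 else lag                 # map wrapped lags to negative

# aligned = np.roll(mic, -estimate_delay(ref, mic))  # rough alignment of the two signals
```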
Step C, linear echo cancellation: linear filtering is carried out on X(c, k) to obtain the signal E(c, k) after the linear echo is eliminated.
In some embodiments, the linear echo cancellation algorithm is not limited; for example, an NLMS (Normalized Least Mean Square) algorithm with a variable-step dual-filter structure may be used, comprising a foreground filter and a background filter. The background filter updates its weights in the conventional adaptive-filtering manner, but its error is not used as the filter output; the foreground filter copies the weights of the background filter under certain conditions, and its error signal is used as the filter output, which ensures convergence of the filter under double-talk conditions. The linear echo cancellation algorithm may refer to the related art and is not described in detail in the embodiments of the present application.
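For orientation, a single-filter NLMS core is sketched below; the dual foreground/background structure described above adds a weight-copy policy on top of this update, and the filter length and step size are illustrative assumptions.

```python
import numpy as np

def nlms_echo_cancel(ref, mic, L=256, mu=0.5, eps=1e-8):
    """Core NLMS update: adaptively model the echo path and subtract the estimate."""
    w = np.zeros(L)                          # adaptive weights (echo path estimate)
    e = np.zeros(len(mic))                   # error signal = echo-cancelled output
    for n in range(L, len(mic)):
        u = ref[n - L:n][::-1]               # most recent L reference samples
        e[n] = mic[n] - w @ u                # subtract the estimated echo
        w += mu * e[n] * u / (u @ u + eps)   # normalized step-size weight update
    return e
```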
Step D, a gain factor g is calculated, and residual echo cancellation is performed on E(c, k) using the gain factor g. The residual echo includes the nonlinear echo, so nonlinear echo suppression is achieved.
Step E, double-talk detection is performed on the cancellation result obtained in step D to obtain the detection result dt_flag.
In some embodiments, double-talk detection is based on the estimated echo Echo(c, k), the echo-cancelled signal E(c, k), and the microphone signal X(c, k). If the estimated echo Echo(c, k) and the microphone signal X(c, k) are weakly correlated, double talk is considered more probable; likewise, if the microphone signal X(c, k) and the echo-cancelled signal E(c, k) are strongly correlated, double talk is considered more probable. The threshold used in the correlation decision can be set according to the situation. The double-talk detection algorithm may refer to the related art and is not described in detail in the embodiments of the present application.
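A sketch of the correlation test just described; the threshold value is an assumption and the function names are ours.

```python
import numpy as np

def double_talk_flag(mic, echo_est, residual, thresh=0.6):
    """Return dt_flag per the correlation criteria above (thresh is assumed)."""
    def ncorr(a, b):
        return abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    weak_echo_corr = ncorr(mic, echo_est) < thresh    # mic poorly explained by the echo
    strong_res_corr = ncorr(mic, residual) > thresh   # near-end speech survives cancellation
    return weak_echo_corr or strong_res_corr
```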
Step F, the gain factor g′ is recalculated according to dt_flag, and residual echo cancellation is performed on E(c, k) using the gain factor g′.
In some embodiments, a residual echo signal may be estimated from Echo(c, k) and dt_flag, the gain factor g′ is calculated from the residual echo signal, and the residual nonlinear echo in E(c, k) is eliminated. The nonlinear echo suppression algorithm may refer to the related art and is not described in detail in the embodiments of the present application.
Step G, an inverse Fourier transform is performed on the cancellation result of step F, thereby obtaining the signal e(c, n) in the time domain. This completes the operation of the echo cancellation submodule.
In some embodiments, the sound source positioning sub-module performs sound source localization on the audio signals collected by the Q channels through a sound source localization algorithm, including locating the angle (azimuth, pitch, etc.) and distance of the target speaker at the sound source, so as to track the target speaker. In consumer-grade microphone arrays, the direction of arrival (DOA) of the sound source is often of primary interest. The sound source localization algorithm is not limited; for example, MVDR (Minimum Variance Distortionless Response) or SRP-PHAT (Steered Response Power - Phase Transform) may be used.
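As one possibility, pyroomacoustics also ships DOA estimators (including SRP-PHAT and MUSIC). The sketch below assumes the `fs`, `mic_locs`, and `multi_channel` variables from the earlier room-simulation sketch; the STFT size and frequency range are illustrative.

```python
import numpy as np
import pyroomacoustics as pra

nfft = 512
# STFT per channel, stacked to shape (Q, nfft // 2 + 1, n_frames).
X = np.array([pra.transform.stft.analysis(ch, nfft, nfft // 2).T
              for ch in multi_channel])

doa = pra.doa.algorithms['SRP'](mic_locs, fs, nfft, c=343.0, num_src=1)
doa.locate_sources(X, freq_range=[300, 3500])
azimuth_deg = np.degrees(doa.azimuth_recon)  # estimated direction of arrival
```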
In some embodiments, the beam forming sub-module performs spatial filtering on the signals and combines the audio signals collected by the Q channels into the audio of one beam (abbreviated as single-channel voice data), so as to suppress signals from non-target directions and enhance signals from the target direction, realizing focused pickup in a specific direction and improving the SINR (Signal to Interference plus Noise Ratio) of the single-channel voice data, which also contributes to noise reduction. The beamforming algorithm is not limited; for example, GSC (Generalized Sidelobe Canceller) or MVDR may be used.
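For intuition, the simplest beamformer is delay-and-sum, sketched below; GSC and MVDR build adaptive sidelobe cancellation and covariance-based weighting on top of this idea. The integer-sample circular shift is a simplification.

```python
import numpy as np

def delay_and_sum(signals, delays_s, fs):
    """Align each channel by its steering delay and average into one beam."""
    out = np.zeros(signals.shape[1])
    for ch, d in zip(signals, delays_s):
        out += np.roll(ch, -int(round(d * fs)))  # circular shift as a simplification
    return out / len(signals)
```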
In some embodiments, the noise suppression sub-module is configured to suppress the noise in the single-channel voice data and achieve further noise reduction, thereby providing the sample database with purer far-field sample data. The noise suppression algorithm used by the noise suppression submodule is not limited; for example, an RNN (Recurrent Neural Network) based noise suppression algorithm may be used.
In some embodiments, referring to fig. 9, assume the microphone array includes Q microphone units Mic1, Mic2, …, MicQ. Echo cancellation is performed on the audio signals collected by Mic1, Mic2, …, MicQ respectively, and sound source localization is performed on the collected audio signals respectively; beamforming is then performed on the Q channels of echo-cancelled, source-localized audio signals to obtain single-channel voice data; finally, noise suppression is applied to the single-channel voice data, yielding far-field voice data usable for model training. The microphone array algorithm is not limited to the examples of the embodiments of the present application.
In step S87, the single-channel voice data and its text information are used as expandable far-field sample data, and the far-field sample data is saved.
In some embodiments, the far-field data expansion module 104 may perform multi-interface verification on the single-channel voice data: N voice recognition interfaces are called to perform voice recognition on the single-channel voice data respectively, obtaining N pieces of text information; if the N pieces of text information are completely consistent, the single-channel voice data and its uniquely identified text information are taken as far-field sample data.
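A sketch of this verification step; the recognizer callables are hypothetical stand-ins for the N speech recognition interfaces.

```python
from typing import Callable, Optional, Sequence

def verify_sample(audio: bytes,
                  recognizers: Sequence[Callable[[bytes], str]]) -> Optional[str]:
    """Keep a sample only if all N interfaces return the same transcript."""
    texts = [recognize(audio) for recognize in recognizers]
    return texts[0] if len(set(texts)) == 1 else None  # None means: discard the sample
```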
The far-field voice data expansion method B can make full use of the near-field voice data stored in the second database. By constructing a far-field simulated room, the in-room sound environment (including sound propagation, reflection, noise, echo, reverberation, and other characteristics) can be set according to different model training requirements, application scenarios, and other aspects, so that far-field audio can be simulated; the multi-channel audio set is then converted into purer single-channel voice data using the matched microphone array algorithm.
Some embodiments of the present application further provide an electronic device, which may include, but is not limited to, a sound collector for collecting voice data (of the near-field or far-field type), a communicator for communicating with a server, and a controller for uploading the voice data collected by the sound collector to the server through the communicator. After receiving the voice data uploaded by the electronic device, the server identifies the category of the voice data: if the voice data is far-field voice data, it is stored in the first database so that the server executes far-field voice data expansion method A1 or far-field voice data expansion method A2; if the voice data is near-field voice data, it is stored in the second database so that the server executes far-field voice data expansion method B.
In some embodiments, far-field speech data extension method A1 and far-field speech data extension method A2 are defined collectively as a first extension mode, and far-field speech data extension method B is defined as a second extension mode. Wherein the first expansion mode includes any one of a far-field voice data expansion method A1 and a far-field voice data expansion method A2.
In some embodiments, the server may configure either of the first expansion mode and the second expansion mode, or the server may configure both the first expansion mode and the second expansion mode.
In some embodiments, if the server configures the first expansion mode and the second expansion mode simultaneously, the server side may set a mode switch whose state indicates the expansion mode currently enabled by the server.
In some embodiments, for example, when the mode switch is in an off state, the first expansion mode may be turned on by default, and the second expansion mode may be turned off; when the mode switch is in an on state, the second expansion mode can be started by default, and the first expansion mode is closed.
In some embodiments, for example, when the mode switch is in an off state, the first expansion mode may be turned off by default, and the second expansion mode may be turned on; when the mode switch is in an on state, the second expansion mode can be closed by default, and the first expansion mode is opened. It should be noted that, the selection manner of the expansion mode is not limited to the examples of the embodiments of the present application.
In some embodiments, the server may enable the first expansion mode and the second expansion mode simultaneously, i.e., the two expansion modes operate together, thereby enabling the expansion of the far-field sample data to be faster.
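Conceptually, enabling both modes side by side could look like the following sketch; `run_first_mode` and `run_second_mode` are hypothetical entry points standing in for methods A1/A2 and method B.

```python
import threading

def run_first_mode():
    """Hypothetical: screen far-field samples from the first database (method A1/A2)."""

def run_second_mode():
    """Hypothetical: simulate far-field samples from near-field data (method B)."""

def run_expansion(first_enabled: bool = True, second_enabled: bool = True):
    """Run the screening mode and the simulation mode concurrently."""
    jobs = []
    if first_enabled:
        jobs.append(threading.Thread(target=run_first_mode))
    if second_enabled:
        jobs.append(threading.Thread(target=run_second_mode))
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
```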
Some embodiments of the present application also provide a computer storage medium, which may store a program. When the computer storage medium is configured in the server, the program may include program steps included in any one of the far-field speech data expansion method A1, the far-field speech data expansion method A2, and the far-field speech data expansion method B in the above embodiments when executed. The computer storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the disclosure and to enable others skilled in the art to best utilize the embodiments.

Claims (11)

1. A server, comprising:
a first communicator for communication with the electronic device;
a first controller for performing:
receiving voice data uploaded by electronic equipment, and judging the category of the voice data;
if the voice data is of far-field type, storing the voice data into a first database;
if the voice data is of a near field type, storing the voice data into a second database;
screening far-field sample data according to the first database, and/or simulating the far-field sample data according to near-field voice data in the second database or an open-source voice data set, wherein the far-field sample data is used for training a far-field voice processing model; wherein the open source speech data set comprises near field speech data acquired through other approaches;
The far field sample data is stored.
2. The server of claim 1, wherein the first controller screens far field sample data from the first database, comprising:
acquiring a first far-field voice data set meeting a first screening condition from the first database, wherein the first screening condition comprises equipment information, recording time and region information of target equipment;
acquiring target far-field voice data meeting second screening conditions from the first far-field voice data set, wherein the second screening conditions comprise target audio duration and target signal-to-noise ratio;
performing voice recognition on the target far-field voice data to obtain target text information;
and expanding the target far-field voice data and the target text information into the far-field sample data.
3. The server of claim 1, wherein the first controller screens far field sample data from the first database, comprising:
acquiring a first far-field voice data set meeting a first screening condition from the first database, wherein the first screening condition comprises equipment information, recording time and region information of target equipment;
Acquiring target far-field voice data meeting second screening conditions from the first far-field voice data set, wherein the second screening conditions comprise target audio duration and target signal-to-noise ratio;
calling N different voice recognition interfaces, and respectively performing voice recognition on the target far-field voice data to obtain N pieces of target text information; N is the number of voice recognition interfaces called, and N is greater than 1;
and if the N pieces of target text information are completely consistent, expanding the target far-field voice data and the uniquely identified target text information into the far-field sample data.
4. The server of claim 1, wherein the first controller simulates the far field sample data from near field speech data in the second database, comprising:
creating a far-field simulated room, and setting a topological structure of a microphone array;
setting the positions of the microphone array, the sound source, the sound player and the noise in the far-field simulated room;
acquiring near-field voice data from the second database or the open-source voice data set, setting the near-field voice data as a sound source signal, and playing the sound source signal at the sound source position;
setting the sound environment in the far-field simulated room, and simulating far-field audio signals to obtain a multi-channel audio set;
converting the multi-channel audio set into single-channel voice data by utilizing a microphone array algorithm;
and expanding the single-channel voice data and the text information thereof into the far-field sample data.
5. The server of claim 4, wherein the first controller sets the far-field simulated room sound environment and simulates far-field audio signals, comprising:
controlling the sound player to play target audio and simulating a far-field audio signal FS1 containing echo;

FS1 = y + x*rir; wherein y represents the sound source signal collected by the microphone array, x represents the echo signal played by the sound player, rir represents the impulse response of the far-field simulated room, and * represents the convolution operation.
6. The server of claim 5, wherein the first controller sets the far-field simulated room sound environment and simulates a far-field audio signal, comprising:
simulating a far-field audio signal FS2 containing reverberation, FS2 = y*rir.
7. The server of claim 6, wherein the first controller sets the far-field simulated room sound environment and simulates a far-field audio signal, comprising:
applying a noise signal at the noise position and simulating a far-field audio signal FS3 containing noise;

FS3 = y + z*(10^(-SNR/20)); wherein z represents the noise signal, SNR represents the target signal-to-noise ratio, and ^ represents the power operation.
8. The server of claim 7, wherein the first controller sets the far-field simulated room sound environment and simulates a far-field audio signal, comprising:
simulating a far-field audio signal FS4 containing both echo and reverberation, FS4 = x*rir + y*rir;

and/or simulating a far-field audio signal FS5 containing both echo and noise, FS5 = y + x*rir + z*(10^(-SNR/20));

and/or simulating a far-field audio signal FS6 containing both reverberation and noise, FS6 = y*rir + z*(10^(-SNR/20));

and/or simulating a far-field audio signal FS7 containing echo, reverberation, and noise at the same time, FS7 = x*rir + y*rir + z*(10^(-SNR/20)).
9. The server of claim 8, wherein the multi-channel audio set comprises any one of FS1′, FS2′, FS3′, FS4′, FS5′, FS6′, and FS7′;

FS1′ = |FS1| × s; FS2′ = |FS2| × s; FS3′ = |FS3| × s; FS4′ = |FS4| × s; FS5′ = |FS5| × s; FS6′ = |FS6| × s; FS7′ = |FS7| × s;

wherein |FS1| through |FS7| represent the amplitudes of the far-field audio signals FS1 through FS7 respectively, s is the volume perturbation coefficient, and × represents multiplication.
10. An electronic device, comprising:
a second communicator for communication connection with the server of any one of claims 1 to 9;
the sound collector is used for collecting voice data input by a user;
a second controller for performing:
acquiring voice data acquired by the sound acquirer, wherein the voice data comprises a category identifier set by the sound acquirer, and the category identifier is used for indicating whether the voice data is of a near-field category or a far-field category;
and uploading the voice data to the server.
11. A far-field speech data augmentation method, comprising:
receiving voice data uploaded by electronic equipment, and judging the category of the voice data;
if the voice data is of far-field type, storing the voice data into a first database;
if the voice data is of a near field type, storing the voice data into a second database;
screening far-field sample data according to the first database, and/or simulating the far-field sample data according to near-field voice data in the second database or an open-source voice data set, wherein the far-field sample data is used for training a far-field voice processing model; wherein the open source speech data set comprises near field speech data acquired through other approaches;
The far field sample data is stored.