CN110634498A - Voice processing method and device - Google Patents

Voice processing method and device

Info

Publication number
CN110634498A
Authority
CN
China
Prior art keywords
sound
microphone array
signal
recording range
objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810603522.9A
Other languages
Chinese (zh)
Inventor
田彪
余涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810603522.9A priority Critical patent/CN110634498A/en
Publication of CN110634498A publication Critical patent/CN110634498A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The embodiments of the present application disclose a voice processing method and device. The method comprises the following steps: acquiring at least one environmental signal of a microphone array used for recording a speech signal; processing the environmental signal to determine environment information of the microphone array; and performing speech front-end processing on the recorded speech signal according to the environment information. With this technical scheme, environmental factors around the microphone array can be fused into the speech front-end processing, improving the efficiency and accuracy of the front-end processing stage and, in turn, the accuracy of subsequent speech recognition and semantic understanding.

Description

Voice processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing speech.
Background
In recent years, speech recognition technology has advanced significantly and has gradually moved from the laboratory to the market. Speech recognition is now widely used in fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. The technical fields involved in speech recognition are complex, including signal processing, pattern recognition, probability and information theory, speech production and auditory mechanisms, artificial intelligence, and so on.
The speech recognition process includes a front-end processing stage, in which the original speech signal is processed before speech features are extracted, so as to partially eliminate the influence of noise and of different speakers, letting the processed signal better reflect the essential features of speech. The most common front-end processing techniques in the prior art are endpoint detection and speech enhancement. Endpoint detection distinguishes speech from non-speech segments in the recorded signal so as to accurately determine the starting point and ending point of the speech. After endpoint detection, subsequent processing can be applied to the speech signal only, which is important for the accuracy and efficiency of the speech recognition model. The main task of speech enhancement is to eliminate the influence of environmental noise on the speech signal; the usual approach is to filter the signal by means such as Wiener filtering or Kalman filtering.
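To make the endpoint detection idea concrete, the following is a minimal short-time-energy detector in Python; the frame length, threshold, and test tone are illustrative assumptions rather than values from any particular system.

```python
import numpy as np

def detect_endpoints(signal, frame_len=160, energy_thresh=0.01):
    """Return (start, end) sample indices of the detected speech segment
    using short-time energy; a real detector would add zero-crossing
    rate and hangover smoothing."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    idx = np.flatnonzero(energy > energy_thresh)
    if idx.size == 0:
        return None
    return int(idx[0]) * frame_len, int(idx[-1] + 1) * frame_len

# Half a second of silence, one second of tone, half a second of silence
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
sig = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
start, end = detect_endpoints(sig)   # boundaries of the tone segment
```

The same interface also reports "no speech" (None) for an all-silent recording, which is the case the environment information in this application helps to disambiguate.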
As can be seen from the above, front-end processing in the prior art operates only on the speech signal itself, which greatly limits the extraction of the essential features of that signal. Therefore, there is a need in the art for a technique that can perform front-end processing on speech signals using information from multiple sources.
Disclosure of Invention
The present disclosure provides a speech processing method and device, which fuse environmental factors around a microphone array into the speech front-end processing, so as to improve the efficiency and accuracy of the front-end processing stage and, in turn, the accuracy of subsequent speech recognition and semantic understanding.
Specifically, the voice processing method and the voice processing device are realized as follows:
a method of speech processing, the method comprising:
acquiring at least one ambient signal of a microphone array for recording a speech signal;
processing the environment signal to determine environment information of the microphone array;
and carrying out voice front-end processing on the recorded voice signal according to the environment information.
A speech processing apparatus comprising a processor and a memory for storing processor-executable instructions, the instructions when executed by the processor implementing the steps of the method of the above embodiments.
A speech processing device comprising: microphone array, at least one sensing device, processor, wherein:
the microphone array is used for recording voice signals;
the at least one sensing device is deployed in a recording range of the microphone array and used for acquiring an environmental signal in the recording range of the microphone array;
and the processor is used for processing the environment signal, determining the environment information of the microphone array and carrying out voice front-end processing on the recorded voice signal according to the environment information.
A business service device comprising a voice processing module, the voice processing module coupled to a host of the business service device, the voice processing module configured to:
acquiring at least one ambient signal of a microphone array for recording a speech signal;
processing the environment signal to determine environment information of the microphone array;
and carrying out voice front-end processing on the recorded voice signal according to the environment information.
The voice processing method and device described above can acquire at least one environmental signal of a microphone array used for recording speech signals, determine the environment information of the microphone array from that signal, and finally perform speech front-end processing on the recorded speech signal according to the environment information. With this technical scheme, environmental factors around the microphone array can be fused into the speech front-end processing, improving the efficiency and accuracy of the front-end processing stage and, in turn, the accuracy of subsequent speech recognition and semantic understanding.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of a speech processing apparatus provided in the present application;
FIG. 3 is a flow chart diagram of a voice interaction method provided by the present application;
FIG. 4 is a schematic diagram of a business logic implementation of voice interaction provided by the present application;
FIG. 5 is a flowchart illustrating a speech processing method according to an embodiment of the present application;
fig. 6 is a schematic block diagram of an embodiment of a speech processing apparatus provided in the present application.
Detailed Description
In order to help those skilled in the art better understand the technical solutions in the present application, these solutions will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art without inventive work based on the embodiments herein shall fall within the scope of protection of the present application.
As described above, the speech front-end processing stage in the prior art can only process the speech data itself, which greatly limits the extraction of the essential features of a speech signal. Based on such technical requirements, the present application provides a speech processing method that uses the environmental information of the microphone array recording the speech signal as a reference factor for speech front-end processing, so that accuracy is enhanced in both the endpoint detection and speech enhancement stages, which in turn benefits subsequent speech recognition and semantic understanding.
The voice processing method can be applied to various scenarios, such as conferences, press conferences, museums, and intelligent voice interaction. The following describes the speech processing method provided by the present application in a specific application scenario.
In the scenario shown in fig. 1, a user may purchase a subway ticket, a train ticket, a bus ticket, and the like through an intelligent voice ticket machine. During the interaction, the machine records the voice data uttered by the user through a microphone array. In a practical application scenario, however, the environment around the machine strongly affects subsequent speech recognition while the user's voice signal is being acquired: for example, if the surroundings are noisy, the noise must subsequently be suppressed, or the sound source must be accurately located. Therefore, as shown in fig. 1, various sensing devices may be deployed within the recording range of the ticket machine, such as a camera for acquiring a video signal, an infrared detector for acquiring an infrared image of the user, and a ground pressure sensor for sensing a pressure signal. Through the video information acquired by the camera, the ticket machine can confirm whether a ticket-buying user is present in front of it and whether that user is speaking; the infrared detector or the ground pressure sensor can likewise sense whether a user is within the recording range of the microphone. With the assistance of these sensing devices, the environmental information around the microphone array can be fused into the speech front-end processing, for example to accurately identify the starting point and ending point of a speech signal, or to accurately identify the sound source and apply speech enhancement to it.
In the above scenario, the ticket machine may be implemented as an intelligent all-in-one machine integrating all of the above hardware: infrared detector, camera, microphone array, pressure sensor, speech front-end processing module, and so on. The front-end processing module acquires the environmental signal collected by each sensing device and performs speech front-end processing on the recorded speech signal based on it. In other embodiments, an external device may be added to existing equipment; this external device may include an integrated-circuit chip, such as a single-chip microcomputer, an ARM processor, or the like, and can implement the technical solution provided in the embodiments of the present application. Fig. 2 shows a specific hardware structure. As shown in fig. 2, the present application may provide a speech processing apparatus in the form of a single-chip microcomputer with multiple pins. Specifically, the microphone array, video recording device, infrared sensor, pressure sensor, ultrasonic transmitter, and other sensing devices may each be electrically connected to the microcontroller, and some of its pins are connected to a business service device, which may be a ticket vending machine, a cash dispenser, a vending machine, or the like.
After the speech processing device performs denoising or speech enhancement, the network interface module shown in fig. 2 may send the processed speech signal to the speech-processing cloud in a wired or wireless manner. In the cloud, as shown in fig. 3, the speech recognition module performs speech recognition on the signal; once the content of the speech signal has been correctly recognized, it is passed to the semantic understanding module for natural language understanding. After the correct semantics of the speech signal have been determined, they are passed to a service response module, which provides the corresponding service to the user according to those semantics, for example issuing the purchased train or subway ticket. Finally, the cloud transmits the result of the service response to the business service device, which provides the specific service to the user. It should be noted that in an actual processing scenario, after acquiring the environmental signal of the microphone array, the speech processing device may instead send both the environmental signal and the speech signal recorded by the microphone array to the speech-processing cloud and perform the speech front-end processing there.
The business service device is not limited to the intelligent voice ticket machine; it may also be an intelligent voice conference assistant, a smart speaker, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, a television, and the like. Smart wearable devices include, but are not limited to, smart bracelets, smart watches, smart glasses, smart helmets, and smart necklaces. Accordingly, the usage scenarios may also include meeting rooms, large-scale evening events, exhibition halls, and other environments, without limitation herein.
Fig. 4 is a schematic diagram of implementing a service logic for performing voice interaction based on the voice interaction mode in fig. 3, which may include:
1) The hardware may comprise: a camera and a microphone array.
The camera and the microphone array may be disposed in the device 101 shown in fig. 1. The camera acquires portrait information, from which the position of the mouth can be determined, and thus the source position of the sound: the portrait information locates the mouth that is emitting sound, establishing from which direction the incoming sound is the sound to be acquired.
Once the direction of the desired sound is known, directional noise cancellation can be performed through the microphone array: the sound from the source direction is enhanced, while noise from other directions is suppressed.
That is, directional noise cancellation can be achieved by using the camera and the microphone array in combination.
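The camera-plus-array cooperation described above can be sketched as a delay-and-sum beamformer, the simplest form of directional enhancement; the linear array geometry, sample rate, and speed of sound below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def delay_and_sum(channels, sample_rate, mic_positions, angle_rad, c=343.0):
    """Steer a linear microphone array toward angle_rad (0 = broadside,
    pi/2 = endfire) by delaying each channel and averaging: coherent
    sound from that direction adds up, while noise from other
    directions partially cancels."""
    delays = mic_positions * np.sin(angle_rad) / c        # seconds per mic
    shifts = np.round(delays * sample_rate).astype(int)   # samples per mic
    shifts -= shifts.min()                                # make non-negative
    n = channels.shape[1] - shifts.max()
    out = np.zeros(n)
    for ch, s in zip(channels, shifts):
        out += ch[s:s + n]
    return out / len(channels)
```

The steering angle here is exactly what the camera-based mouth localization supplies, closing the loop between the visual and acoustic hardware.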
2) The local algorithms may include algorithms based on face recognition and algorithms based on signal processing.
The face-recognition-based algorithms can determine the identity of a user, locate the user's facial features, determine whether the user is facing the device, authenticate the user for payment, and so on; these functions are realized by the camera together with a local face recognition algorithm.
The signal processing algorithm determines the angle of the sound source once its position is known, and then controls the pickup of the microphone array to implement directional noise cancellation. The acquired speech may also undergo processing such as amplification and filtering.
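The angle computation mentioned above can be sketched as follows, assuming the camera reports the mouth position in the microphone array's own coordinate frame (a hypothetical convention; a real system must calibrate camera and array coordinates first).

```python
import math

def source_azimuth(mouth_xy, array_center_xy=(0.0, 0.0)):
    """Azimuth (degrees) of the detected mouth relative to the array
    center, measured from the array's forward (y) axis; positions are
    in metres in the array's coordinate frame (illustrative
    convention)."""
    dx = mouth_xy[0] - array_center_xy[0]
    dy = mouth_xy[1] - array_center_xy[1]
    return math.degrees(math.atan2(dx, dy))

# A mouth detected 0.5 m to the right and 0.5 m in front of the array
angle = source_azimuth((0.5, 0.5))
```

The resulting azimuth can be fed directly to a beam-steering routine as the pickup direction.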
3) Cloud processing. These steps may run in the cloud or locally, depending on the processing capability of the device, the usage environment, and so on. When implemented in the cloud, the algorithm models can be updated and tuned with the help of big data, effectively improving the accuracy of speech recognition, natural language understanding, and dialogue management.
The cloud processing mainly comprises: speech recognition, natural language understanding, dialog management, and the like.
Speech recognition mainly recognizes the content of the acquired speech: given a segment of speech data whose meaning must be understood, the specific text content must be known first, so the speech is converted into text by means of speech recognition.
For a machine, the text alone is not enough; the meaning it expresses must be determined. Natural language understanding then maps the text to its corresponding natural meaning, so that the intention of the user's utterance and the information it carries can be recognized.
Because the process is one of human-machine interaction, question-and-answer turns are involved. A question may be actively triggered by the dialogue management unit, and follow-up questions are generated based on the user's replies. These exchanges require the questions and the expected answers to be defined in advance. For example, in a session for purchasing subway tickets, questions such as "Which station do you want to go to?" and "How many tickets?" must be set, and the user is expected to provide the station name and the number of tickets. For cases where the user changes the station name, or revises an earlier reply during the dialogue, the dialogue management must provide the corresponding processing logic.
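The slot-filling behaviour described above (ask for missing information, accept corrections) can be sketched as follows; the slot names and action strings are hypothetical, not the patent's implementation.

```python
REQUIRED_SLOTS = ("station", "count")   # hypothetical slot names

def next_action(state, user_update):
    """Merge the user's latest reply into the dialogue state and decide
    what to do next; re-answering a slot simply overwrites it, which
    covers 'change the station name' style corrections."""
    state = {**state, **user_update}
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return state, f"ask:{slot}"
    return state, "confirm"

state, action = next_action({}, {})                      # ask for station
state, action = next_action(state, {"station": "West"})  # ask for count
state, action = next_action(state, {"count": 2})         # ready to confirm
state, action = next_action(state, {"station": "East"})  # correction accepted
```

Overwriting on merge is the simplest policy that handles mid-dialogue corrections; a production system would also validate slot values and handle cancellation.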
Dialogue management is not limited to preset conversations; the dialogue content can also be customized according to the user's identity, giving a better user experience.
The purpose of dialogue management is to communicate efficiently with the user so as to obtain the information required to perform the operations.
Speech recognition, natural language understanding, and dialogue management may each be implemented in the cloud or locally, depending on the processing capability of the device, the usage environment, and so on. When implemented in the cloud, the algorithm models can be updated and tuned with the help of big data, effectively improving their accuracy. For the various payment and voice interaction scenarios, the speech processing model can be iteratively analyzed and optimized, improving the payment and voice interaction experience.
4) Business logic, i.e., the services that the device can provide.
For example, the services may include: payment, ticketing, inquiry, query result presentation, and the like. Through the setting of hardware, a local algorithm and cloud processing, the device can execute the provided service.
For example, a ticketing device issues a ticket when the user requests one through human-machine interaction; a service consulting device lets the user obtain the required information through the device, and so on. These business scenarios are usually paid, so the business logic generally includes a payment flow; after the user pays, the corresponding service is provided.
Through this business logic and the combined vision-and-voice interaction scheme, noise can be reduced and recognition accuracy improved, interference from nearby two-person conversations and false wake-ups can be avoided, and the user can interact through natural speech.
the speech processing method described in the present application is described in detail below with reference to the drawings. FIG. 5 is a flowchart of a method of an embodiment of a speech processing method provided herein. Although the present application provides method steps as shown in the following examples or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In the case of steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the embodiments of the present application. The method can be executed in sequence or in parallel in the voice processing process in practice according to the embodiment or the method shown in the figure (for example, a parallel processor or a multi-thread processing environment).
Specifically, as shown in fig. 5, an embodiment of the speech processing method provided in the present application may include:
s501: at least one ambient signal of a microphone array for recording a speech signal is acquired.
S503: processing the ambient signal to determine ambient information for the microphone array
S505: and carrying out voice front-end processing on the recorded voice signal according to the environment information.
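A minimal sketch of steps S501 to S505 as a pipeline follows; the sensor names, the presence rule, and the gating behaviour are illustrative assumptions only, not the patent's concrete implementation.

```python
def acquire_environment_signals(sensors):
    """S501: collect one reading from each sensing device deployed in
    the recording range of the microphone array."""
    return {name: read() for name, read in sensors.items()}

def determine_environment_info(signals):
    """S503: reduce the raw readings to environment information; here
    just a presence flag, combined with a simple OR rule."""
    return {"speaker_present": signals.get("pressure", 0.0) > 0.5
                               or signals.get("infrared", 0.0) > 0.5}

def front_end_process(speech, env_info):
    """S505: adapt front-end processing to the environment info; here
    the signal is simply gated when no speaker is present."""
    return speech if env_info["speaker_present"] else []

sensors = {"pressure": lambda: 0.9, "infrared": lambda: 0.1}
env = determine_environment_info(acquire_environment_signals(sensors))
processed = front_end_process([0.1, 0.2, 0.3], env)   # passes through
```

In a full system, the S505 stage would select between endpoint detection and enhancement strategies rather than merely gating the signal.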
In this embodiment, the speech signal is recorded by a microphone array. Microphone arrays support echo cancellation, beamforming, sound source localization, noise reduction, dereverberation, and so on, and thus apply to many scenarios, such as intelligent voice interaction, conferences, stages, press conferences, and museums. To strengthen capabilities of the microphone array such as sound source localization and noise reduction, this embodiment acquires at least one environmental signal of the microphone array, from which the environment information of the array can be extracted. The environment information may include, for example, information about sound-emitting objects within the recording range of the array, illumination information, and noise information. In one embodiment of the present application, the environmental signal may include at least one of: a video signal, an infrared signal, a gravity (pressure) sensing signal, and an ultrasonic signal. The video signal may be recorded by a video recording device within the recording range of the microphone array; the infrared signal may be acquired by an infrared sensing device within that range; the gravity sensing signal may be ground pressure information acquired by a pressure sensing device within that range; and the ultrasonic signal may be acquired by an ultrasonic device within that range.
In one embodiment of the present application, the environmental signal is obtained by placing sensing devices within the recording range of the microphone array. Specifically, at least one sensing device may be deployed within the recording range, and when the microphone array records a speech signal, the sensing device is started to acquire the environmental signal of the array. For example, in a conference scene, a video recording device may be placed facing the microphone array so that a video signal containing the sound-emitting objects can be recorded, and pressure sensors may be placed at various positions on the ground within the recording range so that the presence or absence of a sound-emitting object can be recognized. In this embodiment, when the microphone array starts to record a speech signal, the sensing devices are triggered to acquire the environmental signal of the array.
In one embodiment of the present application, when determining the environment information of the microphone array from the environmental signal, the environmental signal may be used to determine whether a sound-emitting object exists within the recording range of the array. A sound-emitting object may be a person, an intelligent machine, or anything else that emits a speech signal. Specifically, the determination may use the video signal, the infrared signal, the gravity sensing signal, or the ultrasonic signal. In one embodiment, the acquired video signal within the recording range is used to determine whether a sound-emitting object is present. In another embodiment, the acquired infrared signal is used to form an infrared image of the recording range, from which the presence of a sound-emitting object is determined. In another embodiment, the acquired gravity sensing signal within the recording range is used to judge whether a sound-emitting object is present. In another embodiment, the acquired ultrasonic signal is used to form an ultrasonic image of the recording range, from which the presence of a sound-emitting object is determined.
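As one concrete example of the infrared branch above, presence can be judged by counting warm pixels in a low-resolution thermal image; the temperature threshold and minimum region size are illustrative assumptions.

```python
def presence_from_infrared(ir_image, temp_thresh=30.0, min_pixels=4):
    """Judge whether a warm, person-sized region appears in a 2-D grid
    of infrared temperature readings (degrees Celsius); threshold and
    minimum region size are illustrative, and a real system would use
    connected-component analysis rather than a raw pixel count."""
    hot = sum(1 for row in ir_image for t in row if t > temp_thresh)
    return hot >= min_pixels

cold_room = [[18.0] * 4 for _ in range(4)]
with_person = [row[:] for row in cold_room]
for r in range(2):
    for c in range(3):
        with_person[r][c] = 34.5   # a warm 2x3 blob, body temperature
```

The pressure-sensor and ultrasonic branches would follow the same pattern: a per-sensor predicate feeding a common presence decision.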
In this embodiment, if it is determined that a sound-emitting object exists within the recording range of the microphone array, information about the object is acquired and used as the environment information of the array. In one embodiment of the present application, the information about the sound-emitting object may include at least one of: the number of sound-emitting objects, their utterance states, their positions, their facial information, and their gender and age.
In an embodiment of the present application, the number and positions of the sound-emitting objects may be obtained from the video, infrared, or ultrasonic signal. In one embodiment, they are identified directly from the video signal. In another embodiment, the infrared signal is used to form an infrared image of the sound-emitting objects within the recording range, from which their number and positions are identified. In another embodiment, the ultrasonic signal is used to form an ultrasonic image of the sound-emitting objects within the recording range, from which their number and positions are identified.
In one embodiment of the present application, the utterance state, facial information, gender, and age of a sound-emitting object are identified from the video signal. Specifically, face recognition may be applied to a recognized sound-emitting object in the video to identify its utterance state, gender, and age, and even its identity according to the facial information. In the speech front-end processing stage, this information helps in enhancing or denoising the recorded speech signal using the sound characteristics of different genders and ages.
In this embodiment, after the environment information of the microphone array has been determined, speech front-end processing is performed on the recorded speech signal according to that information. Speech front-end processing techniques include endpoint detection and speech enhancement. Endpoint detection distinguishes speech from non-speech segments in the recorded signal so as to accurately determine the starting point and ending point of the speech; afterwards, subsequent processing can be applied to the speech signal only, which is important for the accuracy and efficiency of the speech recognition model. In this embodiment, the environment information can improve the efficiency and accuracy of endpoint detection. Specifically, in one embodiment of the present application, the starting point and ending point of a speech signal may be determined from the utterance state of the sound-emitting object, where the utterance state may include the object's mouth shape: when the video signal shows that the object's mouth has started to move, the starting point of the corresponding speech signal can be determined. Even if the sound-emitting object later changes, or pauses its utterance, its utterance state can be quickly determined from the video signal, and thus the starting point and ending point of the speech signal. Combined with the processing technology of the microphone array, once the sound-emitting object and its utterance state are determined, speech processing can enhance the target sound source while suppressing other noise.
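The mouth-movement cue described above can be sketched as follows; the per-frame flags would come from a face and mouth analyzer, and the 40 ms video frame period is an illustrative assumption.

```python
def visual_endpoints(mouth_moving, frame_ms=40):
    """Derive speech start/end times (in ms) from per-video-frame
    mouth-movement flags, the visual cue for endpoint detection; a
    real system would fuse this with an audio-based detector rather
    than rely on it alone."""
    start = end = None
    for i, moving in enumerate(mouth_moving):
        if moving:
            if start is None:
                start = i * frame_ms
            end = (i + 1) * frame_ms
    return start, end

flags = [False, False, True, True, True, False, True, False]
start, end = visual_endpoints(flags)
# The brief pause at frame 5 is bridged rather than ending the
# utterance, matching the "pauses its utterance" case in the text.
```

Returning (None, None) for an all-still sequence corresponds to detecting that no speech occurred at all.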
Of course, in other embodiments, endpoint detection of the recorded voice signal may also be implemented based on information such as the position, facial information, gender, and age of the sound-producing object, which is not limited in the present application.
In voice front-end processing, the main task of speech enhancement is to eliminate the influence of environmental noise on the speech signal; a common approach is to filter the speech signal using methods such as Wiener filtering and Kalman filtering. In one embodiment of the present application, the position of the sound-producing object relative to the microphone array can be determined from the position information of the sound-producing object, and the sound source localization and speech enhancement steps can then be performed accordingly.
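A minimal sketch of the Wiener-filtering idea mentioned above, assuming the noise power spectrum has already been estimated (for example, averaged over frames that endpoint detection marked as non-speech). The function names and the gain floor are illustrative assumptions, not from the source.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=1e-3):
    """Per-frequency Wiener gain G = max(SNR / (SNR + 1), floor)."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # A-priori SNR estimate via power subtraction, clipped at zero.
    snr = np.maximum(noisy_power - noise_power, 0.0) / np.maximum(noise_power, 1e-12)
    return np.maximum(snr / (snr + 1.0), floor)

def enhance_frame(spectrum, noise_power):
    """Attenuate one complex STFT frame by the Wiener gain."""
    gain = wiener_gain(np.abs(spectrum) ** 2, noise_power)
    return gain * spectrum
```

A bin with noisy power 4 and noise power 1 gets gain 3/(3+1) = 0.75, so strong-SNR bins pass nearly unchanged while noise-dominated bins are pushed toward the floor.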
In one embodiment of the present application, multiple environment modes may be preset for the microphone array, each environment mode having a corresponding voice front-end processing manner. In one embodiment of the present application, the environment mode may include one of: quiet mode, outdoor mode, conference mode, noisy mode. The quiet mode may indicate that the environment around the microphone array is quiet, without excessive noise, such as the environment of a smart voice interaction device used at home, so the corresponding voice front-end processing can be relatively simple. The outdoor mode may indicate that the environment around the microphone array is noisy, for example with traffic noise from a road when a user uses a smart voice interaction device outdoors; the surrounding noise therefore needs to be suppressed during the corresponding voice front-end processing. The conference mode may indicate that there are multiple sound-producing objects around the microphone array, so the position of each sound-producing object needs to be accurately located during the corresponding voice front-end processing. The noisy mode may indicate a noisy indoor environment, such as an airport or a subway station, so the noise emitted by the many other sound-producing objects around needs to be suppressed and the position of the target sound source accurately located.
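The four environment modes and their differing front-end requirements could be captured as a simple lookup table. The parameter names and values below are illustrative assumptions for the sketch only, not part of the application.

```python
# Hypothetical per-mode front-end configuration; the mode names follow
# the four modes described above, the parameter values are invented.
FRONT_END_CONFIG = {
    "quiet":      {"noise_suppression": "low",    "localize_sources": False},
    "outdoor":    {"noise_suppression": "high",   "localize_sources": False},
    "conference": {"noise_suppression": "medium", "localize_sources": True},
    "noisy":      {"noise_suppression": "high",   "localize_sources": True},
}

def front_end_config(mode):
    """Return the front-end processing parameters for an environment mode."""
    try:
        return FRONT_END_CONFIG[mode]
    except KeyError:
        raise ValueError(f"unknown environment mode: {mode}")
```

A table like this keeps the mode decision (from the environment information) separate from the processing it triggers.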
The above voice processing method acquires at least one environment signal of a microphone array used for recording a voice signal, determines the environment information of the microphone array according to the environment signal, and finally performs voice front-end processing on the recorded voice signal according to the environment information. With the technical solution provided by the present application, the environmental factors around the microphone array can be incorporated into the voice front-end processing, which can improve the efficiency and accuracy of the front-end processing and, in turn, the accuracy of subsequent speech recognition and semantic understanding.
The present application further provides a speech processing apparatus, which may include a processor and a memory for storing processor-executable instructions, where the processor executes the instructions to implement the steps of the method according to any of the above embodiments.
As shown in fig. 6, the present application also provides a speech processing apparatus including: a microphone array, at least one sensing device, a processor, wherein,
the microphone array is used for recording voice signals;
the at least one sensing device is deployed in a recording range of the microphone array and used for acquiring an environmental signal in the recording range of the microphone array;
and the processor is used for processing the environment signal, determining the environment information of the microphone array and carrying out voice front-end processing on the recorded voice signal according to the environment information.
Optionally, in an embodiment of the present application, the acquiring, by the processor, at least one environment signal of a microphone array for recording a voice signal may include:
arranging at least one sensing device in the recording range of the microphone array;
and when the microphone array records a voice signal, starting the at least one sensing device to acquire an environment signal of the microphone array.
Optionally, in an embodiment of the present application, the environment signal may include at least one of: video signals, infrared signals, gravity sensing signals, ultrasonic signals.
Optionally, in an embodiment of the present application, the processor processes the environment signal, and determining the environment information of the microphone array may include:
judging whether a sound production object exists in the recording range of the microphone array or not by using the environment signal;
if the fact that a sound generating object exists in the recording range of the microphone array is determined, information of the sound generating object is obtained;
and taking the information of the sound production object as the environment information of the microphone array.
Optionally, in an embodiment of the present application, the information of the sound object may include at least one of: the number of the sound-emitting objects, the sound-emitting state of the sound-emitting objects, the positions of the sound-emitting objects, the facial information of the sound-emitting objects, the sex and the age of the sound-emitting objects.
Optionally, in an embodiment of the present application, the processor determining whether a sound object exists in the recording range of the microphone array by using the environment signal may include one of the following ways:
judging whether a sound production object exists in a recording range by using the acquired video signal in the recording range of the microphone array;
or, utilizing the acquired infrared signals in the recording range of the microphone array to perform infrared imaging on the recording range, and judging whether a sound production object exists in the recording range from the infrared imaging;
or judging whether a sound production object exists in the recording range by using the acquired gravity sensing signal in the recording range of the microphone array;
or carrying out ultrasonic imaging on the recording range by utilizing the acquired ultrasonic signals in the recording range of the microphone array, and judging whether a sound production object exists in the recording range in the ultrasonic imaging.
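The four alternative judgment manners above amount to a dispatch on the modality of the environment signal. The sketch below assumes a hypothetical signal representation (a dict with a type tag) and takes the per-modality detectors, which stand in for the video analysis, infrared imaging, gravity sensing, and ultrasonic imaging branches, as injected callables.

```python
def sound_object_present(env_signal, detectors):
    """Judge whether a sound-producing object exists in the recording range.

    env_signal: hypothetical dict {"type": "video" | "infrared" |
                "gravity" | "ultrasonic", "data": ...}.
    detectors:  maps each signal type to a function data -> bool.
    """
    kind = env_signal["type"]
    if kind not in detectors:
        raise ValueError(f"unsupported environment signal type: {kind}")
    return bool(detectors[kind](env_signal["data"]))
```

Registering detectors this way lets the same judgment entry point serve whichever sensing devices happen to be deployed.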
Optionally, in an embodiment of the present application, the number of the sound-generating objects and the positions of the sound-generating objects may be set to be obtained in at least one of the following manners:
identifying the number of sound-producing objects and the positions of the sound-producing objects from a video signal;
or determining infrared imaging of sounding objects in the recording range of the microphone array by using infrared signals, and identifying the number of sounding objects and the positions of the sounding objects in the infrared imaging;
or determining ultrasonic imaging of the sounding objects in the recording range of the microphone array by using the ultrasonic signals, and identifying the number of the sounding objects and the positions of the sounding objects in the ultrasonic imaging.
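As a toy illustration of identifying the number and positions of sound-producing objects in an imaging result, the sketch below counts contiguous above-threshold segments in a one-dimensional intensity scan; a real infrared or ultrasonic image would call for two-dimensional connected-component analysis, and the threshold is an assumption.

```python
def count_objects_1d(scan, thresh):
    """Count contiguous above-threshold segments in a 1-D intensity scan
    (a stand-in for warm bodies in an infrared scan line) and return
    (count, [(start_index, end_index), ...])."""
    segments = []
    start = None
    for i, v in enumerate(scan):
        if v > thresh and start is None:
            start = i                       # segment begins
        elif v <= thresh and start is not None:
            segments.append((start, i - 1))  # segment ends
            start = None
    if start is not None:
        segments.append((start, len(scan) - 1))
    return len(segments), segments
```

The segment bounds give both the number of objects and their positions along the scan, the two pieces of information listed above.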
Alternatively, in an embodiment of the present application, the face information of the sound-emitting subject, the sex and age of the sound-emitting subject may be set to be identified from the video signal.
Optionally, in an embodiment of the present application, the performing, by the processor, voice front-end processing on the recorded voice signal according to the environment information may include:
determining an environmental pattern in which the microphone array is located from the environmental information, the environmental pattern including one of: quiet mode, outdoor mode, conference mode, noisy mode;
and carrying out voice front-end processing on the recorded voice signal according to the environment mode.
Optionally, in an embodiment of the present application, after determining the environment information of the microphone array, the processor may be further configured to perform:
adjusting the turned-on sensor according to the environment information of the microphone array, so that the sensor is matched with the environment information.
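One way to realize adjusting the turned-on sensing devices so that they match the environment information is a simple selection policy. The policy below (prefer infrared and ultrasonic sensing in the dark, video otherwise) is an assumption for illustration only, as is the `dark` key of the environment-information dict.

```python
def select_sensors(env_info, available):
    """Choose which sensing devices to keep turned on.

    env_info:  hypothetical dict of environment information,
               e.g. {"dark": True}.
    available: the set of sensing devices deployed in the
               recording range of the microphone array.
    """
    # In the dark a camera is of little use, so fall back to
    # modalities that do not depend on visible light.
    preferred = ["infrared", "ultrasonic"] if env_info.get("dark") else ["video"]
    chosen = [s for s in preferred if s in available]
    # If nothing preferred is deployed, keep whatever exists.
    return chosen or list(available)
```

Re-running the selection whenever the environment information changes keeps the active sensors matched to the environment.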
In the above embodiments, the processor may be implemented in any suitable way. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
In the present embodiment, the microphone unit may convert sound into an electric signal to form an audio file. The microphone unit may take the form of a resistive microphone, an inductive microphone, a capacitive microphone, an aluminum ribbon microphone, a moving coil microphone, or an electret microphone.
The embodiment of the present application further provides a service device, where the service device includes a voice processing module, the voice processing module is coupled with a host of the service device, and the voice processing module is configured to implement the method according to any of the above embodiments.
Embodiments of the present application further provide a computer storage medium, where computer program instructions are stored, and when executed, the computer program instructions may implement the method according to any of the above embodiments.
In this embodiment, the computer storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card).
The functions and effects of the computer storage medium provided in the present embodiment, which are realized when the program instructions thereof are executed, can be explained with reference to other embodiments.
In the 1990s, an improvement in a technology could be clearly distinguished as either a hardware improvement (for example, an improvement in a circuit structure such as a diode, a transistor, or a switch) or a software improvement (an improvement in a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD by themselves, without needing a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually making integrated circuit chips, this kind of programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development, while the original code to be compiled has to be written in a particular programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained simply by programming the method flow into an integrated circuit using one of the hardware description languages described above.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may essentially, or in part, be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
Although the present application has been described through embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application that do not depart from its spirit, and it is intended that the appended claims encompass such variations and permutations.

Claims (25)

1. A method of speech processing, the method comprising:
acquiring at least one ambient signal of a microphone array for recording a speech signal;
processing the environment signal to determine environment information of the microphone array;
and carrying out voice front-end processing on the recorded voice signal according to the environment information.
2. The method of claim 1, wherein obtaining a plurality of ambient signals for a microphone array used to record a speech signal comprises:
arranging at least one sensing device in the recording range of the microphone array;
and when the microphone array records a voice signal, starting the at least one sensing device to acquire an environment signal of the microphone array.
3. The method of claim 1 or 2, wherein the environmental signal comprises at least one of: video signals, infrared signals, gravity sensing signals, ultrasonic signals.
4. The method of claim 3, wherein the processing the ambient signal to determine the environmental information of the microphone array comprises:
judging whether a sound production object exists in the recording range of the microphone array or not by using the environment signal;
if the fact that a sound generating object exists in the recording range of the microphone array is determined, information of the sound generating object is obtained;
and taking the information of the sound production object as the environment information of the microphone array.
5. The method of claim 4, wherein the information of the sound emitting object comprises at least one of: the number of the sound-emitting objects, the sound-emitting state of the sound-emitting objects, the positions of the sound-emitting objects, the facial information of the sound-emitting objects, the sex and the age of the sound-emitting objects.
6. The method of claim 4, wherein the determining whether a sound object is present in the recording range of the microphone array using the environment signal comprises one of:
judging whether a sound production object exists in a recording range by using the acquired video signal in the recording range of the microphone array;
or, utilizing the acquired infrared signals in the recording range of the microphone array to perform infrared imaging on the recording range, and judging whether a sound production object exists in the recording range from the infrared imaging;
or judging whether a sound production object exists in the recording range by using the acquired gravity sensing signal in the recording range of the microphone array;
or carrying out ultrasonic imaging on the recording range by utilizing the acquired ultrasonic signals in the recording range of the microphone array, and judging whether a sound production object exists in the recording range in the ultrasonic imaging.
7. The method according to claim 5, wherein the number of the sound-emitting objects and the positions of the sound-emitting objects are set to be obtained according to at least one of the following modes:
identifying the number of sound-producing objects and the positions of the sound-producing objects from a video signal;
or determining infrared imaging of sounding objects in the recording range of the microphone array by using infrared signals, and identifying the number of sounding objects and the positions of the sounding objects in the infrared imaging;
or determining ultrasonic imaging of the sounding objects in the recording range of the microphone array by using the ultrasonic signals, and identifying the number of the sounding objects and the positions of the sounding objects in the ultrasonic imaging.
8. The method according to claim 5, wherein the face information of the sound-emitting subject, the sex and age of the sound-emitting subject are set to be identified from a video signal.
9. The method of claim 1, wherein the performing voice front-end processing on the recorded voice signal according to the environment information comprises:
determining an environmental pattern in which the microphone array is located from the environmental information, the environmental pattern including one of: quiet mode, outdoor mode, conference mode, noisy mode;
and carrying out voice front-end processing on the recorded voice signal according to the environment mode.
10. The method of claim 2, wherein after the determining environmental information of the microphone array, the method further comprises:
adjusting the turned-on sensor according to the environment information of the microphone array, so that the sensor is matched with the environment information.
11. The method of claim 4, wherein the performing voice front-end processing on the recorded voice signal according to the environment information comprises:
and based on the information of the sound-producing object, performing beam forming on the direction of the sound-producing object so as to enhance the voice signal in the direction of the sound-producing object.
12. A speech processing apparatus comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any of claims 1 to 11.
13. A speech processing device, comprising: microphone array, at least one sensing device, processor, wherein:
the microphone array is used for recording voice signals;
the at least one sensing device is deployed in a recording range of the microphone array and used for acquiring an environmental signal in the recording range of the microphone array;
and the processor is used for processing the environment signal, determining the environment information of the microphone array and carrying out voice front-end processing on the recorded voice signal according to the environment information.
14. The apparatus of claim 13, wherein the processor obtaining a plurality of ambient signals for a microphone array recording speech signals comprises:
arranging at least one sensing device in the recording range of the microphone array;
and when the microphone array records a voice signal, starting the at least one sensing device to acquire an environment signal of the microphone array.
15. The apparatus of claim 13 or 14, wherein the environmental signal comprises at least one of: video signals, infrared signals, gravity sensing signals, ultrasonic signals.
16. The apparatus of claim 15, wherein the processor processes the ambient signal and wherein determining the environmental information of the microphone array comprises:
judging whether a sound production object exists in the recording range of the microphone array or not by using the environment signal;
if the fact that a sound generating object exists in the recording range of the microphone array is determined, information of the sound generating object is obtained;
and taking the information of the sound production object as the environment information of the microphone array.
17. The device of claim 16, wherein the information of the sound emitting object comprises at least one of: the number of the sound-emitting objects, the sound-emitting state of the sound-emitting objects, the positions of the sound-emitting objects, the facial information of the sound-emitting objects, the sex and the age of the sound-emitting objects.
18. The apparatus of claim 16, wherein the processor using the ambient signal to determine whether a sound object is present within the recording range of the microphone array comprises one of:
judging whether a sound production object exists in a recording range by using the acquired video signal in the recording range of the microphone array;
or, utilizing the acquired infrared signals in the recording range of the microphone array to perform infrared imaging on the recording range, and judging whether a sound production object exists in the recording range from the infrared imaging;
or judging whether a sound production object exists in the recording range by using the acquired gravity sensing signal in the recording range of the microphone array;
or carrying out ultrasonic imaging on the recording range by utilizing the acquired ultrasonic signals in the recording range of the microphone array, and judging whether a sound production object exists in the recording range in the ultrasonic imaging.
19. The device according to claim 17, wherein the number of the sound-emitting objects and the positions of the sound-emitting objects are set to be obtained according to at least one of the following modes:
identifying the number of sound-producing objects and the positions of the sound-producing objects from a video signal;
or determining infrared imaging of sounding objects in the recording range of the microphone array by using infrared signals, and identifying the number of sounding objects and the positions of the sounding objects in the infrared imaging;
or determining ultrasonic imaging of the sounding objects in the recording range of the microphone array by using the ultrasonic signals, and identifying the number of the sounding objects and the positions of the sounding objects in the ultrasonic imaging.
20. The apparatus according to claim 17, wherein the face information of the sound-emitting subject, the sex and age of the sound-emitting subject are set to be identified from a video signal.
21. The apparatus of claim 13, wherein the processor performs voice front-end processing on the recorded voice signal according to the environment information comprises:
determining an environmental pattern in which the microphone array is located from the environmental information, the environmental pattern including one of: quiet mode, outdoor mode, conference mode, noisy mode;
and carrying out voice front-end processing on the recorded voice signal according to the environment mode.
22. The device of claim 14, wherein after determining the environmental information of the microphone array, the processor is further configured to perform:
adjusting the turned-on sensor according to the environment information of the microphone array, so that the sensor is matched with the environment information.
23. The apparatus of claim 16, wherein the processor performs voice front-end processing on the recorded voice signal according to the environment information comprises:
and based on the information of the sound-producing object, performing beam forming on the direction of the sound-producing object so as to enhance the voice signal in the direction of the sound-producing object.
24. A business service device comprising a voice processing module, the voice processing module coupled to a host of the business service device, the voice processing module configured to:
acquiring at least one ambient signal of a microphone array for recording a speech signal;
processing the environment signal to determine environment information of the microphone array;
and carrying out voice front-end processing on the recorded voice signal according to the environment information.
25. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 11.
CN201810603522.9A 2018-06-06 2018-06-06 Voice processing method and device Pending CN110634498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603522.9A CN110634498A (en) 2018-06-06 2018-06-06 Voice processing method and device

Publications (1)

Publication Number Publication Date
CN110634498A true CN110634498A (en) 2019-12-31

Family

ID=68967550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603522.9A Pending CN110634498A (en) 2018-06-06 2018-06-06 Voice processing method and device

Country Status (1)

Country Link
CN (1) CN110634498A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080252595A1 (en) * 2007-04-11 2008-10-16 Marc Boillot Method and Device for Virtual Navigation and Voice Processing
CN106358061A (en) * 2016-11-11 2017-01-25 四川长虹电器股份有限公司 Television voice remote control system and television voice remote control method
CN106710601A (en) * 2016-11-23 2017-05-24 合肥华凌股份有限公司 Voice signal de-noising and pickup processing method and apparatus, and refrigerator
CN107346661A (en) * 2017-06-01 2017-11-14 李昕 A kind of distant range iris tracking and acquisition method based on microphone array

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883160A (en) * 2020-08-07 2020-11-03 上海茂声智能科技有限公司 Method and device for picking up and reducing noise of directional microphone array
CN111883160B (en) * 2020-08-07 2024-04-16 上海茂声智能科技有限公司 Directional microphone array pickup noise reduction method and device
CN111933174A (en) * 2020-08-16 2020-11-13 云知声智能科技股份有限公司 Voice processing method, device, equipment and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination