CN113744732A

CN113744732A - Equipment wake-up related method and device and story machine

Info

Publication number: CN113744732A
Application number: CN202010481877.2A
Authority: CN
Inventors: 刘章; 田彪; 李昀; 王子腾; 纳跃跃
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2021-12-03

Abstract

The application discloses a device wake-up related system, method and apparatus. The device wake-up method comprises the following steps: determining the wake-up word probability of the previous voice frame; adjusting a filter coefficient according to the wake-up word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and identifying the wake-up word according to the current voice frame after the voice noise is suppressed so as to adjust the device to a wake-up state. With this processing mode, the beamforming filter coefficient is updated using the wake-up feedback of the previous voice frame, so that the target voice can be distinguished from the voice noise and a reliable noise reduction effect is obtained; the wake-up performance of the device under highly noisy human-voice interference can therefore be effectively improved.

Description

Equipment wake-up related method and device and story machine

Technical Field

The application relates to the technical field of automation control, and in particular to a device wake-up system, method and apparatus; a voice conference summary system, method and apparatus; a service starting system, method and apparatus; a story machine; an intelligent sound box; and electronic equipment.

Background

With the progress of voice recognition technology in recent years, intelligent story machines with a voice wake-up function have come into wide use. Because existing acoustic model technology cannot effectively overcome the influence of babble noise and human-voice interference, the voice wake-up performance drops sharply under highly noisy human-voice interference.

Microphone array signal processing technology can greatly improve the signal-to-noise ratio and the performance of a voice system. Beamforming is a common array signal processing algorithm; it has the advantages of a small amount of calculation and easy deployment, and is suitable for story machines with limited hardware performance. Currently, beamforming mainly relies on Voice Activity Detection (VAD) to distinguish noise from the target sound source and to supply that information as input.

However, in the process of implementing the invention, the inventors found that this technical scheme has at least the following problem: conventional VAD fails under human-voice noise, which severely degrades the wake-up performance of the story machine under human-voice interference. In summary, how to improve the beamforming scheme so that it distinguishes human-voice noise from the target sound source, and thereby improves the wake-up performance of the story machine under highly noisy human-voice interference, is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application provides a device awakening method, which aims to solve the problem that the awakening performance is low under the high noisy human voice interference in the prior art. The application additionally provides a device awakening system and device, a voice conference summary system, method and device, a service starting system, method and device, a story machine, an intelligent sound box and electronic equipment.

The application provides a device awakening method, which comprises the following steps:

determining the wake-up word probability of the previous voice frame;

adjusting a filter coefficient according to the wake-up word probability and the previous voice frame;

performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame;

and identifying the wake-up word according to the current voice frame after the voice noise is suppressed so as to adjust the device to a wake-up state.

Optionally, the determining the probability of the wakeup word of the previous speech frame includes:

performing voice enhancement processing on the previous voice frame according to the filter coefficient before adjustment;

determining the acoustic probability of the acoustic unit related to the awakening word in the last voice frame after voice enhancement;

and determining the probability of the awakening word according to the acoustic probability of the acoustic unit related to the awakening word.

Optionally, the determining the probability of the wake-up word according to the acoustic probability of the acoustic unit related to the wake-up word includes:

and taking the maximum acoustic probability as the awakening word probability.

Optionally, the method further includes:

sequentially storing each voice frame to a buffer queue according to the acquisition time of the voice frame;

and reading the last voice frame from the buffer queue according to the processing time length for determining the acoustic probability.
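The buffering step can be illustrated with a minimal sketch. It assumes that the processing time needed to determine the acoustic probability is expressed as a whole number of frames (`delay_frames`); the class name `FrameBuffer` and its methods are hypothetical and are not taken from the application.

```python
from collections import deque

class FrameBuffer:
    """FIFO that pairs delayed wake-up feedback with the raw frame it refers to."""

    def __init__(self, delay_frames: int):
        # delay_frames: assumed processing latency of the wake-up device, in frames
        self.delay_frames = delay_frames
        self._queue = deque()

    def push(self, frame):
        """Store a newly captured frame in acquisition order."""
        self._queue.append(frame)

    def pop_delayed(self):
        """Return the frame captured delay_frames frames ago, or None while the buffer fills."""
        if len(self._queue) > self.delay_frames:
            return self._queue.popleft()
        return None

buf = FrameBuffer(delay_frames=2)
outputs = []
for t in range(5):
    buf.push(f"frame-{t}")
    outputs.append(buf.pop_delayed())
```

With a delay of two frames, the first two reads return nothing, and from the third frame onward each read returns the frame captured two frames earlier.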

Optionally, the adjusting the filter coefficient according to the probability of the wakeup word and the previous speech frame includes:

taking the probability of the awakening word as the weight of the last voice frame, and determining a target covariance matrix and a noise covariance matrix;

and determining the adjusted filter coefficient according to the target covariance matrix and the noise covariance matrix by a beam forming algorithm.
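The two sub-steps above can be sketched for a single frequency bin as follows. This is an illustration under stated assumptions, not the application's exact formulas: the recursive smoothing with forgetting factor `alpha`, the use of `1 - p_wake` as the noise weight, the diagonal loading `eps`, and the function names are all assumptions, and the filter uses the textbook parametric multichannel Wiener filter (PMWF) expression.

```python
import numpy as np

def update_covariances(phi_s, phi_n, x, p_wake, alpha=0.95):
    """Recursively update target and noise covariance matrices for one frequency bin.

    x      : complex vector of shape (M,), one bin of the previous frame over M microphones
    p_wake : wake-up word probability of that frame, used as a soft weight in [0, 1]
    alpha  : forgetting factor (assumed value)
    """
    outer = np.outer(x, x.conj())
    phi_s = alpha * phi_s + (1.0 - alpha) * p_wake * outer
    phi_n = alpha * phi_n + (1.0 - alpha) * (1.0 - p_wake) * outer
    return phi_s, phi_n

def pmwf_weights(phi_s, phi_n, beta=1.0, ref_mic=0, eps=1e-6):
    """Textbook PMWF: w = (Phi_n^-1 Phi_s u_ref) / (beta + trace(Phi_n^-1 Phi_s))."""
    m = phi_n.shape[0]
    phi_n_reg = phi_n + eps * np.eye(m)        # diagonal loading for numerical stability
    ratio = np.linalg.solve(phi_n_reg, phi_s)  # Phi_n^-1 Phi_s
    return ratio[:, ref_mic] / (beta + np.trace(ratio))

# Usage with two microphones and one previous-frame observation
M = 2
phi_s = np.zeros((M, M), dtype=complex)
phi_n = np.zeros((M, M), dtype=complex)
x_prev = np.array([1.0 + 0.0j, 0.0 + 1.0j])
phi_s, phi_n = update_covariances(phi_s, phi_n, x_prev, p_wake=0.9)
w = pmwf_weights(phi_s, phi_n)
```

A frame with a high wake-up word probability contributes mostly to the target covariance, and a frame with a low probability contributes mostly to the noise covariance, so interfering speech that does not contain the wake-up word is treated as noise. Setting `beta=0` in this expression gives the MVDR beamformer.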

Optionally, the previous speech frame includes: a speech frame adjacent to the current speech frame, or a speech frame not adjacent to the current speech frame.

Optionally, the voice frame includes a voice frame collected by a plurality of microphones.

The present application further provides an apparatus wake-up device, including:

the awakening word probability determining unit is used for determining the awakening word probability of the previous voice frame;

the filter coefficient adjusting unit is used for adjusting the filter coefficient according to the awakening word probability and the previous voice frame;

the voice noise suppression unit is used for executing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress voice noise except the target sound source in the current voice frame;

and the awakening unit is used for identifying awakening words according to the voice frame after the voice noise is suppressed so as to adjust the equipment to an awakening state.

Optionally, the wakeup word probability determining unit includes:

the voice enhancement unit is used for executing voice enhancement processing on the previous voice frame according to the filter coefficient before adjustment;

the acoustic probability determining subunit is used for determining the acoustic probability of the acoustic unit related to the awakening word in the last voice frame after voice enhancement;

and the awakening word probability determining subunit is used for determining the awakening word probability according to the acoustic probability of the acoustic unit related to the awakening word.

Optionally, the filter coefficient adjusting unit includes:

the weighting subunit is used for determining a target covariance matrix and a noise covariance matrix according to the weighted previous voice frame by taking the awakening word probability as the weight of the previous voice frame;

and the filter coefficient determining subunit is used for determining the adjusted filter coefficient according to the target covariance matrix and the noise covariance matrix through a beam forming algorithm.

The present application further provides a story machine, including:

a processor; and

a memory for storing a program for implementing a method for waking up a device, the device being powered on and running the program of the method via the processor for performing the steps of: determining the probability of awakening words of the last voice frame; adjusting a filter coefficient according to the awakening word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to inhibit the voice noise except the target sound source in the current voice frame; and identifying the awakening words according to the voice frame after the voice noise is suppressed so as to adjust the story machine to an awakening state.

The present application further provides an electronic device, comprising:

a processor; and

a memory for storing a program for implementing a method for waking up a device, the device being powered on and running the program of the method via the processor for performing the steps of: determining the probability of awakening words of the last voice frame; adjusting a filter coefficient according to the awakening word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to inhibit the voice noise except the target sound source in the current voice frame; and identifying the awakening words according to the voice frame after the voice noise is suppressed so as to adjust the equipment to an awakening state.

Optionally, the device includes: an intelligent sound box or an intelligent television.

The application also provides a voice conference summary method, comprising:

determining the probability of a recording service starting word of the last voice frame;

adjusting a filter coefficient according to the starting word probability and the previous voice frame;

and recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the recording service and record the voice conference summary.

Optionally, the method further includes:

and filtering sound sources with speaking time length larger than a time length threshold value.

The present application further provides a speech conference summary device, including:

the voice recording service starting word probability determining unit is used for determining the voice recording service starting word probability of the previous voice frame;

the filter coefficient adjusting unit is used for adjusting the filter coefficient according to the starting word probability and the previous voice frame;

and the recording unit is used for recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the recording service and record the voice conference summary.

The application further provides an intelligent sound box, including:

a processor; and

a memory for storing a program for implementing the voice conference summary method, the device being powered on and running the program of the method via the processor to perform the steps of: determining the recording service starting word probability of the previous voice frame; adjusting a filter coefficient according to the starting word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the recording service and record the voice conference summary.

The present application further provides an electronic device, comprising:

a processor; and

The application also provides a service starting method, which comprises the following steps:

determining the probability of a target service starting word of the last voice frame;

and recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the target service.

The present application further provides a service initiation apparatus, comprising:

a service start word probability determining unit, configured to determine a target service start word probability of a previous speech frame;

and the service starting unit is used for recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the target service.

The application further provides an intelligent sound box, including:

a processor; and

a memory for storing a program for implementing the service starting method, the device being powered on and running the program of the method via the processor to perform the steps of: determining the target service starting word probability of the previous voice frame; adjusting a filter coefficient according to the starting word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and recognizing the starting word according to the voice frame after the voice noise is suppressed so as to start the target service.

The present application further provides an electronic device, comprising:

a processor; and

The present application further provides a device wake-up system, including:

the terminal equipment is used for receiving the awakening word probability of the last voice frame collected by the terminal equipment and sent by the server side, and adjusting the filter coefficient according to the last voice frame and the awakening word probability; performing voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filtering coefficient so as to inhibit the voice noise except the target sound source in the current voice frame; if the server identifies the awakening words according to the current voice frame after the voice noise is suppressed, the terminal equipment is adjusted to an awakening state;

the server is used for determining the probability of the awakening word of the previous voice frame; and identifying the awakening words according to the current voice frame after the voice noise is suppressed.

The present application further provides a device wake-up method, including:

receiving the awakening word probability of the last voice frame collected by the terminal equipment and sent by the server;

adjusting a filter coefficient according to the previous voice frame and the probability of the awakening word;

performing voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filtering coefficient so as to inhibit the voice noise except the target sound source in the current voice frame;

and if the server identifies the awakening words according to the current voice frame after the voice noise is suppressed, the terminal equipment is adjusted to an awakening state.

The present application further provides a device wake-up method, including:

determining the awakening word probability of the last voice frame collected by the terminal equipment, and returning the awakening word probability to the terminal equipment, so that the terminal equipment adjusts the filter coefficient according to the last voice frame and the awakening word probability, and performs voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filter coefficient so as to suppress the voice noise except the target sound source in the current voice frame;

identifying awakening words according to the current voice frame which is sent by the terminal equipment and inhibits the voice noise;

and notifying the terminal equipment that the wake-up word has been identified, so that the terminal equipment is adjusted to a wake-up state.

The present application further provides a voice conference summary system, including:

the terminal equipment is used for receiving the recording service starting word probability of the previous voice frame collected by the terminal equipment and sent by the server, and adjusting the filter coefficient according to the previous voice frame and the starting word probability; performing voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and, if the server recognizes the starting word according to the current voice frame after the voice noise is suppressed, starting the recording service to record the voice conference summary;

the server is used for receiving the previous voice frame sent by the terminal equipment and determining the starting word probability of the previous voice frame; and receiving the current voice frame after the voice noise is suppressed, which is sent by the terminal equipment, and recognizing the starting word according to the current voice frame after the voice noise is suppressed.

The application also provides a voice conference summary method, comprising:

receiving the probability of a recording service starting word of a last voice frame acquired by terminal equipment and sent by a server;

adjusting a filter coefficient according to the previous voice frame and the starting word probability;

and if the server recognizes the starting word according to the current voice frame after the voice noise is suppressed, starting the recording service to record the voice conference summary.

The application also provides a voice conference summary method, comprising:

determining the probability of starting words of the recording service of the previous voice frame according to the previous voice frame acquired by the terminal equipment, and returning the probability of starting words to the terminal equipment, so that the terminal equipment adjusts the filter coefficient according to the previous voice frame and the probability of starting words, and performs voice enhancement processing on the current voice frame acquired by the terminal equipment according to the adjusted filter coefficient so as to inhibit the voice noise except the target sound source in the current voice frame;

recognizing the starting word according to the current voice frame, sent by the terminal equipment, in which the voice noise has been suppressed;

and notifying the terminal equipment that the starting word has been recognized, so that the terminal equipment starts the recording service to record the voice conference summary.

The present application further provides a service initiation system, including:

the terminal equipment is used for receiving the target service starting word probability of the previous voice frame collected by the terminal equipment and sent by the server, and adjusting the filter coefficient according to the previous voice frame and the starting word probability; performing voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and, if the server recognizes the starting word according to the current voice frame after the voice noise is suppressed, starting the target service;

receiving the probability of a target service starting word of a last voice frame acquired by terminal equipment and sent by a server;

and if the server recognizes the starting word according to the current voice frame after the voice noise is suppressed, starting the target service.

determining the target service starting word probability of the last voice frame according to the last voice frame collected by the terminal equipment, and returning the starting word probability to the terminal equipment, so that the terminal equipment adjusts the filter coefficient according to the last voice frame and the starting word probability, and performs voice enhancement processing on the current voice frame collected by the terminal equipment according to the adjusted filter coefficient so as to inhibit the voice noise except the target sound source in the current voice frame;

and notifying the terminal equipment that the starting word has been recognized, so that the terminal equipment starts the target service.

The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.

The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.

Compared with the prior art, the method has the following advantages:

according to the equipment awakening method provided by the embodiment of the application, the awakening word probability of the last voice frame is determined; adjusting a filter coefficient according to the awakening word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to inhibit the voice noise except the target sound source in the current voice frame; identifying a wake-up word according to the current voice frame after the voice noise is suppressed so as to adjust the equipment to a wake-up state; by the processing mode, the beam forming filter coefficient is updated by combining the awakening feedback of the previous voice frame, so that the target voice and the voice noise can be distinguished, and a reliable noise reduction effect is obtained; therefore, the awakening performance of the equipment under the high noisy human voice interference can be effectively improved.

The voice conference summary method provided by the embodiment of the application determines the recording service starting word probability of the previous voice frame; adjusts a filter coefficient according to the starting word probability and the previous voice frame; performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and recognizes the starting word according to the voice frame after the voice noise is suppressed so as to start the recording service and record the voice conference summary. With this processing mode, the beamforming filter coefficient is updated using the recording service start feedback of the previous voice frame, so that the target voice can be distinguished from the voice noise and a reliable recording service starting effect is obtained; the recording service starting performance of the device under highly noisy human-voice interference can therefore be effectively improved.

The service starting method provided by the embodiment of the application determines the target service starting word probability of the previous voice frame; adjusts a filter coefficient according to the starting word probability and the previous voice frame; performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress the voice noise other than the target sound source in the current voice frame; and recognizes the starting word according to the voice frame after the voice noise is suppressed so as to start the target service. With this processing mode, the beamforming filter coefficient is updated using the service start feedback of the previous voice frame, so that the target voice can be distinguished from the voice noise and a reliable noise reduction effect is obtained; the service starting performance of the device under highly noisy human-voice interference can therefore be effectively improved.

Drawings

FIG. 1 is a flowchart of an embodiment of a device wake-up method provided in the present application;

FIG. 2 is a signal model diagram of an embodiment of a device wake-up method provided in the present application;

FIG. 3 is a detailed flowchart of an embodiment of a device wake-up method provided in the present application;

FIG. 4 is a schematic diagram illustrating determination of the wake-up word probability according to an embodiment of a device wake-up method provided in the present application;

FIG. 5 is a schematic diagram illustrating beamforming according to an embodiment of a device wake-up method provided in the present application;

FIG. 6 is a detailed flowchart of an embodiment of a device wake-up method provided in the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.

In the application, a device wake-up system, a method and a device, a voice conference summary system, a method and a device, a service starting system, a method and a device, a story machine, an intelligent sound box and an electronic device are provided. Each of the schemes is described in detail in the following examples.

First embodiment

Please refer to FIG. 1, which is a flowchart of an embodiment of a device wake-up method according to the present application. The execution subject of the method includes, but is not limited to, an intelligent story machine, and may also be another device with a wake-up function, such as an intelligent sound box or an intelligent repeater. In this embodiment, the method may include the following steps:

Step S101: determine the wake-up word probability of the previous voice frame.

The story machine has a voice wake-up function: it can be woken up directly by voice, without the user pressing a key. The story machine can collect voice data through a plurality of microphones (a microphone array). A segment of voice data can comprise a plurality of voice frames, each voice frame being, for example, a 10 ms segment of voice data, and the story machine can process the voice frames in sequence.
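As a minimal sketch of the framing described above, a multi-microphone recording can be cut into consecutive 10 ms frames. The 16 kHz sample rate, the non-overlapping framing, the data layout and the function name `split_into_frames` are assumptions made for illustration; the application does not specify them.

```python
from typing import List

def split_into_frames(channels: List[List[float]], sample_rate: int = 16000,
                      frame_ms: int = 10) -> List[List[List[float]]]:
    """Split multi-microphone audio into consecutive, non-overlapping frames.

    channels[m] holds the samples of microphone m. The result is a list of
    frames; each frame holds one slice of samples per microphone.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    n_samples = min(len(ch) for ch in channels)
    n_frames = n_samples // frame_len           # a trailing partial frame is dropped
    return [
        [ch[i * frame_len:(i + 1) * frame_len] for ch in channels]
        for i in range(n_frames)
    ]

# Two microphones, 25 ms of silence at 16 kHz: two full 10 ms frames
frames = split_into_frames([[0.0] * 400, [0.0] * 400])
```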

For convenience of description, the voice frame currently undergoing noise reduction processing is referred to as the current voice frame, a voice frame earlier in time than the current voice frame is referred to as the previous voice frame, and a voice frame later in time than the current voice frame is referred to as the next voice frame. The previous voice frame may be adjacent to the current voice frame, or may be separated from it by n frames (e.g., 1 frame or 2 frames). When the previous voice frame is not adjacent to the current voice frame, the filter coefficient does not need to be adjusted for every voice frame, which effectively saves computing resources and improves wake-up efficiency.
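One possible reading of the non-adjacent case is that the filter coefficient is re-estimated only once every n + 1 frames and reused unchanged in between. The sketch below illustrates that reading only; the function name and the modulo-based schedule are assumptions.

```python
def frames_triggering_update(num_frames: int, gap: int) -> list:
    """Return the indices of frames at which the filter coefficient is re-estimated.

    gap = 0 updates on every frame (adjacent feedback); gap = n skips n frames
    between updates, so the filter is reused unchanged in between.
    """
    return [t for t in range(num_frames) if t % (gap + 1) == 0]

every_frame = frames_triggering_update(6, gap=0)
every_third = frames_triggering_update(6, gap=2)
```

With a gap of 2, only one frame in three triggers a coefficient update, which is where the saving in computation comes from.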

The wake-up word probability may be the probability that a voice frame contains an acoustic unit related to the wake-up word. A wake-up word (e.g., that of a Tmall Genie story machine) typically spans multiple voice frames; each of these frames may contain only acoustic units related to the wake-up word, only acoustic units unrelated to the wake-up word, or both.

According to the method provided by the embodiment of the application, when the noise reduction processing is performed on the current voice frame, the filter coefficient is related to the probability of the awakening word of the previous voice frame, that is, the filter coefficient of the current voice frame is determined by combining the probability of the awakening word of the previous voice frame.

Please refer to FIG. 2, which is a signal model diagram of an embodiment of a device wake-up method according to the present application. In this embodiment, the noise reduction processing of the story machine involves the following modules: a beamforming (PMWF) module, a filter module, a wake-up VAD module, and a buffer queue. As shown in FIG. 2, the original multi-microphone signal (mic in) collected by the story machine is sent to the filter module and the buffer queue at the same time after entering the system. The filter converts the mic in signal into an enhanced signal (enhanced speech) through a filtering operation and sends the enhanced signal to the wake-up device. The wake-up device outputs a corresponding wake-up signal to an external module (e.g., an application-level module) and simultaneously outputs the corresponding acoustic signal (registers) to the wake-up VAD module. The PMWF module then updates the filter by combining the output of the wake-up VAD module with the corresponding original audio signal output by the buffer queue; as shown by the dotted line in the figure, this forms a feedback loop.
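The control flow of this feedback loop can be sketched as a skeleton with placeholder components. The function `run_feedback_loop`, its callable parameters and the scalar toy stand-ins below are hypothetical; a real system would plug in a multichannel filter, the wake-up device with the wake-up VAD module, and the beamforming update. The buffer delay is zero here for simplicity.

```python
from collections import deque

def run_feedback_loop(frames, coeffs, enhance, wake_probability, update_filter):
    """Process frames in order; each frame's wake-up word probability adapts
    the filter coefficients that are applied to the following frames."""
    buffer_queue = deque()
    probabilities = []
    for frame in frames:
        buffer_queue.append(frame)                   # raw signal goes to the buffer queue
        enhanced = enhance(coeffs, frame)            # filter: mic in -> enhanced speech
        p_wake = wake_probability(enhanced)          # wake-up device + wake-up VAD
        probabilities.append(p_wake)
        raw = buffer_queue.popleft()                 # raw frame matching this feedback
        coeffs = update_filter(coeffs, raw, p_wake)  # beamforming module updates the filter
    return coeffs, probabilities

# Toy scalar stand-ins, chosen only to make the data flow visible
final_coeffs, probs = run_feedback_loop(
    frames=[0.0, 1.0, 1.0],
    coeffs=0.5,
    enhance=lambda c, x: c * x,
    wake_probability=lambda y: min(1.0, abs(y)),
    update_filter=lambda c, raw, p: c + 0.5 * p * abs(raw),
)
```

The point of the skeleton is the ordering: the coefficients used to enhance a frame were computed from feedback on earlier frames, and the feedback from the frame just processed only affects later frames.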

Please refer to FIG. 3, which is a detailed flowchart of an embodiment of a device wake-up method according to the present application. In one example, step S101 may include the following sub-steps:

Step S1011: perform voice enhancement processing on the previous voice frame according to the filter coefficient before adjustment.

According to the method provided by the embodiment of the application, when noise reduction processing is performed on the current voice frame, the filter coefficient depends on the wake-up word probability of the previous voice frame, so the filter coefficients of different voice frames may differ. The filter coefficient applied to the previous voice frame is called the filter coefficient before adjustment, and the filter coefficient applied to the current voice frame is called the filter coefficient after adjustment. The filter coefficient before adjustment may itself have been determined according to the wake-up word probability of a still earlier voice frame.

Speech enhancement is a technique that, when a speech signal is disturbed or even submerged by various noises, extracts the useful speech signal from the noise background and suppresses or reduces the noise interference.

Specifically, the speech enhancement processing may be performed on the previous voice frame by a filter. The filter converts the signal of the previous voice frame into an enhanced signal through a filtering operation. Since the filter belongs to the mature prior art, it is not described here in detail.
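For illustration, the filtering operation of a beamformer can be written for one frequency bin as a filter-and-sum over the microphones, y = sum over m of conj(w_m) * x_m. The frequency-domain formulation and the name `apply_filter` are assumptions; the application does not fix the filter structure at this point.

```python
def apply_filter(weights, mic_bins):
    """Filter-and-sum for one frequency bin: y = sum_m conj(w_m) * x_m."""
    if len(weights) != len(mic_bins):
        raise ValueError("one weight per microphone is required")
    return sum(w.conjugate() * x for w, x in zip(weights, mic_bins))

# Two microphones; the second observes the same source with a 90-degree phase shift.
weights = [0.5 + 0j, 0.5j]
mic_bins = [1 + 0j, 1j]
enhanced_bin = apply_filter(weights, mic_bins)
```

Here the weights undo the phase shift of the second microphone, so the two observations add coherently and the source passes with unit gain.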

Step S1013: determine the acoustic probability of the acoustic units related to the wake-up word in the previous voice frame after voice enhancement.

After the filter converts the signal of the previous voice frame into an enhanced signal through the filtering operation, the enhanced signal can be sent to the wake-up device, and the wake-up device determines the acoustic probability of the acoustic units related to the wake-up word in the previous voice frame after voice enhancement.

With the method provided by the embodiment of the application, after the wake-up device identifies the wake-up signal, it not only outputs the corresponding wake-up signal to the external module but also outputs the corresponding acoustic signal to the wake-up VAD module at the same time. A voice frame may include acoustic units related to the wake-up word and acoustic units unrelated to it. For example, if the wake-up word is "hello tv" and the modeling units are ni, hao, dian, shi, wo, qu, na, etc., then the four units ni, hao, dian and shi are related units, and the three units wo, qu and na are unrelated units. In this embodiment, the wake-up device may output the acoustic probability of each acoustic unit to the wake-up VAD module.

In this embodiment, a wake-up device based on a feedforward sequential memory network (FSMN) is employed, as shown in Fig. 4. The audio signal passes through feature extraction and is fed into the FSMN-based wake-up model, which outputs the probability of each acoustic unit. Here, the acoustic units may be divided into wake-up-word-unrelated units and wake-up-word-related units.

The FSMN is a neural network architecture for sequence modeling that can effectively use historical and future input information to determine the acoustic probability of each acoustic unit in a voice frame; this processing mode yields higher wake-up word recognition accuracy. In a specific implementation, a wake-up device with another structure may also be adopted. Since wake-up devices are mature prior art, they are not described in detail here.

Step S1015: determining the wake-up word probability according to the acoustic probabilities of the wake-up-word-related acoustic units.

After the wake-up device outputs the acoustic probability of each acoustic unit to the wake-up VAD module, the wake-up VAD module can determine the wake-up word probability according to the acoustic probabilities of the wake-up-word-related acoustic units.

In this embodiment, the maximum of the acoustic probabilities of all wake-up-word-related acoustic units is used as the wake-up word probability. The wake-up VAD module converts the input acoustic model scores into the wake-up word probability p according to the following formula, where P_t(W_i) denotes the probability output of the i-th wake-up-word-related unit at time t:

$$p = \max_{W_i \,\in\, \text{wake-up-word-related units}} P_t(W_i)$$

In a specific implementation, the wake-up word probability may also be determined in other ways, for example by averaging the acoustic probabilities of all wake-up-word-related acoustic units.
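As a minimal sketch of how the wake-up VAD module might reduce per-unit acoustic probabilities to a single wake-up word probability, the following Python function supports both the maximum rule of this embodiment and the averaging alternative. The function name, the dictionary layout, and the example scores are illustrative assumptions and are not taken from the application.

```python
def wake_word_probability(unit_probs, related_units, mode="max"):
    """Reduce per-unit acoustic probabilities to one wake-up word probability.

    unit_probs    -- dict mapping acoustic unit name -> probability for one frame
    related_units -- names of the units that belong to the wake-up word
    mode          -- "max" (rule of this embodiment) or "mean" (alternative)
    """
    related = [unit_probs[u] for u in related_units if u in unit_probs]
    if not related:
        return 0.0  # assumption: no related unit scored means probability 0
    if mode == "max":
        return max(related)
    if mode == "mean":
        return sum(related) / len(related)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical scores for one frame of the "hello tv" example.
RELATED = ("ni", "hao", "dian", "shi")
frame_probs = {"ni": 0.05, "hao": 0.60, "dian": 0.10, "shi": 0.05,
               "wo": 0.10, "qu": 0.05, "na": 0.05}
p_max = wake_word_probability(frame_probs, RELATED)           # largest related score
p_mean = wake_word_probability(frame_probs, RELATED, "mean")  # average related score
```

The unrelated units (wo, qu, na) never enter the result, so a frame dominated by interfering speech yields a low probability regardless of how loud it is.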

After the probability of the awakening word of the previous voice frame is determined, the next step can be carried out, and the filter coefficient is adjusted by combining the awakening feedback.

Step S103: adjusting the filter coefficient according to the wake-up word probability and the previous voice frame.

As shown in Fig. 5, in the method provided by this embodiment, the beam forming module determines the filter coefficient of the current data frame according to the previous voice frame (a multi-microphone signal) and its wake-up word probability. That is, the filter coefficient of the current data frame is related to the wake-up word information of the previous voice frame. By contrast, a conventional beam forming module performs voice activity detection on the current data frame through a conventional VAD module and determines the filter coefficient of the current data frame from the voice activity detection result and the multi-microphone signal of the current data frame. That coefficient is unrelated to the previous data frame, let alone to the wake-up word information of the previous data frame, and the concept of a wake-up word probability does not exist.

Please refer to fig. 6, which is a flowchart illustrating an embodiment of a device wake-up method according to the present application. In this embodiment, step S103 may include the following sub-steps:

Step S1031: determining the target covariance matrix and the noise covariance matrix, using the wake-up word probability as the weight of the previous voice frame.

Step S1033: determining the adjusted filter coefficient from the target covariance matrix and the noise covariance matrix through a beam forming algorithm.

With the processing shown in Fig. 6, when the target covariance matrix and the noise covariance matrix are calculated, the previous voice frame is weighted by its wake-up word probability, and the beam coefficients are obtained through the PMWF formula, so that the output energy of the target sound source is enhanced.
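The two sub-steps can be sketched with NumPy as follows. The covariance weights p² and (1 − p)² follow the definitions of Φ_s and Φ_n given in this embodiment. The closed form used in `pmwf_weights` is the commonly cited parameterized multichannel Wiener filter, Φ_n⁻¹Φ_s u / (β + tr{Φ_n⁻¹Φ_s}), and is an assumption here, as are the reference-microphone vector u, the diagonal loading, and all function names.

```python
import numpy as np


def weighted_covariances(frames, probs):
    """Step S1031 sketch: probability-weighted target and noise covariances.

    frames -- (T, M) complex array, one multi-microphone observation per frame
    probs  -- (T,) wake-up word probabilities in [0, 1]
    """
    x = np.asarray(frames, dtype=complex)
    p = np.asarray(probs, dtype=float)
    outer = np.einsum("tm,tn->tmn", x, x.conj())  # x x^H for every frame
    phi_s = np.mean((p ** 2)[:, None, None] * outer, axis=0)
    phi_n = np.mean(((1.0 - p) ** 2)[:, None, None] * outer, axis=0)
    return phi_s, phi_n


def pmwf_weights(phi_s, phi_n, beta=1.0, ref_mic=0, eps=1e-6):
    """Step S1033 sketch: PMWF coefficients in the assumed standard form."""
    m = phi_n.shape[0]
    # Diagonal loading keeps phi_n invertible (added safeguard, not from the text).
    loaded = phi_n + eps * np.trace(phi_n).real / m * np.eye(m)
    ratio = np.linalg.solve(loaded, phi_s)  # phi_n^{-1} phi_s
    u = np.zeros(m)
    u[ref_mic] = 1.0                        # selects the reference microphone
    return (ratio @ u) / (beta + np.trace(ratio))


def enhance(weights, frame):
    """Apply the beamformer to one frame: y = w^H x."""
    return np.vdot(weights, frame)
```

In a toy two-microphone case where wake-up-word frames arrive as [1, 1] and interfering frames as [1, -1], the resulting weights pass the first direction almost unchanged and cancel the second.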

In a specific implementation, the following PMWF formula can be used for filter coefficient estimation:

where $\Phi_s = E\{p^2 x x^H\}$ and $\Phi_n = E\{(1-p)^2 x x^H\}$; p is the wake-up word probability; x is the voice input; E{·} denotes expectation, which can be implemented as an average; tr{·} is the trace of a matrix, that is, the sum of its diagonal elements; and β is a parameter that controls speech distortion.
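For reference, the commonly cited form of the parameterized multichannel Wiener filter (PMWF) that is consistent with the symbols defined above can be written as follows. This is an assumption about the form intended here; in particular, the reference-channel selection vector u, the coefficient vector h, and the output y are names introduced for illustration only.

```latex
\mathbf{h} \;=\; \frac{\Phi_n^{-1}\,\Phi_s}{\beta + \operatorname{tr}\{\Phi_n^{-1}\,\Phi_s\}}\;\mathbf{u},
\qquad
y \;=\; \mathbf{h}^{H}\mathbf{x}
```

In this form, β = 0 gives the MVDR beamformer and β = 1 gives the multichannel Wiener filter; larger β suppresses more noise at the cost of more speech distortion. Frames with p close to 1 contribute mainly to Φ_s, and frames with p close to 0 contribute mainly to Φ_n.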

In one example, the method may further include the following steps: storing each voice frame in turn, according to its acquisition time, in the buffer queue shown in Fig. 2; and reading the previous voice frame from the buffer queue according to the processing time needed to determine the acoustic probability. With this processing mode, the multi-microphone signal is aligned in time, through the first-in first-out queue shown in Fig. 2, with the wake-up word probability output by the wake-up VAD module, and the two are then input into the beam forming module together. Because the wake-up word probability is calculated immediately after the acoustic units are output, the output delay of the wake-up word probability equals the delay of the wake-up word model; this delay can be controlled by the FSMN model in Fig. 4 and can be kept relatively low.
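The time alignment performed by the buffer queue can be sketched as below. The class name, the `simulate` helper, and the fixed delay expressed in frames are assumptions made for illustration; the application only states that frames are stored in acquisition order and read back according to the processing time of the wake-up model.

```python
from collections import deque


class AlignmentBuffer:
    """First-in first-out buffer pairing each frame with its delayed probability."""

    def __init__(self):
        self._frames = deque()

    def store(self, frame):
        # Frames are appended in acquisition order.
        self._frames.append(frame)

    def read_for(self, probability):
        # The oldest buffered frame is the one the arriving probability refers to.
        if not self._frames:
            raise LookupError("no buffered frame for this probability")
        return self._frames.popleft(), probability

    def __len__(self):
        return len(self._frames)


def simulate(frames, delay_frames, prob_of):
    """Emit (frame, probability) pairs when the model lags by delay_frames."""
    buffer = AlignmentBuffer()
    aligned = []
    for t, frame in enumerate(frames):
        buffer.store(frame)
        if t >= delay_frames:
            # The probability of frame t - delay_frames becomes available now.
            aligned.append(buffer.read_for(prob_of(t - delay_frames)))
    return aligned
```

With a delay of two frames, the buffer holds at most three frames at any time, which is why a low model delay keeps both latency and memory use small.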

Step S105: performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame.

Because the adjusted filter coefficient enhances the output energy of the target sound source, performing voice enhancement on the current voice frame with this coefficient effectively suppresses the voice noise other than the target sound source in the current voice frame.

Step S107: identifying the wake-up word according to the voice frame after the voice noise is suppressed, so as to adjust the story machine to the wake-up state.

After the voice noise other than the target sound source in the current voice frame is suppressed, the wake-up device can identify the wake-up word from the voice frame after voice noise suppression, and the story machine can be adjusted to the wake-up state once the wake-up word is identified.
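The way steps S101 to S107 feed each other from frame to frame can be sketched as a simple loop. The callables passed in (`enhance`, `score_units`, `update_filter`) are hypothetical placeholders for the filter, the wake-up device, and the beam forming module, and declaring the wake-up word identified when a single frame crosses a threshold is a simplifying assumption, not the identification logic of the application.

```python
def run_wake_loop(frames, init_coeffs, enhance, score_units, update_filter,
                  related_units, threshold=0.5):
    """Return the index of the frame at which the device wakes, or None."""
    coeffs = init_coeffs
    prev_frame = prev_prob = None
    for index, frame in enumerate(frames):
        if prev_frame is not None:
            # S103: adjust coefficients from the previous frame and its probability.
            coeffs = update_filter(coeffs, prev_frame, prev_prob)
        # S105: enhance the current frame with the adjusted coefficients.
        enhanced = enhance(coeffs, frame)
        # S1013/S1015: per-unit scores, then the maximum over related units.
        unit_probs = score_units(enhanced)
        prob = max(unit_probs[u] for u in related_units)
        # S107 (simplified): wake as soon as the probability crosses the threshold.
        if prob >= threshold:
            return index
        # The current frame becomes the "previous frame" of the next iteration.
        prev_frame, prev_prob = frame, prob
    return None
```

The key point the loop makes visible is the feedback path: the probability computed for one frame is only ever used to adjust the coefficients applied to the next frame.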

As can be seen from the foregoing embodiment, the device wake-up method provided by this embodiment determines the wake-up word probability of the previous voice frame; adjusts the filter coefficient according to the wake-up word probability and the previous voice frame; performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame; and identifies the wake-up word according to the voice frame after the voice noise is suppressed, so as to adjust the device to the wake-up state. In this way, the beam forming filter coefficient is updated using the wake-up feedback of the previous voice frame, so that the target voice can be distinguished from voice noise and a reliable noise reduction effect is obtained; the wake-up performance of the device under highly noisy human voice interference can therefore be effectively improved.

Second embodiment

In the foregoing embodiment, a device wake-up method is provided; correspondingly, the present application also provides a device wake-up apparatus. The apparatus corresponds to the embodiment of the method described above.

Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts of the first embodiment. The device wake-up apparatus provided by the present application includes:

In one example, the wake word probability determination unit includes:

In one example, the filter coefficient adjustment unit includes:

Third embodiment

The application also provides a story machine. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The story machine of this embodiment includes a processor and a memory. The memory stores a program implementing the device wake-up method; after the story machine is powered on and runs the program through the processor, the following steps are performed: determining the wake-up word probability of the previous voice frame; adjusting the filter coefficient according to the wake-up word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame; and identifying the wake-up word according to the voice frame after the voice noise is suppressed, so as to adjust the story machine to the wake-up state.

Fourth embodiment

The application also provides an electronic device embodiment. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The electronic device of this embodiment includes a processor and a memory. The memory stores a program implementing the device wake-up method; after the device is powered on and runs the program through the processor, the following steps are performed: determining the wake-up word probability of the previous voice frame; adjusting the filter coefficient according to the wake-up word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame; and identifying the wake-up word according to the voice frame after the voice noise is suppressed, so as to adjust the device to the wake-up state.

Fifth embodiment

In the foregoing embodiment, a device wake-up method is provided, and correspondingly, the present application further provides a device wake-up system. The system corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.

The device wake-up system provided by the present application includes: a server and a terminal device.

The terminal device has a wake-up function and may be, for example, an intelligent story machine, a smart speaker, an intelligent repeater, or the like.

The terminal device is configured to: receive, from the server, the wake-up word probability of the previous voice frame collected by the terminal device, and adjust the filter coefficient according to the previous voice frame and the wake-up word probability; perform voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame; and, if the server identifies the wake-up word from the current voice frame after the voice noise is suppressed, adjust the terminal device to the wake-up state. Correspondingly, the server is configured to: receive the previous voice frame sent by the terminal device and determine its wake-up word probability; and receive the current voice frame, after voice noise suppression, sent by the terminal device and identify the wake-up word from it.

In this embodiment, the terminal device may collect voice data through multiple microphones (a microphone array), perform noise reduction on each voice frame through a filter, and send the processed voice frame to the server. The server identifies the wake-up word through the wake-up device and outputs the corresponding wake-up signal to an external module; it simultaneously outputs the corresponding acoustic signal to the wake-up VAD module, determines the wake-up word probability of the voice frame through the wake-up VAD module, and sends the probability back to the terminal device. The terminal device then determines the filter coefficient for the next voice frame according to the wake-up word probability and the voice frame, and performs noise reduction on the next collected voice frame according to the adjusted filter coefficient.

The system provided by this embodiment differs from the method provided by the first embodiment in that the server performs the identification of the wake-up word and the determination of the wake-up word probability. This further reduces the performance requirements on the terminal device and allows a wake-up device that the server updates in real time to be used; the hardware cost of the terminal device can therefore be effectively reduced, and the wake-up word identification accuracy and the wake-up performance of the device can be effectively improved.

As can be seen from the foregoing embodiments, in the device wake-up system provided in the embodiments of the present application, a voice frame is sent to a server through a terminal device, and the server determines a probability of a wake-up word of a previous voice frame; the terminal equipment adjusts a filter coefficient according to the awakening word probability and the previous voice frame; performing voice enhancement processing on the current voice frame according to the adjusted filtering coefficient so as to suppress the voice noise except the target sound source in the current voice frame, and sending the current voice frame with the voice noise suppressed to the server; the server side identifies the awakening words according to the current voice frame after inhibiting the voice noise, and if the server side identifies the awakening words, the terminal equipment is adjusted to an awakening state; by the processing mode, the beam forming filter coefficient is updated by combining the awakening feedback of the previous voice frame, so that the target voice and the voice noise can be distinguished, and a reliable noise reduction effect is obtained; therefore, the awakening performance of the equipment under the high noisy human voice interference can be effectively improved.
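The division of work between terminal device and server in this embodiment can be sketched as follows. Both classes, the reply dictionary, and the placeholder arithmetic that stands in for beam forming and FSMN scoring are hypothetical; a direct method call replaces the network round trip.

```python
class StubServer:
    """Hypothetical server: scores enhanced frames and spots the wake-up word."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def handle(self, enhanced_frame):
        # Placeholder for the wake-up device and wake-up VAD module.
        prob = min(1.0, abs(enhanced_frame))
        return {"wake_word_probability": prob,
                "wake_word_identified": prob >= self.threshold}


class Terminal:
    """Hypothetical terminal: filters locally, delegates recognition to the server."""

    def __init__(self, server, coeff=1.0):
        self.server = server
        self.coeff = coeff
        self.awake = False
        self._prev = None  # (frame, probability) of the previous frame

    def on_frame(self, frame):
        if self._prev is not None:
            prev_frame, prev_prob = self._prev
            # Placeholder for the beam forming update driven by server feedback.
            self.coeff += prev_prob * abs(prev_frame)
        enhanced = self.coeff * frame          # local voice enhancement
        reply = self.server.handle(enhanced)   # stands in for a network round trip
        self._prev = (frame, reply["wake_word_probability"])
        if reply["wake_word_identified"]:
            self.awake = True
        return reply
```

Only the enhanced frame travels to the server and only a probability and a flag travel back, which is what keeps the terminal-side computation light.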

Sixth embodiment

In the foregoing embodiments, a device wake-up system is provided, and correspondingly, the present application also provides a device wake-up method, where an execution subject of the method includes, but is not limited to, a terminal device. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the fifth embodiment are not described again, please refer to corresponding parts in the fifth embodiment.

The device awakening method provided by the application comprises the following steps:

Step 1: receiving, from the server, the wake-up word probability of the previous voice frame collected by the terminal device;

Step 2: adjusting the filter coefficient according to the previous voice frame and the wake-up word probability;

Step 3: performing voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame;

Step 4: if the server identifies the wake-up word according to the current voice frame after the voice noise is suppressed, adjusting the terminal device to the wake-up state.

Seventh embodiment

In the foregoing embodiments, a device wake-up system is provided, and correspondingly, the present application also provides a device wake-up method, where an execution subject of the method includes but is not limited to a server, and may also be other devices capable of implementing the method. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the fifth embodiment are not described again, please refer to corresponding parts in the fifth embodiment.

Step 1: determining the wake-up word probability of the previous voice frame collected by the terminal device and returning it to the terminal device, so that the terminal device adjusts the filter coefficient according to the previous voice frame and the wake-up word probability and performs voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame;

Step 2: identifying the wake-up word according to the current voice frame, sent by the terminal device, in which the voice noise has been suppressed;

Step 3: notifying the terminal device that the wake-up word has been identified, so that the terminal device is adjusted to the wake-up state.

Eighth embodiment

In the foregoing embodiment, a device wake-up method is provided; correspondingly, the present application also provides a voice conference summary method. The background of this method is explained first.

A typical recording scenario is a multi-person conference held in a conference room: while a speaker is speaking, a participant or a conference assistant can start the recording function of a terminal device by voice at any time to record the voice content of the conference. A terminal device usually carries several services; for example, a smart speaker may carry a recording service, a song-ordering service, an IoT device control service, and so on. Each service has a corresponding start word; for example, the start word of the recording service is "Tianmaoling, start recording", and the start word of the song-ordering service is "I want to order songs". The terminal device responds to the user's voice command and starts the recording service as follows: it recognizes the start word of the recording service in the user's voice and, once the start word is recognized, automatically starts the recording service.

Under highly noisy human voice interference in the conference venue, to ensure that the recording service starts accurately, the terminal device generally uses microphone array signal processing to enhance the voice (the target sound source) that issues the recording service start command and to suppress the ambient noise and voice noise in the venue, so that the recording service start word can be identified accurately. In the prior art, a typical microphone array signal processing technique is beam forming, which mainly relies on voice activity detection (VAD) to distinguish noise from the target sound source and provide information input.

However, in the course of implementing the invention, the inventors found that this technical solution has at least the following problem: conventional VAD fails in the presence of voice noise, which severely degrades the recording service start performance of the terminal device under voice interference. In summary, how to improve the beam forming scheme so as to improve the recording service start performance of the terminal device under highly noisy human voice interference has become a technical problem urgently needing to be solved by those skilled in the art.

In order to solve the problem, the application also provides a voice conference summary method. The execution subject of the method includes but is not limited to terminal devices, such as a smart sound box, a smart television and the like. The method corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.

The application provides a voice conference summary method, which comprises the following steps:

Step 1: determining the recording service start word probability of the previous voice frame.

The recording service start word probability may be the probability that a voice frame includes an acoustic unit related to the recording service start word. A recording service start word (for example, "Tianmaoling, start recording") usually spans multiple voice frames, and each voice frame may include only acoustic units related to the recording service start word, only acoustic units unrelated to it, or both.

In the method provided by the embodiment of the present application, when the noise reduction processing is performed on the current voice frame, the filter coefficient is related to the probability of the recording service start word of the previous voice frame, that is, the filter coefficient of the current voice frame is determined by combining the probability of the recording service start word of the previous voice frame.

In one example, step 1 may comprise the following sub-steps:

Step 1-1: performing voice enhancement processing on the previous voice frame according to the filter coefficient before adjustment.

According to the method provided by the embodiment of the application, when the noise reduction processing is performed on the current voice frame, the filter coefficient is related to the probability of the starting word of the recording service of the previous voice frame, so that the filter coefficients of different voice frames may not be the same. The filter coefficient of the last speech frame is called the filter coefficient before adjustment, and the filter coefficient of the current speech frame is called the filter coefficient after adjustment. The filter coefficient before adjustment may be determined according to a probability of a recording service initiation word of a speech frame preceding the previous speech frame.

In particular, the speech enhancement process may be performed on the last speech frame by a filter. The filter converts the signal of the last speech frame into an enhanced signal through a filtering operation.

Step 1-3: determining the acoustic probability of the acoustic units related to the recording service start word in the previous voice frame after voice enhancement.

The filter converts the signal of the last voice frame into an enhanced signal through filtering operation, the enhanced signal can be sent to the service starter, and the service starter determines the acoustic probability of the acoustic unit related to the recording service starting word in the last voice frame after voice enhancement.

With the method provided by this embodiment, when the service starter identifies the recording service start word, it can not only output the corresponding recording service start signal to an external module but also output the corresponding acoustic signal to the service start VAD module. A voice frame may include acoustic units related to the recording service start word as well as acoustic units unrelated to it. In this embodiment, the service starter may output the acoustic probability of each acoustic unit to the service start VAD module.

In this embodiment, a service starter based on a feedforward sequential memory network (FSMN) is employed. The audio signal passes through feature extraction and is fed into the FSMN-based service start model, which outputs the probability of each acoustic unit. Here, the acoustic units may be divided into units unrelated to the recording service start word and units related to it.

Step 1-5: determining the recording service start word probability according to the acoustic probabilities of the acoustic units related to the recording service start word.

After the service starter outputs the acoustic probability of each acoustic unit to the service start VAD module, the service start VAD module can determine the probability of the recording service start word according to the acoustic probability of the acoustic unit related to the recording service start word.

In this embodiment, the maximum value of the acoustic probabilities in the acoustic units related to all the recording service start words is used as the recording service start word probability.

After the probability of the recording service starting word of the previous voice frame is determined, the next step can be carried out, and the filtering coefficient is adjusted by combining the recording service starting feedback.

Step 2: adjusting the filter coefficient according to the start word probability and the previous voice frame.

In the method provided by this embodiment, the beam forming module determines the filter coefficient of the current data frame according to the previous voice frame (a multi-microphone signal) and its recording service start word probability. That is, the filter coefficient of the current data frame is related to the recording service start word information of the previous voice frame. By contrast, a conventional beam forming module performs voice activity detection on the current data frame through a conventional VAD module and determines the filter coefficient of the current data frame from the voice activity detection result and the multi-microphone signal of the current data frame. That coefficient is unrelated to the previous data frame, let alone to the recording service start word information of the previous data frame, and the concept of a recording service start word probability does not exist.

In this embodiment, step 2 may include the following sub-steps:

Step 2-1: determining the target covariance matrix and the noise covariance matrix, using the recording service start word probability as the weight of the previous voice frame.

Step 2-3: determining the adjusted filter coefficient from the target covariance matrix and the noise covariance matrix through a beam forming algorithm.

In one example, when the target covariance matrix and the noise covariance matrix are calculated, the previous voice frame is weighted by its recording service start word probability, and the beam coefficients are obtained through the PMWF formula, so that the output energy of the target sound source is enhanced.

In one example, the method may further include the following steps: storing each voice frame in turn in a buffer queue according to its acquisition time; and reading the previous voice frame from the buffer queue according to the processing time needed to determine the acoustic probability. With this processing mode, the multi-microphone signal is aligned in time, through the first-in first-out queue, with the recording service start word probability output by the service start VAD module, and the two are then input into the beam forming module together. Because the recording service start word probability is calculated immediately after the acoustic units are output, its output delay equals the delay of the recording service start word model; this delay can be controlled by the FSMN model and can be kept relatively low.

Step 3: performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame.

Step 4: identifying the start word according to the voice frame after the voice noise is suppressed, so as to start the recording service and record the voice conference summary.

After the voice noise other than the target sound source in the current voice frame is suppressed, the service starter can identify the recording service start word from the voice frame after voice noise suppression, and the recording service function of the terminal device can be started once the start word is identified.

In one example, the conference speaker usually concentrates on speaking and is unlikely to issue the voice command that starts the recording service; that command is usually issued by a conference assistant or another participant. The voice of the conference speaker can therefore be filtered out as voice noise with respect to starting the recording service. Accordingly, the method may further include the following step: filtering out sound sources whose speaking duration is greater than a duration threshold. This processing mode can effectively improve the speed and accuracy of starting the recording service.

The duration threshold may be determined according to actual requirements. In specific implementation, the speaking duration of each sound source can be determined, for example, the speaking duration of each voiceprint is recorded; then, the sound source with the speaking time length larger than the time length threshold value is used as the sound source of the voice noise, and the sound source is filtered.
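A minimal sketch of the duration-based filtering described above follows. It assumes that some upstream component (not shown) already labels speech segments with a voiceprint identifier and start and end times in seconds; the function names and the example values are illustrative only.

```python
def accumulate_durations(segments):
    """Sum speaking time per voiceprint from (voiceprint_id, start_s, end_s) tuples."""
    totals = {}
    for voiceprint_id, start_s, end_s in segments:
        totals[voiceprint_id] = totals.get(voiceprint_id, 0.0) + (end_s - start_s)
    return totals


def noise_sources(durations, threshold_s):
    """Voiceprints whose speaking duration exceeds the threshold are treated as noise."""
    return {vp for vp, total in durations.items() if total > threshold_s}


# Hypothetical meeting: one long-talking speaker and a brief command from an assistant.
segments = [("speaker", 0.0, 300.0), ("assistant", 300.0, 303.0),
            ("speaker", 303.0, 600.0)]
durations = accumulate_durations(segments)
to_filter = noise_sources(durations, threshold_s=60.0)
```

Here only the long-talking voiceprint is marked for filtering, so the assistant's short start command remains a candidate target sound source.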

As can be seen from the foregoing embodiment, the voice conference summary method provided by this embodiment determines the recording service start word probability of the previous voice frame; adjusts the filter coefficient according to the start word probability and the previous voice frame; performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient, so as to suppress the voice noise other than the target sound source in the current voice frame; and identifies the start word according to the voice frame after the voice noise is suppressed, so as to start the recording service and record the voice conference summary. In this way, the beam forming filter coefficient is updated using the recording service start feedback of the previous voice frame, so that the target voice can be distinguished from voice noise and a reliable recording service start effect is obtained; the recording service start performance of the device under highly noisy human voice interference can therefore be effectively improved.

Ninth embodiment

In the foregoing embodiment, a method for summarizing a voice conference is provided, and correspondingly, a system for summarizing a voice conference is also provided. The system corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.

The voice conference summary system provided by the present application includes: a server and a terminal device.

The terminal equipment is equipment with a recording service function, such as an intelligent sound box.

The terminal device is used for receiving, from the server, the recording service start word probability of the previous voice frame collected by the terminal device, and adjusting the filter coefficient according to the previous voice frame and the start word probability; performing voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame; and, if the server recognizes the start word according to the current voice frame after the voice noise is suppressed, starting the recording service to record the voice conference summary. Correspondingly, the server is used for receiving the previous voice frame sent by the terminal device and determining the start word probability of the previous voice frame; and for receiving the noise-suppressed current voice frame sent by the terminal device and recognizing the start word according to that frame.

In this embodiment, the terminal device may collect voice data through a plurality of microphones (a microphone array), perform noise reduction processing on each voice frame through a filter, and send the processed voice frame to the server. The server recognizes the recording service start word through the service starter and outputs a corresponding recording service start signal to the external module; at the same time, it outputs a corresponding acoustic signal to the service start VAD module, determines the recording service start word probability of the voice frame through the service start VAD module, and sends the probability back to the terminal device. The terminal device determines the filter coefficient for the next voice frame according to the recording service start word probability and the voice frame, and performs noise reduction processing on the collected next voice frame according to the adjusted filter coefficient.
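The division of work between the terminal device and the server can be sketched as follows. This is only an illustrative in-process model: the class names `Terminal`, `Server` and `ServerReply`, the callables `adjust_fn`, `enhance_fn` and `score_fn`, and the 0.5 decision threshold are all hypothetical, and a real system would exchange these messages over a network and use an actual beamforming filter and acoustic model.

```python
from dataclasses import dataclass


@dataclass
class ServerReply:
    start_word_prob: float  # fed back to the terminal for the next frame
    recognized: bool        # True when the start word is recognized


class Server:
    def __init__(self, score_fn, threshold=0.5):
        self.score_fn = score_fn      # stand-in for the start word model
        self.threshold = threshold    # assumed decision threshold

    def process(self, enhanced_frame):
        prob = self.score_fn(enhanced_frame)
        return ServerReply(prob, prob >= self.threshold)


class Terminal:
    def __init__(self, server, adjust_fn, enhance_fn, coeffs):
        self.server = server
        self.adjust_fn = adjust_fn    # updates the filter coefficients
        self.enhance_fn = enhance_fn  # applies the filter to a frame
        self.coeffs = coeffs
        self.prev = None              # (previous frame, its probability)
        self.recording = False

    def on_frame(self, frame):
        # Adjust the coefficients from the previous frame and its feedback.
        if self.prev is not None:
            prev_frame, prev_prob = self.prev
            self.coeffs = self.adjust_fn(self.coeffs, prev_frame, prev_prob)
        # Enhance the current frame and send it to the server.
        enhanced = self.enhance_fn(self.coeffs, frame)
        reply = self.server.process(enhanced)
        self.prev = (frame, reply.start_word_prob)
        if reply.recognized:
            self.recording = True     # start the recording service
        return reply


# Toy stand-ins: frames are labels and the "model" is a lookup table.
scores = {"babble": 0.1, "start word": 0.9}
adjust_calls = []


def toy_adjust(coeffs, prev_frame, prev_prob):
    adjust_calls.append((prev_frame, prev_prob))
    return coeffs


terminal = Terminal(Server(scores.get), toy_adjust, lambda c, f: f, coeffs=None)
for frame in ["babble", "babble", "start word"]:
    terminal.on_frame(frame)
```

The point of the sketch is the message flow: the probability returned for one frame is used only when the terminal adjusts its coefficients for the following frame.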

It can be seen that the system provided by this embodiment differs from the method provided by the eighth embodiment in that the server performs the processing of recognizing the recording service start word and determining the recording service start word probability, so that the performance requirement on the terminal device can be further reduced and a service starter updated by the server in real time can be used; therefore, the hardware cost of the terminal device can be effectively reduced, the recognition accuracy of the recording service start word can be effectively improved, and the recording service start performance of the device can be effectively improved.

As can be seen from the foregoing embodiments, in the voice conference summary system provided in the embodiment of the present application, the terminal device sends the voice frame to the server, and the server determines the recording service start word probability of the previous voice frame; the terminal device adjusts a filter coefficient according to the recording service start word probability and the previous voice frame, performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame, and sends the noise-suppressed current voice frame to the server; the server recognizes the recording service start word according to the noise-suppressed current voice frame, and if the server recognizes the start word, the terminal device starts the recording service to record the voice conference summary. This processing mode updates the beamforming filter coefficient in combination with the recording service start feedback of the previous voice frame, so that the target voice and the voice noise can be distinguished and a reliable noise reduction effect is obtained; therefore, the recording service start performance of the device under highly noisy human voice interference can be effectively improved.

Tenth embodiment

In the foregoing embodiment, a voice conference summary system is provided; correspondingly, the present application also provides a voice conference summary method, where an execution subject of the method includes but is not limited to a terminal device. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the ninth embodiment are not described again; please refer to the corresponding parts in the ninth embodiment.

step 1: receiving, from a server, the recording service start word probability of the previous voice frame collected by the terminal device;

step 2: adjusting a filter coefficient according to the previous voice frame and the start word probability;

step 4: if the server recognizes the start word according to the current voice frame after the voice noise is suppressed, starting the recording service to record the voice conference summary.

Eleventh embodiment

In the foregoing embodiment, a voice conference summary system is provided; correspondingly, the present application also provides a voice conference summary method, where an execution subject of the method includes but is not limited to a server. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the ninth embodiment are not described again; please refer to the corresponding parts in the ninth embodiment.

step 1: determining the recording service start word probability of the previous voice frame according to the previous voice frame collected by the terminal device, and returning the start word probability to the terminal device, so that the terminal device adjusts the filter coefficient according to the previous voice frame and the start word probability, and performs voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame;

step 2: recognizing the start word according to the noise-suppressed current voice frame sent by the terminal device;

step 3: notifying the terminal device that the start word has been recognized, so that the terminal device starts the recording service to record the voice conference summary.

Twelfth embodiment

In the foregoing embodiment, a voice conference summary method is provided; correspondingly, the present application also provides a service starting method, where an execution subject of the method includes but is not limited to a terminal device, such as a smart speaker. The method corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the eighth embodiment are not described again; please refer to the corresponding parts in the eighth embodiment.

The service starting method provided by the application comprises the following steps:

step 1: determining the target service start word probability of the previous voice frame;

step 2: adjusting a filter coefficient according to the start word probability and the previous voice frame;

step 3: performing voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame;

step 4: recognizing the start word according to the voice frame after the voice noise is suppressed, so as to start the target service.
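The feedback idea in the steps above can be sketched as a probability-weighted beamformer update. This is not the filter defined by the application; it is a minimal illustration that swaps in a standard MVDR beamformer, under the following assumptions: each frame is a single narrowband snapshot across the microphones (a complex vector), the start word probability acts as a soft mask that routes the previous frame into either a target or a noise spatial covariance estimate, and the steering vector is taken as the principal eigenvector of the target covariance. The class name `FeedbackBeamformer`, the step size, the diagonal loading and the oracle probabilities used in the demo are all hypothetical.

```python
import numpy as np


class FeedbackBeamformer:
    def __init__(self, num_mics, step=0.05, diag_load=1e-3):
        eye = np.eye(num_mics, dtype=complex)
        self.target_cov = diag_load * eye
        self.noise_cov = diag_load * eye
        self.step = step
        self.diag_load = diag_load
        self.weights = np.zeros(num_mics, dtype=complex)
        self.weights[0] = 1.0  # pass microphone 0 through until adapted

    def adjust(self, prev_frame, prob):
        """Steps 1-2: update the coefficients from the previous frame and its probability."""
        outer = np.outer(prev_frame, prev_frame.conj())
        a = self.step * prob          # weight toward the target statistics
        b = self.step * (1.0 - prob)  # weight toward the noise statistics
        self.target_cov = (1 - a) * self.target_cov + a * outer
        self.noise_cov = (1 - b) * self.noise_cov + b * outer
        # Steering vector: principal eigenvector of the target covariance.
        _, vecs = np.linalg.eigh(self.target_cov)
        steer = vecs[:, -1]
        loaded = self.noise_cov + self.diag_load * np.eye(len(steer))
        num = np.linalg.solve(loaded, steer)
        self.weights = num / (steer.conj() @ num)  # MVDR weights

    def enhance(self, frame):
        """Step 3: apply the adjusted coefficients to the current frame."""
        return self.weights.conj() @ frame


# Synthetic demo: target at broadside, interfering talker at 60 degrees.
rng = np.random.default_rng(0)
mics = np.arange(4)
d_target = np.exp(-1j * np.pi * mics * np.sin(0.0))
d_interf = np.exp(-1j * np.pi * mics * np.sin(np.pi / 3))


def snapshot(target_amp):
    s = target_amp * (rng.standard_normal() + 1j * rng.standard_normal())
    n = rng.standard_normal() + 1j * rng.standard_normal()
    v = 0.1 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
    return s * d_target + n * d_interf + v


bf = FeedbackBeamformer(num_mics=4)
for t in range(600):
    active = (t // 50) % 2 == 1
    frame = snapshot(2.0 if active else 0.0)
    # Oracle probability; a real system would use the start word model output.
    bf.adjust(frame, 1.0 if active else 0.0)

target_gain = abs(bf.weights.conj() @ d_target)
interf_gain = abs(bf.weights.conj() @ d_interf)
```

After adaptation, `interf_gain` should be much smaller than `target_gain`, which is the sense in which the feedback lets the filter separate the target voice from interfering voices. Step 4 (recognizing the start word on the output of `enhance`) is left out because it depends on the acoustic model.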

As can be seen from the foregoing embodiments, the service initiation method provided in the embodiments of the present application determines the target service start word probability of the previous voice frame; adjusts a filter coefficient according to the start word probability and the previous voice frame; performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame; and recognizes the start word according to the voice frame after the voice noise is suppressed, so as to start the target service. This processing mode updates the beamforming filter coefficient in combination with the service start feedback of the previous voice frame, so that the target voice and the voice noise can be distinguished and a reliable noise reduction effect is obtained; therefore, the service start performance of the device under highly noisy human voice interference can be effectively improved.

Thirteenth embodiment

In the foregoing embodiment, a service initiation method is provided; correspondingly, the application further provides a service initiation system. The system corresponds to the embodiment of the method described above. Parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts in the first embodiment.

The service initiation system provided by the application includes a server and a terminal device.

The terminal device is a device with a wake-up function, such as a smart speaker.

The terminal device is used for receiving, from the server, the target service start word probability of the previous voice frame collected by the terminal device, and adjusting the filter coefficient according to the previous voice frame and the start word probability; performing voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame; and, if the server recognizes the start word according to the current voice frame after the voice noise is suppressed, starting the target service. Correspondingly, the server is used for receiving the previous voice frame sent by the terminal device and determining the start word probability of the previous voice frame; and for receiving the noise-suppressed current voice frame sent by the terminal device and recognizing the start word according to that frame.

In this embodiment, the terminal device may collect voice data through a plurality of microphones (a microphone array), perform noise reduction processing on each voice frame through a filter, and send the processed voice frame to the server. The server recognizes the target service start word through the service starter and outputs a corresponding wake-up signal to the external module; at the same time, it outputs a corresponding acoustic signal to the service start VAD module, determines the target service start word probability of the voice frame through the service start VAD module, and sends the probability back to the terminal device. The terminal device determines the filter coefficient for the next voice frame according to the target service start word probability and the voice frame, and performs noise reduction processing on the collected next voice frame according to the adjusted filter coefficient.

It can be seen that the system provided by this embodiment differs from the method provided by the twelfth embodiment in that the server performs the processing of recognizing the service start word and determining the service start word probability, so that the performance requirement on the terminal device can be further reduced and a service starter updated by the server in real time can be used; therefore, the hardware cost of the terminal device can be effectively reduced, the recognition accuracy of the service start word can be effectively improved, and the service start performance of the device can be effectively improved.

As can be seen from the foregoing embodiments, in the service initiation system provided in the embodiments of the present application, the terminal device sends the voice frame to the server, and the server determines the target service start word probability of the previous voice frame; the terminal device adjusts a filter coefficient according to the target service start word probability and the previous voice frame, performs voice enhancement processing on the current voice frame according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame, and sends the noise-suppressed current voice frame to the server; the server recognizes the target service start word according to the noise-suppressed current voice frame, and if the server recognizes the start word, the target service is started. This processing mode updates the beamforming filter coefficient in combination with the wake-up feedback of the previous voice frame, so that the target voice and the voice noise can be distinguished and a reliable noise reduction effect is obtained; therefore, the service start performance of the device under highly noisy human voice interference can be effectively improved.

Fourteenth embodiment

In the foregoing embodiment, a service initiation system is provided; correspondingly, the present application also provides a service initiation method, where an execution subject of the method includes, but is not limited to, a terminal device. The method corresponds to the embodiment of the system described above. Parts of this embodiment that are the same as the thirteenth embodiment are not described again; please refer to the corresponding parts in the thirteenth embodiment.

step 1: receiving, from a server, the target service start word probability of the previous voice frame collected by the terminal device;

step 4: if the server recognizes the start word according to the current voice frame after the voice noise is suppressed, starting the target service.

Fifteenth embodiment

step 1: determining the target service start word probability of the previous voice frame according to the previous voice frame collected by the terminal device, and returning the start word probability to the terminal device, so that the terminal device adjusts the filter coefficient according to the previous voice frame and the start word probability, and performs voice enhancement processing on the current voice frame collected by the terminal device according to the adjusted filter coefficient so as to suppress voice noise other than the target sound source in the current voice frame;

step 3: notifying the terminal device that the start word has been recognized, so that the terminal device starts the target service.

Although the present application has been described with reference to the preferred embodiments, these embodiments are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims that follow.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. A device wake-up method, comprising:

determining the wake word probability of the previous speech frame;

2. The method of claim 1, wherein the determining the probability of the wake word for the last speech frame comprises:

3. The method of claim 2, wherein determining the wake word probability according to the acoustic probability of the acoustic unit associated with the wake word comprises:

and taking the maximum acoustic probability as the wake word probability.

4. The method of claim 2, further comprising:

5. The method of claim 1, wherein adjusting the filter coefficient according to the probability of the wake-up word and the previous speech frame comprises:

6. The method of claim 1,

the last speech frame comprises: a speech frame adjacent to the current speech frame, or a speech frame not adjacent to the current speech frame.

7. The method of claim 1,

the speech frames include speech frames acquired by a plurality of microphones.

8. A device wake-up apparatus, comprising:

9. The apparatus of claim 8, wherein the wakeup word probability determination unit comprises:

10. The apparatus of claim 8, wherein the filter coefficient adjusting unit comprises:

11. A story machine, comprising:

a processor; and

12. An electronic device, comprising:

a processor; and

13. A voice conference summary method, comprising:

14. The method of claim 13, further comprising:

15. A voice conference summary apparatus, comprising:

16. An intelligent sound box, comprising:

a processor; and

17. An electronic device, comprising:

a processor; and

18. A service initiation method, comprising:

19. A service initiation apparatus, comprising:

20. An intelligent sound box, comprising:

a processor; and

21. An electronic device, comprising:

a processor; and

22. A device wake-up system, comprising:

23. A device wake-up method, comprising:

24. A device wake-up method, comprising:

25. A voice conference summary system, comprising:

26. A voice conference summary method, comprising:

27. A voice conference summary method, comprising:

28. A service initiation system, comprising:

29. A service initiation method, comprising:

30. A service initiation method, comprising: