CN114758651A - Voice signal processing method and device

Voice signal processing method and device

Info

Publication number
CN114758651A
CN114758651A
Authority
CN
China
Prior art keywords
text information
voice
voice signal
speech
processor
Prior art date
Legal status
Pending
Application number
CN202011589178.6A
Other languages
Chinese (zh)
Inventor
黄龙
王翃宇
李勇
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011589178.6A
Publication of CN114758651A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

The application provides a voice signal processing method and device, relating to the technical field of speech recognition. The method comprises the following steps: according to first text information obtained by converting a plurality of received continuous voice signals, querying whether second text information matching the first text information exists among a plurality of acquired pieces of text information; and stopping receiving voice signals when second text information matching the first text information exists. Based on this scheme, the point in time at which to stop receiving voice signals can be determined quickly by text information matching. Moreover, as the user's voice inputs accumulate, more and more text information is recorded, so subsequent query matching succeeds more often and runs faster; the decision to stop receiving voice signals is therefore made sooner, which improves the overall response speed of the voice service and the user experience.

Description

Voice signal processing method and device
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method and an apparatus for processing a speech signal.
Background
In the field of speech recognition, after a voice signal input by a user is received, the voice signal needs to be converted into text information to facilitate subsequent operations.
Quickly determining when to stop receiving the voice signal while it is being received is very important: the sooner the device determines that input has ended, the faster the overall response of the voice service and the better the user experience.
Disclosure of Invention
The application provides a voice signal processing method and device, used to improve the efficiency of voice signal processing and thereby reduce the response delay of the voice service.
In a first aspect, an embodiment of the present application provides a voice signal processing method, including: according to first text information obtained by converting a plurality of received continuous voice signals, querying whether second text information matching the first text information exists among the acquired pieces of text information; and stopping receiving voice signals when second text information matching the first text information exists.
Based on this scheme, the point in time at which to stop receiving voice signals can be determined quickly by the text information matching method. Moreover, as the user's voice inputs accumulate, more and more text information is recorded, so subsequent query matching succeeds more often and runs faster; the decision to stop receiving voice signals is therefore made sooner, which improves the overall response speed of the voice service and the user experience.
In a possible implementation method, when second text information matching the first text information does not exist or has not yet been found, and voice activity detection performed on the plurality of voice signals detects a voice endpoint, reception of voice signals is stopped.
Based on this scheme, whether to stop receiving voice signals is judged by the text information query matching method and the voice activity detection method simultaneously; as soon as either method satisfies its stopping condition, reception of voice signals is stopped. Compared with using only the query matching method or only the voice activity detection method, this determines the stopping point more quickly. Moreover, as the user's voice inputs accumulate, more and more text information is recorded, subsequent query matching succeeds more often and runs faster, and the decision to stop receiving voice signals is made even sooner, which improves the overall response speed of the voice service, the efficiency with which the device executes services, and the user experience.
In a possible implementation method, the acquired pieces of text information include text information corresponding to pluralities of voice signals that meet a first condition, where the first condition includes one or more of the following: the plurality of voice signals were successfully executed; the number of times the plurality of voice signals were successfully executed satisfies a predetermined condition.
In a possible implementation method, the obtained pieces of text information include preset pieces of text information.
In a second aspect, an embodiment of the present application provides a speech signal processing apparatus having functions for implementing the implementation methods of the first aspect. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In a third aspect, an embodiment of the present application provides a speech signal processing apparatus, including a processor and a memory; the memory is used for storing computer-executable instructions, and when the apparatus is operated, the processor executes the computer-executable instructions stored in the memory, so that the apparatus executes the implementation methods as described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a speech signal processing apparatus, which includes means or units (means) for performing the steps of the implementation methods of the first aspect.
In a fifth aspect, an embodiment of the present application provides a speech signal processing apparatus, including a processor and an interface circuit, where the processor is configured to communicate with other apparatuses through the interface circuit and execute the implementation methods of the first aspect. There may be one or more processors.
In a sixth aspect, an embodiment of the present application provides a speech signal processing apparatus, including a processor connected to a memory and configured to call a program stored in the memory to execute the implementation methods of the first aspect. The memory may be located within the apparatus or outside it. There may be one or more processors.
In a seventh aspect, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the instructions cause the implementation methods of the first aspect to be performed.
In an eighth aspect, the present application further provides a computer program product, where the computer program product includes a computer program, and when the computer program runs, the implementation methods of the first aspect are executed.
In a ninth aspect, an embodiment of the present application further provides a chip system, including a processor configured to perform the implementation methods of the first aspect.
Drawings
Fig. 1 is a schematic diagram of functional modules included in a voice service system;
fig. 2 is a schematic diagram of a speech signal processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a process of stopping receiving a voice signal;
FIG. 4 is a schematic diagram of a speech signal processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process of stopping receiving a voice signal;
fig. 6 is a schematic diagram of a speech signal processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic diagram of a speech signal processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a mobile phone.
Detailed Description
The voice service refers to a service that produces a response to voice input. For example, for the voice input "what's the weather today", the response is the weather query result.
Fig. 1 is a schematic diagram of functional modules included in a voice service system. The voice service system includes: an Automatic Speech Recognition (ASR) function module, a Natural Language Understanding (NLU) function module, a Dialogue Management (DM) function module, a Natural Language Generation (NLG) function module, and a Text To Speech (TTS) function module. For convenience of description, the ASR function module, the NLU function module, the DM function module, the NLG function module, and the TTS function module are hereinafter referred to as ASR, NLU, DM, NLG, and TTS, respectively.
These five components are explained below.
First, speech recognition (ASR)
ASR is responsible for converting an input speech signal into text information, playing the role of the human ear. The basic flow of speech recognition is "input - encoding (feature extraction) - decoding - output".
Technologies related to ASR mainly include:
1) Voice activity detection (VAD)
Voice activity detection may also be referred to as voice activity detection or silence detection, etc.
In a far-field recognition scenario, the user cannot touch the device by hand, background noise is strong, and the signal-to-noise ratio drops sharply; put simply, the signal is unclear, so VAD must be used. Its function is to determine when there is voice signal input and when there is none (i.e., silence); subsequent voice signal processing or speech recognition is performed on the valid speech segment intercepted by VAD. That is, VAD is mainly used to detect whether the user has finished voice signal input.
VAD mainly includes voice VAD and semantic VAD. Voice VAD means that reception of voice signals is stopped (also referred to as stopping reception) when no voice signal is input within a set time period. Semantic VAD means that reception of voice signals is stopped when the text information converted from the input voice signals is determined to have complete semantics. A voice-VAD method needs about 700 ms of detection time, a semantic-VAD method needs 500-3000 ms of detection time, and a combination of the two needs 500-700 ms of processing time.
2) Voice wake-up (Voice trigger, VT)
Also in far-field recognition, voice wake-up is required after VAD detects speech; this is equivalent to calling the device's name to attract its attention, like the wake words used by smart speakers such as Baidu's Xiaodu or the Tmall Genie. Its function is to judge whether a wake word was spoken and to trigger subsequent speech recognition.
3) Microphone array
This is a system that samples and processes the spatial characteristics of a sound field, consisting of a number of acoustic sensors (typically microphones). It serves several purposes: speech enhancement, extracting clean speech from a speech signal containing noise; sound source localization, using the microphone array to compute the angle and distance of the target speaker, enabling tracking of the speaker and subsequent directional voice pickup; dereverberation, reducing the influence of reflected sound; and sound source extraction/separation, extracting each target signal from a mixture of several sounds. Microphone arrays are mainly suited to complex environments with heavy noise, reverberation, and echo, such as outdoors or in supermarkets.
Second, Natural Language Understanding (NLU)
NLU converts the recognized text into semantic information that a machine can understand. Common NLU steps include:
1. Obtaining a corpus: corpus data is the object of linguistic study and also the basic unit that makes up a corpus. In practice, people simply use text, treating the context within the text as a substitute for the context of language in the real world. There are two ways to obtain a corpus: use the corpus the company has already accumulated, or obtain one externally, for example by downloading a public corpus such as the People's Daily corpus or by crawling text from the web;
2. Preprocessing: after the data is obtained, some preprocessing is performed, such as removing noisy data, removing stop words, word segmentation, and part-of-speech tagging. Noisy data, such as abnormal symbols and stray spaces, is removed first. Stop words are words that occur in almost every text and are frequent but carry little distinguishing power, such as "of"; for a classification task these words affect the results and may be removed first. Chinese text has no explicit markers such as spaces between words to identify word boundaries, so word segmentation is an important problem in Chinese natural language processing; its main difficulties are that segmentation standards are unsettled, segmentation can be ambiguous, and out-of-vocabulary words must be recognized;
3. Feature extraction: the words obtained after segmentation are represented in a form a computer can compute, such as vectors or matrices. Two representation models are common: the bag-of-words model and the word vector model (a short sketch of the bag-of-words model follows this list);
4. Model training: once feature vectors are selected, models are trained, with different models used for different application requirements.
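To make the bag-of-words representation from step 3 concrete, here is a minimal Python sketch; the toy corpus and the helper names (build_vocabulary, bag_of_words) are illustrative assumptions, not part of the patent's disclosure:

    def build_vocabulary(documents):
        """Map each distinct token to a column index."""
        vocab = {}
        for doc in documents:
            for token in doc.split():
                vocab.setdefault(token, len(vocab))
        return vocab

    def bag_of_words(document, vocab):
        """Represent a document as a vector of per-token counts."""
        vector = [0] * len(vocab)
        for token in document.split():
            if token in vocab:
                vector[vocab[token]] += 1
        return vector

    corpus = ["turn on bluetooth", "turn off bluetooth", "open the calendar"]
    vocab = build_vocabulary(corpus)
    print(bag_of_words("turn on bluetooth", vocab))  # [1, 1, 1, 0, 0, 0, 0]

In practice the word vector model (dense embeddings) often replaces these sparse count vectors, but the pipeline shape stays the same: text in, numeric features out.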
Third, Dialogue Management (DM)
Based on the state of the dialog, DM provides the corresponding service according to the semantic information. Dialog management controls the course of the man-machine dialog and determines what reaction should be made to the user based on the dialog history.
The most common application is task-driven multi-turn dialog: the user has a definite goal (such as querying an order), the requirements are complex with many constraints, and they may need to be stated over multiple turns. In essence, task-driven dialog management is a decision process: based on the current state, the system continually determines the optimal action to take next (providing results, asking about specific constraints, clarifying or confirming requirements, and so on), thereby most effectively helping the user complete the task of obtaining information or a service.
Fourth, Natural Language Generation (NLG)
NLG is responsible for generating natural language text from the service's information; specifically, it automatically generates readable natural language text from structured data. Natural language generation can be divided into three stages:
1. Text planning: planning the basic content from the structured data;
2. Sentence planning: assembling sentences from the structured data to express the information flow;
3. Realization: generating grammatically smooth sentences to express the text.
Fifth, speech synthesis (TTS)
TTS is responsible for turning natural language text into output speech. As the counterpart of ASR, TTS converts text information into voice that the machine reads out, corresponding to the human mouth. It is mainly divided into 3 modules: front-end processing, modeling, and the vocoder. TTS is implemented mainly by a concatenative method or a parametric method; both include a front-end module and differ mainly in the back-end acoustic modeling.
1. Concatenative method: the required basic units, such as syllables or phonemes, are selected from a large amount of pre-recorded speech and spliced together. Its advantage is high speech quality; its obvious drawback is the heavy demand it places on the recording database.
2. Parametric method: speech parameters for each moment (including the fundamental frequency, formant frequencies, and so on) are generated from a statistical model and then converted into a waveform. Its advantage is a comparatively small database requirement; its drawback is that quality is lower than with the concatenative method.
The scheme of the embodiments of the present application can be applied to the speech recognition function module of a communication device (such as a terminal device) to speed up the determination of when to stop receiving voice signals, thereby improving the overall response speed of the voice service and the user experience.
Referring to fig. 2, a speech signal processing method provided in this embodiment of the present application may be executed by a speech signal processing apparatus, where the speech signal processing apparatus may be a speech recognition function module or a subunit in the speech recognition function module in the speech service system shown in fig. 1. The method comprises the following steps:
step 201, receiving a voice signal input by a user.
The speech signal here refers to a signal corresponding to a single word (a single character in the original Chinese) of the user's voice input, such as "turn", "on", "blue", or "tooth".
Step 202, converting the voice signal into text information.
The text information here refers to the text converted from the corresponding voice signal, such as the words "turn", "on", "blue", or "tooth".
Step 203, according to the first text information obtained by converting the received multiple continuous voice signals, inquiring whether second text information matched with the first text information exists in the obtained multiple pieces of text information.
The first text information here refers to text composed of the pieces of text converted from a plurality of continuous voice signals, for example "turn", "turn on", "turn on blue", or "turn on Bluetooth".
As a first implementation method, the acquired pieces of text information include preset pieces of text information. For example, a plurality of pieces of text information may be preset in the communication device by a factory setting method. For another example, multiple pieces of text information may be downloaded from the cloud server by a cloud downloading method and stored in the communication device. For another example, a plurality of pieces of text information may be configured in the communication device by a method of manual input by the user. The embodiment of the present application does not limit the preset method.
As a second implementation method, the acquired pieces of text information include text information corresponding to pluralities of voice signals that meet a first condition, where the first condition includes one or more of the following:
1) the plurality of speech signals are successfully executed.
For example, if the user voice-input "call Zhang San" and the call to Zhang San was successfully dialed, the plurality of voice signals "call Zhang San" was successfully executed.
2) The number of times the plurality of voice signals are successfully executed satisfies a predetermined condition.
The predetermined condition here may be, for example, that the ratio of times the plurality of voice signals were successfully executed is greater than a first threshold. Illustratively, suppose the first threshold is 80% and the user has voice-input "call Zhang San" 10 times in total, of which 9 were successfully executed (the call to Zhang San was dialed) and 1 was not (for example, because the user hung up before dialing completed or the call was not answered). The success ratio for the voice signals "call Zhang San" is then 90%, which is greater than 80%, so the predetermined condition is satisfied.
As a third implementation method, the acquired pieces of text information include both preset text information and text information corresponding to pluralities of voice signals that meet a first condition, where the first condition includes one or more of the following: 1) the plurality of voice signals were successfully executed; 2) the number of times the plurality of voice signals were successfully executed satisfies a predetermined condition. That is, the third implementation method is a combination of the first and the second implementation methods.
It should be noted that the acquired pieces of text information are not limited to the above three implementation methods; other implementations may also be used in practice.
As an implementation method, a database may be established in the communication device to store the preset text information and/or the text information corresponding to voice signals that meet the first condition. Then, when the user inputs voice, the first text information obtained by converting the received plurality of continuous voice signals can be queried against the database to determine whether second text information matching the first text information can be found there.
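A minimal Python sketch of such a store and query follows, assuming an in-memory set holding entries like those in Table 1 below; the entries and the name query_match are illustrative choices, not the patent's actual implementation:

    # The stored text information; entries mirror Table 1 below.
    stored_text = {
        "turn on bluetooth",
        "call zhang san",
        "set an 8 o'clock alarm",
        "what's the weather today",
        "open the calendar",
    }

    def query_match(first_text):
        """Return the matching second text information, or None (exact match)."""
        return first_text if first_text in stored_text else None

A real implementation might index the phrases in a trie or prefix-indexed database so that a partial first text can be rejected or confirmed early, but a set lookup is enough to show the control flow.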
Step 204, stopping receiving voice signals when second text information matching the first text information exists.
As an implementation, a "match" herein may be a perfect match. That is, when a second text message identical to the first text message is obtained from the plurality of obtained text messages, it is determined that a second text message matching the first text message exists in the plurality of obtained text messages. For example, the first text information is "bluetooth on", and the second text information is "bluetooth on".
As another implementation, a "match" herein may be an inclusive match. That is, when a second text message including the first text message is obtained from the obtained plurality of text messages, it is determined that a second text message matching the first text message exists in the obtained plurality of text messages. For example, the first text information is "bluetooth on", and the second text information is "please bluetooth on". Or when a second text message contained by the first text message is acquired from the acquired plurality of text messages, determining that the second text message matched with the first text message exists in the acquired plurality of text messages. For example, the first text information is "please turn on bluetooth", and the second text information is "turn on bluetooth".
It should be noted that when no second text information matching the first text information exists, voice signals continue to be received until the first text information converted from the plurality of continuous voice signals input by the user matches one of the acquired pieces of text information, at which point reception of voice signals is stopped.
Based on this scheme, the point in time at which to stop receiving voice signals can be determined quickly by the text information matching method. Moreover, as the user's voice inputs accumulate, more and more text information is recorded, so subsequent query matching succeeds more often and runs faster; the decision to stop receiving voice signals is therefore made sooner, which improves the overall response speed of the voice service and the user experience.
The above-described scheme is explained below with reference to an example.
For example, the communication device stores several pieces of text information, as shown in Table 1 below. In a specific implementation, the text information may be stored in database form or table form; the embodiment of the present application does not limit the storage method.
The stored text information may include preset text information or text information satisfying the first condition.
TABLE 1
Turn on Bluetooth
Call Zhang San
Set an 8 o'clock alarm
What's the weather today
Open the calendar
……
It should be noted that the text information in Table 1 is dynamically updated. For example, if over time the number of times a certain piece of text information is successfully executed no longer satisfies the predetermined condition, that text information is marked invalid or deleted. Conversely, if the number of times a new piece of text information is successfully executed comes to satisfy the predetermined condition, the new text information is added to Table 1.
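A minimal sketch of this dynamic update, reusing the illustrative stored_text set from the earlier sketch; the per-phrase counters and the 80% threshold are assumptions echoing the earlier example, not values fixed by the patent:

    execution_stats = {}  # phrase -> (successful_runs, total_runs)

    def record_execution(phrase, succeeded, threshold=0.8):
        """Update counters, then add or remove the phrase from stored_text."""
        successes, total = execution_stats.get(phrase, (0, 0))
        successes += 1 if succeeded else 0
        total += 1
        execution_stats[phrase] = (successes, total)
        if successes / total > threshold:
            stored_text.add(phrase)      # newly reliable phrase starts matching
        else:
            stored_text.discard(phrase)  # unreliable phrase is removed/invalidated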
Fig. 3 is a schematic diagram of the process of stopping reception of voice signals. In this voice input, the user continuously speaks "turn on Bluetooth" (打开蓝牙), which arrives one character at a time, glossed below as "turn", "on", "blue", and "tooth". The exact matching method is used as the example.
Time T1: the ASR receives the speech signal "hit".
The voice signal is converted into the text information, the text information is matched with the text information (called as first text information) shown in the table 1, and the text information is matched with the text information (called as second text information) which is obtained.
Time T2: the ASR receives the speech signal "on".
The voice signal 'on' is converted into the text information 'on', and the text information 'on' (called as first text information) is matched with the acquired text information (called as second text information) shown in the table 1.
Time T3: the ASR receives the speech signal "blue".
The voice signal "blue" is converted into the text information "blue", and the text information "open blue" (referred to as first text information) is matched with the acquired text information (referred to as second text information) shown in table 1.
Time T4: the ASR receives the speech signal "teeth".
The voice signal tooth is converted into the character information tooth, the text information Bluetooth is turned on (called as first text information) and is matched with the acquired text information (called as second text information) shown in the table 1, and the text information Bluetooth is inquired and matched successfully because of the adoption of a complete matching method, and the voice signal is stopped being received.
Compared with the prior art, the embodiment shown in fig. 3 consumes no time or resources on VAD during the whole voice signal processing, which improves the efficiency of voice signal processing and reduces the response delay of the voice service.
Referring to fig. 4, another speech signal processing method provided by the embodiment of the present application may be executed by a speech signal processing apparatus, which may be the speech recognition function module in the speech service system shown in fig. 1 or a subunit within it. The method comprises the following steps:
step 401, receiving a voice signal input by a user.
The speech signal here refers to a signal corresponding to a single word (a single character in the original Chinese) of the user's voice input, such as "turn", "on", "blue", or "tooth".
Step 402, converting the voice signal into text information.
The text information here refers to the text converted from the corresponding voice signal, such as the words "turn", "on", "blue", or "tooth".
Step 403, according to first text information obtained by converting the received plurality of continuous voice signals, querying whether second text information matching the first text information exists among the acquired pieces of text information, and performing voice activity detection (VAD) on the plurality of voice signals.
The first text information here refers to text composed of the pieces of text converted from a plurality of continuous voice signals, for example "turn", "turn on", "turn on blue", or "turn on Bluetooth".
The implementation method of the obtained multiple pieces of text information is the same as that in the embodiment corresponding to fig. 2, and reference may be made to the foregoing description, which is not repeated.
In this step, two methods are executed simultaneously to determine whether to stop receiving voice signals: one is the query matching method, i.e., querying whether second text information matching the first text information exists among the acquired pieces of text information; the other is the voice activity detection method, i.e., performing voice activity detection on the plurality of voice signals to determine whether a voice endpoint is detected.
As an implementation, a "match" herein may be a perfect match. That is, when a second text message identical to the first text message is obtained from the plurality of obtained text messages, it is determined that a second text message matching the first text message exists in the plurality of obtained text messages. For example, the first text information is "bluetooth on", and the second text information is "bluetooth on". Or when a second text message contained by the first text message is acquired from the acquired plurality of text messages, determining that the second text message matched with the first text message exists in the acquired plurality of text messages. For example, the first text information is "please turn on bluetooth", and the second text information is "turn on bluetooth".
As another implementation, a "match" herein may be an inclusive match. That is, when a second text message including the first text message is obtained from the obtained plurality of text messages, it is determined that a second text message matching the first text message exists in the obtained plurality of text messages. For example, the first text information is "bluetooth on", and the second text information is "please bluetooth on".
Step 404, stopping receiving voice signals when second text information matching the first text information exists; and likewise stopping receiving voice signals when second text information matching the first text information does not exist or has not yet been found and voice activity detection performed on the plurality of voice signals detects a voice endpoint.
Voice endpoint detection means that a voice endpoint is detected when no voice signal is received within a period of time after voice detection starts. For example, using a timer, voice endpoint detection may work as follows: start the timer; if no voice signal has been received by the time the timer expires, a voice endpoint is detected. A minimal sketch follows.
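This timer can be sketched in Python, assuming a monotonic clock; the 0.7 s default echoes the voice-VAD figure given earlier and is an illustrative choice, not a value prescribed by the patent:

    import time

    class EndpointTimer:
        """Detects a voice endpoint when no speech arrives for timeout_s."""

        def __init__(self, timeout_s=0.7):
            self.timeout_s = timeout_s
            self.last_speech = time.monotonic()

        def on_speech(self):
            # Clear and restart the timer, as in the examples below.
            self.last_speech = time.monotonic()

        def endpoint_detected(self):
            return time.monotonic() - self.last_speech >= self.timeout_s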
With the two methods used simultaneously to judge whether to stop receiving voice signals, at least the following three situations can occur:
Case one: second text information matching the first text information is found while voice activity detection on the plurality of voice signals has not yet detected a voice endpoint; reception of voice signals is stopped.
In this case, query matching succeeds before voice activity detection detects a voice endpoint, and reception of voice signals is then stopped.
Case two: it is determined that no second text information matching the first text information exists, and voice activity detection on the plurality of voice signals detects a voice endpoint; reception of voice signals is stopped.
In this case, voice activity detection detects a voice endpoint before any second text information matching the first text information is found, and reception of voice signals is then stopped.
Case three: second text information matching the first text information has not yet been found, and voice activity detection on the plurality of voice signals detects a voice endpoint; reception of voice signals is stopped.
In this case too, voice activity detection detects a voice endpoint before second text information matching the first text information is found, and reception of voice signals is then stopped.
Based on this scheme, whether to stop receiving voice signals is judged by the text information query matching method and the voice activity detection method simultaneously; as soon as either method satisfies its stopping condition, reception of voice signals is stopped. Compared with using only the query matching method or only the voice activity detection method, this determines the stopping point more quickly. Moreover, as the user's voice inputs accumulate, more and more text information is recorded, subsequent query matching succeeds more often and runs faster, and the decision to stop receiving voice signals is made even sooner, which improves the overall response speed of the voice service and the user experience.
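Putting the two checks together, a minimal sketch of the combined loop of step 404 follows; it reuses the illustrative query_match and EndpointTimer helpers sketched above, and the token queue stands in for a streaming ASR front end (all names here are assumptions for the illustration):

    import queue

    def receive_voice(asr_tokens, store_query=query_match, timer=None, poll_s=0.05):
        """Accumulate first text information until either check says stop."""
        timer = timer or EndpointTimer()
        first_text = ""
        while True:
            try:
                token = asr_tokens.get(timeout=poll_s)   # next recognized word
            except queue.Empty:
                token = None
            if token is not None:
                timer.on_speech()                        # restart silence timer
                first_text = (first_text + " " + token).strip()
                if store_query(first_text):              # text match succeeded
                    return first_text                    # case one: stop early
            elif timer.endpoint_detected():              # VAD endpoint reached
                return first_text                        # cases two and three

Feeding the queue the recognized words "turn", "on", "bluetooth" returns on the third word via the text match; feeding "open", "wechat" returns only once the silence timeout expires, mirroring the two examples below.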
The above-described scheme is explained below with reference to an example.
For example, the communication device stores therein a plurality of text messages as shown in table 1 above.
Fig. 3 is a schematic diagram of the process of stopping reception of voice signals. In this voice input, the user continuously speaks "turn on Bluetooth" (打开蓝牙). The exact matching method is again used as the example.
Time T1: the ASR receives the speech signal "hit".
On one hand, the voice signal typing is converted into the character information typing, and the text information typing (called as first text information) is matched with the acquired text information (called as second text information) shown in the table 1.
On the other hand, based on the VAD method, after receiving a voice signal "beat", the timer starts to count, and if the next voice signal is not received within a set time length, the voice signal is stopped from being received.
Time T2: the ASR receives the speech signal "on".
On one hand, the voice signal 'on' is converted into the text information 'on', and the text information 'on' (called as first text information) is matched with the acquired text information (called as second text information) shown in the table 1.
On the other hand, based on the VAD method, after the voice signal "on" is received, the voice signal "on" is received again within the set time length, so that the timer is triggered to be cleared and timing is restarted, and if the next voice signal is not received within the set time length, the voice signal is stopped being received.
Time T3: the ASR receives the speech signal "blue".
On one hand, the voice signal "blue" is converted into the text information "blue", and the text information "open blue" (referred to as first text information) is matched with the acquired text information (referred to as second text information) shown in table 1.
On the other hand, based on the VAD method, after the voice signal "on" is received, the voice signal "blue" is received again within the set time duration, so that the timer is triggered to be cleared and timing is restarted, and if the next voice signal is not received within the set time duration, the voice signal is stopped being received.
Time T4: the ASR receives the speech signal "teeth".
On one hand, the voice signal tooth is converted into the character information tooth, and the text information Bluetooth is turned on (called as first text information) and is matched with the acquired text information (called as second text information) shown in the table 1.
On the other hand, according to the VAD method, after the voice signal "blue" is received, the "tooth" is received within the set time length, so that the trigger timer is cleared and the timing is restarted. And within a set time length, if the query matching method is successfully matched, stopping VAD.
It can be seen that if only the VAD method were used, additional time would pass after the speech signal "tooth" is received before it is determined that there is no further voice signal input and reception is stopped. That is, the method of this example stops receiving voice signals earlier.
The above-described scheme is explained below with reference to another example.
For example, the communication device stores therein a plurality of text messages as shown in table 1 above.
Fig. 5 is a schematic diagram of the process of stopping reception of voice signals. In this voice input, the user continuously speaks "turn on WeChat" (打开微信, often rendered "open WeChat"). The exact matching method is again used as the example.
Time T1: the ASR receives the speech signal "hit".
On one hand, the voice signal typing is converted into the character information typing, and the text information typing (called as first text information) is matched with the acquired text information (called as second text information) shown in the table 1.
On the other hand, according to the VAD method, after the voice signal is received, timing is started, and if the next voice signal is not received within a set time period, the voice signal is stopped from being received.
Time T2: the ASR receives the speech signal "on".
On one hand, the voice signal 'on' is converted into the text information 'on', and the text information 'on' (called as first text information) is matched with the acquired text information (called as second text information) shown in the table 1.
On the other hand, based on the VAD method, after the voice signal "on" is received, the voice signal "on" is received again within the set time length, so that the timer is triggered to be cleared and timing is restarted, and if the next voice signal is not received within the set time length, the voice signal is stopped being received.
Time T3: the ASR receives the speech signal "micro".
On one hand, the speech signal "micro" is converted into the text "micro", and the text "turn on micro" (the first text information) is queried against the acquired text information (the second text information) shown in Table 1.
On the other hand, for the VAD method, because the speech signal "micro" arrives within the set time length after "on", the timer is cleared and restarted; if no further speech signal were received within the set time length, reception of voice signals would be stopped.
Time T4: the ASR receives the speech signal "letter".
On one hand, the speech signal "letter" is converted into the text "letter", and the text "turn on WeChat" (the first text information) is queried against the acquired text information (the second text information) shown in Table 1; no match is found.
On the other hand, for the VAD method, because "letter" arrives within the set time length after "micro", the timer is cleared and restarted. No further speech signal is received within the set time length, the timer expires, and a voice endpoint is detected by the VAD method, so VAD stops and reception of voice signals is stopped.
It can be seen that if only the query matching method were used, reception of voice signals would not stop after the speech signal "letter" is received; the device would keep waiting for a next voice signal. With the method of this example, the VAD method stops reception of voice signals earlier.
Fig. 6 is a schematic diagram of a speech signal processing apparatus according to an embodiment of the present application. The apparatus is used for implementing the steps performed by the corresponding speech signal processing apparatus in the above-mentioned embodiment, as shown in fig. 6, the apparatus 600 includes a receiving unit 610, a converting unit 620, a querying unit 630 and a control unit 640.
A receiving unit 610 for receiving a plurality of continuous voice signals; a converting unit 620 for converting the plurality of continuous speech signals into first text information; a querying unit 630, configured to query, according to the first text information, whether there is second text information that matches the first text information in the obtained multiple pieces of text information; a control unit 640 for stopping receiving the voice signal in the case where there is second text information matching the first text information.
In a possible implementation method, the control unit 640 is further configured to stop receiving voice signals when second text information matching the first text information does not exist or has not yet been found and voice activity detection performed on the plurality of voice signals detects a voice endpoint.
In a possible implementation method, the acquired pieces of text information include text information corresponding to pluralities of voice signals that meet a first condition, where the first condition includes one or more of the following:
the plurality of voice signals are successfully executed;
the number of times the plurality of voice signals are successfully executed satisfies a predetermined condition.
In a possible implementation method, the acquired pieces of text information include preset pieces of text information.
Optionally, the speech signal processing apparatus 600 may further include a storage unit, which is configured to store data or instructions (also referred to as codes or programs), and the foregoing units may interact with or be coupled to the storage unit to implement corresponding methods or functions.
It should be understood that the division of units in the above apparatus is only a division of logical functions; in actual implementation the units may be wholly or partially integrated into one physical entity or physically separated. The units may all be implemented as software invoked by a processing element, or all in hardware, or partly as software invoked by a processing element and partly in hardware. For example, each unit may be a separately established processing element, or may be integrated into a chip of the apparatus, or may be stored in a memory in the form of a program whose function is invoked and executed by a processing element of the apparatus. In addition, the units may be integrated together or implemented independently. The processing element described here may be a processor, that is, an integrated circuit with signal processing capability. In implementation, the steps of the above method or the above units may be implemented by hardware integrated logic circuits in a processor element or by software invoked through the processor element.
In one example, the units in any of the above apparatuses may be one or more integrated circuits configured to implement the above methods, for example one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these integrated circuit forms. For another example, when a unit in an apparatus is implemented by a processing element scheduling a program, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking programs. For another example, these units may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The above receiving unit 610 is an interface circuit of the apparatus for receiving signals from other apparatuses. For example, when the device is implemented in the form of a chip, the receiving unit 610 is an interface circuit for the chip to receive signals from other chips or devices.
Referring to fig. 7, a schematic diagram of a speech signal processing apparatus provided in an embodiment of the present application is used to implement the operation of the speech signal processing apparatus in the above embodiment. As shown in fig. 7, the speech signal processing apparatus includes: a processor 710 and an interface 730, and optionally, the speech signal processing apparatus further comprises a memory 720. Interface 730 is used to enable communication with other devices.
The method performed by the speech signal processing apparatus in the above embodiments can be implemented by the processor 710 calling a program stored in a memory (which may be the memory 720 in the speech signal processing apparatus or an external memory). That is, the speech signal processing apparatus may include a processor 710 that executes the method of the above method embodiments by calling a program in a memory. The processor here may be an integrated circuit with signal processing capability, such as a CPU. The speech signal processing apparatus may also be implemented by one or more integrated circuits configured to implement the above method, for example one or more ASICs, one or more DSPs, one or more FPGAs, or a combination of at least two of these integrated circuit forms. Alternatively, the above implementations may be combined.
Specifically, the functions/implementation processes of the receiving unit 610, the converting unit 620, the querying unit 630 and the controlling unit 640 in fig. 6 can be implemented by the processor 710 in the speech signal processing apparatus 700 shown in fig. 7 calling the computer executable instructions stored in the memory 720. Alternatively, the functions/implementation processes of the conversion unit 620, the query unit 630 and the control unit 640 in fig. 6 may be implemented by the processor 710 in the speech signal processing apparatus 700 shown in fig. 7 calling a computer-executable instruction stored in the memory 720, and the functions/implementation processes of the receiving unit 610 in fig. 6 may be implemented by the interface 730 in the speech signal processing apparatus 700 shown in fig. 7.
The voice recognition device in the embodiment of the present application may be integrated in a mobile phone, so that the mobile phone may perform the voice recognition method. Fig. 8 shows a schematic structural diagram of a mobile phone 800. The cell phone 800 may include a processor 810, an external memory interface 820, an internal memory 821, a USB interface 830, a charging management module 840, a power management module 841, a battery 842, an antenna 1, an antenna 2, a mobile communication module 851, a wireless communication module 852, an audio module 870, a speaker 870A, a receiver 870B, a microphone 870C, a headset interface 870D, a sensor module 880, keys 890, a motor 891, an indicator 892, a camera 893, a display 894, a SIM card interface 895, and the like. The sensor module 880 may include a gyroscope sensor 880A, an acceleration sensor 880B, a proximity light sensor 880G, a fingerprint sensor 880H, and a touch sensor 880K (of course, the mobile phone 800 may further include other sensors, such as a temperature sensor, a pressure sensor, a distance sensor, a magnetic sensor, an ambient light sensor, an air pressure sensor, a bone conduction sensor, and the like, which are not shown in the figure).
It is to be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 800. In other embodiments of the present application, the cell phone 800 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 810 may include one or more processing units, such as: the processor 810 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a Neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors. The controller may be a neural center and a command center of the cell phone 800, among others. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 810 for storing instructions and data. In some embodiments, the memory in processor 810 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 810. If the processor 810 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 810, thereby increasing the efficiency of the system.
The processor 810 may execute the speech recognition method provided by the embodiment of the present application. The processor 810 may include different devices, such as an integrated CPU and a GPU, which may cooperate to perform the speech recognition method provided by the embodiments of the present application.
The display screen 894 is used to display images, video, and the like. The display screen 894 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the cell phone 800 may include 1 or N display screens 894, N being a positive integer greater than 1. The display screen 894 may be used to display information entered by or provided to the user as well as various graphical user interfaces (GUIs). For example, the display screen 894 may display a photograph, a video, a web page, or a file. As another example, the display screen 894 may display a graphical user interface comprising a status bar, a hideable navigation bar, a time and weather widget, and application icons such as a browser icon. The status bar includes the operator name (e.g., China Mobile), the mobile network (e.g., 4G), the time, and the remaining battery level. The navigation bar includes a back key icon, a home key icon, and a forward key icon. Further, in some embodiments, a Bluetooth icon, a Wi-Fi icon, an external-device icon, and the like may also be included in the status bar. In other embodiments, the graphical user interface may include a Dock bar, and the Dock bar may include commonly used application icons. When the processor 810 detects a touch event of a user's finger (or a stylus, etc.) on an application icon, in response to the touch event, the user interface of the application corresponding to that icon is opened and displayed on the display screen 894.
In this embodiment, the display screen 894 may be an integrated flexible display screen, or may be a spliced display screen formed by two rigid screens and a flexible screen located between the two rigid screens. After the processor 810 executes the speech recognition method provided by the embodiment of the present application, the processor 810 may control an external audio output device to switch an output audio signal.
The camera 893 (a front camera or a rear camera, or a single camera that can serve as both) is used to capture still images or video. In general, the camera 893 may include a lens group and a photosensitive element such as an image sensor. The lens group includes a plurality of lenses (convex or concave) for collecting the optical signals reflected by the object to be photographed and passing them to the image sensor, and the image sensor generates an original image of the object from the optical signals.
The internal memory 821 may be used to store computer-executable program code, which includes instructions. By executing the instructions stored in the internal memory 821, the processor 810 implements the various functional applications and data processing of the cell phone 800. The internal memory 821 may include a program storage area and a data storage area. The program storage area may store an operating system and the code of application programs (such as a camera application or a WeChat application). The data storage area may store data created during use of the cell phone 800 (such as images and videos captured by the camera application).
The internal memory 821 may also store one or more computer programs corresponding to the speech recognition method provided in the embodiments of the present application. The one or more computer programs are stored in the internal memory 821, are configured to be executed by the one or more processors 810, and include instructions that may be used to perform the steps of the embodiments of fig. 2 or fig. 4. The one or more computer programs may include an account verification module 8211, a priority comparison module 8212, and a state synchronization module 8213. The account verification module 8211 is used to authenticate the system accounts of other terminal devices in the local area network. The priority comparison module 8212 may be configured to compare the priority of the service requesting audio output with the priority of the service currently using the audio output device, as sketched below. The state synchronization module 8213 may be configured to synchronize the device state of the currently connected audio output device to other terminal devices, or to synchronize to the local device the state of the audio output device connected to a peer device. When the code of the audio output algorithm stored in the internal memory 821 is executed by the processor 810, the processor 810 may control the audio output device to switch the output audio signal.
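As a rough illustration of what the priority comparison module 8212 might do, the sketch below compares a requesting service against the service currently holding the audio output device. The service names and numeric priorities are invented for the example; the patent does not define concrete values.

```python
# Hypothetical service priorities (higher wins); illustrative only.
SERVICE_PRIORITY = {"call": 3, "navigation": 2, "music": 1}

def should_preempt(requesting: str, current: str) -> bool:
    """Grant the audio output device to the requester only if it outranks
    the service currently using the device."""
    return SERVICE_PRIORITY.get(requesting, 0) > SERVICE_PRIORITY.get(current, 0)

print(should_preempt("call", "music"))  # True:  an incoming call interrupts music
print(should_preempt("music", "call"))  # False: music must wait
```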
In addition, the internal memory 821 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
Of course, the code of the speech recognition method provided by the embodiments of the present application may also be stored in an external memory. In this case, the processor 810 may run, through the external memory interface 820, the code of the speech recognition method stored in the external memory, and the processor 810 may control the audio output device to switch the output audio signal.
The function of the sensor module 880 is described below.
The gyro sensor 880A may be used to determine the motion pose of the cell phone 800. In some embodiments, the angular velocity of the cell phone 800 about three axes (i.e., x, y, and z axes) may be determined by the gyro sensors 880A. That is, the gyro sensor 880A may be used to detect the current motion state of the cell phone 800, such as shaking or standing still.
When the display screen in the embodiment of the present application is a foldable screen, the gyro sensor 880A may be used to detect a folding or unfolding operation applied to the display screen 894. Gyroscope sensor 880A may report the detected folding or unfolding operation as an event to processor 810 to determine the folded or unfolded state of display 894.
The acceleration sensor 880B can detect the magnitude of the acceleration of the cell phone 800 in various directions (typically along three axes). That is, the acceleration sensor 880B may also be used to detect the current motion state of the cell phone 800, such as shaking or standing still. When the display screen in the embodiments of the present application is a foldable screen, the acceleration sensor 880B may be used to detect a folding or unfolding operation applied to the display screen 894, and may report the detected operation as an event to the processor 810 to determine the folded or unfolded state of the display screen 894.
The proximity light sensor 880G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared LED: the phone emits infrared light outward through it and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, the phone can determine that an object is nearby; when insufficient reflected light is detected, it can determine that no object is nearby, as sketched below. When the display screen in the embodiments of the present application is a foldable screen, the proximity light sensor 880G may be disposed on the first screen of the foldable display screen 894, and may detect the folding or unfolding angle between the first screen and the second screen according to the optical path difference of the infrared signal.
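The proximity decision described above reduces to a threshold test on the photodiode reading, as in the minimal sketch below; the threshold and the normalized reading are assumptions of the example, not values from the embodiment.

```python
PROXIMITY_THRESHOLD = 0.5  # hypothetical normalized photodiode reading

def object_nearby(reflected_ir: float) -> bool:
    """Enough reflected infrared light implies an object close to the phone."""
    return reflected_ir >= PROXIMITY_THRESHOLD

print(object_nearby(0.8))  # True: e.g., the phone is held to the ear
print(object_nearby(0.1))  # False: nothing in front of the sensor
```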
The gyro sensor 880A (or the acceleration sensor 880B) may send the detected motion state information (such as the angular velocity) to the processor 810. Based on this information, the processor 810 determines whether the cell phone 800 is currently hand-held or on a tripod (for example, a non-zero angular velocity indicates that the cell phone 800 is hand-held).
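A minimal sketch of that hand-held/tripod decision follows. The dead-band epsilon is an assumption added so that sensor noise is not mistaken for motion; it applies the text's "angular velocity is not 0" test with a tolerance.

```python
EPSILON = 0.01  # rad/s; hypothetical noise floor, not a value from the patent

def is_handheld(angular_velocity: tuple) -> bool:
    """Treat a clearly non-zero rate on any axis as evidence of a hand-held phone."""
    return any(abs(w) > EPSILON for w in angular_velocity)

print(is_handheld((0.0, 0.0, 0.0)))    # False: tripod state
print(is_handheld((0.12, 0.0, 0.03)))  # True:  hand-held state
```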
The fingerprint sensor 880H is used to collect a fingerprint. The cell phone 800 can utilize the collected fingerprint characteristics to achieve fingerprint unlocking, access to an application lock, fingerprint photographing, fingerprint incoming call answering, and the like.
The touch sensor 880K is also referred to as a "touch panel". The touch sensor 880K may be disposed on the display screen 894; together they form what is commonly called a touchscreen. The touch sensor 880K is used to detect a touch operation applied on or near it, and can pass the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided via the display screen 894. In other embodiments, the touch sensor 880K may be disposed on a surface of the cell phone 800 at a location different from that of the display screen 894.
Illustratively, the display screen 894 of the cell phone 800 displays a main interface that includes icons of a plurality of applications (e.g., a camera application, a WeChat application, etc.). The user clicks the icon of the camera application in the main interface via the touch sensor 880K, which triggers the processor 810 to start the camera application and open the camera 893. The display screen 894 then displays the camera application's interface, such as a viewfinder interface.
The wireless communication function of the mobile phone 800 can be realized by the antenna 1, the antenna 2, the mobile communication module 851, the wireless communication module 852, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the handset 800 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 851 can provide solutions for wireless communication applied to the cell phone 800, including 2G/3G/4G/5G. The mobile communication module 851 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 851 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 851 may also amplify a signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 851 may be disposed in the processor 810. In some embodiments, at least some of the functional modules of the mobile communication module 851 and at least some of the modules of the processor 810 may be disposed in the same device. In the embodiments of the present application, the mobile communication module 851 can also be used for exchanging information with other terminal devices, for example sending audio output requests to other terminal devices, or receiving audio output requests and encapsulating them into messages in a specified format, as sketched below.
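The "specified format" for such messages is not detailed at this point, so the sketch below simply assumes a JSON payload; the field names and message type tag are illustrative only.

```python
import json

def encapsulate_audio_request(device_id: str, service: str, priority: int) -> bytes:
    """Wrap an audio output request as a message that the communication
    module could forward to another terminal device."""
    message = {
        "type": "audio_output_request",  # assumed message type tag
        "device_id": device_id,
        "service": service,
        "priority": priority,
    }
    return json.dumps(message).encode("utf-8")

payload = encapsulate_audio_request("phone-800", "music", priority=1)
print(payload)
```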
The modem processor may include a modulator and a demodulator. The modulator modulates a low-frequency baseband signal to be transmitted into a medium- or high-frequency signal; the demodulator demodulates a received electromagnetic wave signal into a low-frequency baseband signal and passes it to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs a sound signal through an audio device (not limited to the speaker 870A and the receiver 870B) or displays an image or video through the display screen 894. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 810 and disposed in the same device as the mobile communication module 851 or other functional modules.
The wireless communication module 852 may provide solutions for wireless communication applied to the cell phone 800, including wireless local area networks (WLANs) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 852 may be one or more devices integrating at least one communication processing module. The wireless communication module 852 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 810. The wireless communication module 852 may also receive signals to be transmitted from the processor 810, frequency-modulate and amplify them, and convert them into electromagnetic waves radiated through the antenna 2. In the embodiments of the present application, the wireless communication module 852 may be configured to establish a connection with an audio output device and output speech signals through it. Alternatively, the wireless communication module 852 may be used to access an access point device, and to send messages corresponding to audio output requests to other terminal devices or receive such messages from them. Optionally, the wireless communication module 852 may also receive voice data from other terminal devices.
In addition, the cell phone 800 may implement audio functions, such as music playing and recording, through the audio module 870, the speaker 870A, the receiver 870B, the microphone 870C, the earphone interface 870D, and the application processor. The cell phone 800 may receive input from the keys 890, generating key signal inputs related to user settings and function control. The cell phone 800 may use the motor 891 to generate a vibration alert (e.g., for an incoming call). The indicator 892 may be an indicator light used to indicate the charging status or a change in battery level, or to indicate a message, a missed call, a notification, and the like. The SIM card interface 895 is used to connect a SIM card, which can be attached to or detached from the cell phone 800 by being inserted into or pulled out of the SIM card interface 895.
It should be understood that in practical applications, the cell phone 800 may include more or fewer components than those shown in fig. 8, may combine two or more components, or may have a different arrangement of components; the embodiments of the present application are not limited in this respect. The illustrated cell phone 800 is merely an example. The various components shown in the figures may be implemented in hardware, software, or a combination of both, including one or more signal processing and/or application-specific integrated circuits.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
The present application also provides a computer-readable medium having stored thereon a computer program which, when executed by a computer, implements the functionality of any of the method embodiments described above.
The present application also provides a computer program product which, when executed by a computer, implements the functionality of any of the method embodiments described above.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
The various illustrative logical units and circuits described in this application may be implemented or performed with a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors together with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in random access memory (RAM), flash memory, read-only memory (ROM), EPROM memory, EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one or more exemplary designs, the functions described herein may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store program code in the form of instructions or data structures and that can be read by a general-purpose or special-purpose computer or processor. In addition, any connection is properly termed a computer-readable medium; thus, if the software is transmitted from a website, server, or other remote source over coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wirelessly (e.g., infrared, radio, or microwave), it is included in the definition of medium. Disk and disc, as used herein, include compact disc, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above may also be included within the scope of computer-readable media.
The foregoing describes the objects, technical solutions, and advantages of the present application in further detail. It should be understood that the foregoing embodiments are merely examples of the present application and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present application shall fall within its protection scope. This description is provided to enable any person skilled in the art to make or use the teachings of the application; modifications based on the disclosure will be apparent to those skilled in the art, and the basic principles described herein may be applied to other variations without departing from the spirit and scope of the application. Thus, the disclosure is not intended to be limited to the embodiments and designs described, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the present application has been described in conjunction with specific features and embodiments, it is evident that various modifications and combinations can be made without departing from its spirit and scope. Accordingly, the specification and drawings are merely exemplary of the present application as defined by the appended claims, and are intended to cover any and all modifications, variations, combinations, or equivalents within its scope. If such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to include them as well.

Claims (11)

1. A speech signal processing method, comprising:
querying, according to first text information obtained by converting a plurality of received continuous speech signals, whether second text information matching the first text information exists in a plurality of pieces of acquired text information; and
stopping receiving speech signals in a case where the second text information matching the first text information exists.
2. The method of claim 1, wherein the method further comprises:
stopping receiving speech signals in a case where second text information matching the first text information does not exist or has not yet been found, and a speech endpoint is detected by performing voice activity detection on the plurality of speech signals.
3. The method according to claim 1 or 2, wherein the plurality of pieces of acquired text information comprise text information respectively corresponding to a plurality of speech signals that meet a first condition, and the first condition comprises one or more of the following:
the plurality of speech signals have been successfully executed; and
the number of times the plurality of speech signals have been successfully executed satisfies a predetermined condition.
4. The method according to any one of claims 1 to 3, wherein the plurality of pieces of acquired text information comprise a plurality of preset pieces of text information.
5. A speech signal processing apparatus, comprising:
a receiving unit, configured to receive a plurality of continuous speech signals;
a conversion unit configured to convert the plurality of continuous speech signals into first text information;
a query unit, configured to query, according to the first text information, whether second text information matching the first text information exists in a plurality of pieces of acquired text information; and
a control unit, configured to stop receiving speech signals in a case where the second text information matching the first text information exists.
6. The apparatus of claim 5, wherein the control unit is further configured to stop receiving speech signals in a case where second text information matching the first text information does not exist or has not yet been found, and a speech endpoint is detected by performing voice activity detection on the plurality of speech signals.
7. The apparatus according to claim 5 or 6, wherein the plurality of pieces of acquired text information comprise text information respectively corresponding to a plurality of speech signals that meet a first condition, and the first condition comprises one or more of the following:
the plurality of speech signals have been successfully executed; and
the number of times the plurality of speech signals have been successfully executed satisfies a predetermined condition.
8. The apparatus according to any one of claims 5 to 7, wherein the plurality of pieces of acquired text information comprise a plurality of preset pieces of text information.
9. A speech signal processing apparatus, comprising:
a processor and a memory coupled to the processor, wherein the memory is configured to store program instructions and the processor is configured to execute the program instructions, so that the method of any one of claims 1 to 4 is implemented.
10. A computer program product comprising instructions that, when run on a computer, implement the method of any one of claims 1 to 4.
11. A computer-readable storage medium having instructions stored thereon that, when executed on a computer, implement the method of any one of claims 1 to 4.
CN202011589178.6A 2020-12-29 2020-12-29 Voice signal processing method and device Pending CN114758651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011589178.6A CN114758651A (en) 2020-12-29 2020-12-29 Voice signal processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011589178.6A CN114758651A (en) 2020-12-29 2020-12-29 Voice signal processing method and device

Publications (1)

Publication Number Publication Date
CN114758651A true CN114758651A (en) 2022-07-15

Family

ID=82324696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011589178.6A Pending CN114758651A (en) 2020-12-29 2020-12-29 Voice signal processing method and device

Country Status (1)

Country Link
CN (1) CN114758651A (en)

Similar Documents

Publication Publication Date Title
CN111724775B (en) Voice interaction method and electronic equipment
US20130124207A1 (en) Voice-controlled camera operations
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN111933112B (en) Awakening voice determination method, device, equipment and medium
WO2022052776A1 (en) Human-computer interaction method, and electronic device and system
CN111739517B (en) Speech recognition method, device, computer equipment and medium
US20220116758A1 (en) Service invoking method and apparatus
WO2020073248A1 (en) Human-computer interaction method and electronic device
CN114299933A (en) Speech recognition model training method, device, equipment, storage medium and product
CN114691839A (en) Intention slot position identification method
WO2023040658A1 (en) Speech interaction method and electronic device
WO2022161077A1 (en) Speech control method, and electronic device
CN114758651A (en) Voice signal processing method and device
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
CN111028846B (en) Method and device for registration of wake-up-free words
CN114333821A (en) Elevator control method, device, electronic equipment, storage medium and product
CN114765026A (en) Voice control method, device and system
CN113823266A (en) Keyword detection method, device, equipment and storage medium
CN114444042A (en) Electronic equipment unlocking method and device
CN114528842A (en) Word vector construction method, device and equipment and computer readable storage medium
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN111479060B (en) Image acquisition method and device, storage medium and electronic equipment
CN116860913A (en) Voice interaction method, device, equipment and storage medium
WO2023231936A1 (en) Speech interaction method and terminal
WO2023236908A1 (en) Image description method, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination