CN112489653A - Speech recognition method, device and storage medium

Info

Publication number: CN112489653A (application CN202011282047.3A; granted publication CN112489653B)
Authority: CN (China)
Prior art keywords: audio data, terminal, quality evaluation, target, evaluation parameter
Legal status: Granted; Active
Inventor: 程思
Current and original assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Other languages: Chinese (zh)
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd; priority claimed from CN202011282047.3A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/60 Speech or voice analysis techniques specially adapted for measuring the quality of voice signals

Abstract

The present disclosure relates to a speech recognition method, apparatus, and storage medium. A target terminal acquires first audio data collected in a current time period; when a preset quality evaluation trigger condition is met, it obtains a first quality evaluation parameter of the first audio data; it receives the second quality evaluation parameter sent by the other terminals in the terminal group network in which the target terminal is located; it determines, according to the first quality evaluation parameter and the second quality evaluation parameter, whether the target terminal is the data uploading terminal; and, when the target terminal is determined to be the data uploading terminal, it sends audio data to be recognized to a server so that the server performs speech recognition on it. The audio data to be recognized is the audio data collected in a target time period, where the target time period includes a target time and a preset period before the target time, and the target time is the time at which the target terminal is determined to be the data uploading terminal.

Description

Speech recognition method, device and storage medium
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method, an apparatus, and a storage medium for speech recognition.
Background
In recent years, speech recognition technology has made remarkable progress, and voice-based human-computer interaction has attracted wide attention. A large number of voice-interaction smart devices, such as smart speakers, smart air conditioners, and voice assistants, are already on the market, and a user can wake such a device by speaking a wake-up word. In a practical application scenario, several smart devices may be present in the same space; after the user utters the wake-up audio, the smart device closest to the user can be selected from among them through a distributed decision-making process to respond.
In the related art, after a smart device collects input audio, it feeds the audio into its wake-up engine so that the engine can carry out the nearby wake-up procedure; once the device is judged, by the nearby wake-up decision, to be the one to wake up, it starts the automatic speech recognition (ASR) audio upload function and uploads the input audio to the server, which performs speech recognition on it to realize human-computer interaction. The inventor found, however, that if the location of the woken device does not coincide with the location where the user utters the audio command, or if the audio captured by the woken device's microphone is of lower quality than the audio captured by other, non-woken devices, the accuracy of the server's speech recognition suffers.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, and a storage medium for speech recognition.
According to a first aspect of the embodiments of the present disclosure, a method for speech recognition is provided, which is applied to a target terminal, and includes acquiring first audio data acquired in a current time period; acquiring a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met; receiving a second quality evaluation parameter sent by other terminals except the target terminal in the terminal group network where the target terminal is located; determining whether the target terminal is a data uploading terminal or not according to the first quality evaluation parameter and the second quality evaluation parameter; and under the condition that the target terminal is determined to be the data uploading terminal, sending audio data to be identified to a server so that the server performs voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes: the first audio data collected in the current time period comprises preset awakening words; or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to indicate the target terminal to perform audio quality analysis on the first audio data.
Optionally, before the receiving of the second quality evaluation parameter sent by the terminal other than the target terminal in the terminal group network where the target terminal is located, the method further includes: if the first audio data comprises the preset awakening word, sending a second quality evaluation indication message to the other terminal, wherein the second quality evaluation indication message is used for indicating the other terminal to perform audio quality analysis on the second audio data; the receiving of the second quality evaluation parameter sent by the other terminals except the target terminal in the terminal group network where the target terminal is located includes: and receiving the second quality evaluation parameter sent by the other terminals according to the second quality evaluation indication message.
Optionally, the determining, according to the first quality evaluation parameter and the second quality evaluation parameter, whether the target terminal is a data upload terminal includes: taking the target terminal as the data uploading terminal under the condition that the first quality evaluation parameter is greater than or equal to a preset evaluation threshold value; or, under the condition that the first quality evaluation parameter is less than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
Optionally, the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter further includes: if a target quality evaluation parameter which is larger than the first quality evaluation parameter exists in the second quality evaluation parameter, calculating a difference value between the target quality evaluation parameter and the first quality evaluation parameter; and if the difference value of the preset number is smaller than or equal to a preset difference value threshold value, taking the target terminal as the data uploading terminal.
Optionally, the method further comprises: receiving awakening data sent by other terminals; determining whether to awaken the target terminal according to the first audio data and the awakening data; and taking the target terminal as the data uploading terminal under the condition of determining to awaken the target terminal.
Optionally, before the obtaining the first quality-assessment parameter of the first audio data, the method further comprises: performing echo cancellation on interference audio data in the first audio data to obtain target audio data; the obtaining of the first quality-assessment parameter of the first audio data includes: and acquiring the first quality evaluation parameter of the target audio data.
Optionally, before the performing echo cancellation on the interfering audio data in the first audio data, the method further includes: receiving external echo audio data sent by other terminals and a timestamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a timestamp of each frame of the locally cached echo audio data; the performing echo cancellation on the interfering audio data in the first audio data comprises: searching corresponding audio data from the first audio data according to the acquired time stamp of each frame of echo audio data to obtain aligned audio data; performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the local cache echo audio data.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for speech recognition, applied to a server, the method including: receiving audio data to be identified sent by at least one terminal in a terminal group network; performing audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; and performing voice recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the receiving of the audio data to be identified sent by at least one terminal in the terminal group network includes: receiving the audio data to be identified sent by at least one data uploading terminal in the terminal group network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the acquired first audio data by the target terminal under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the acquired second audio data by other terminals except the target terminal in the terminal group network; the audio quality analysis of each audio data to be identified to obtain a third quality evaluation parameter includes: and performing audio quality analysis on the audio data to be identified sent by each data uploading terminal to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation is performed by the terminal; the audio quality analysis of each audio data to be identified to obtain a third quality evaluation parameter includes: and performing audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
According to a third aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus applied to a target terminal, including: the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is configured to acquire first audio data acquired in a current time period; the second acquisition module is configured to acquire a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met; a first receiving module configured to receive a second quality evaluation parameter sent by a terminal other than the target terminal in a terminal group network in which the target terminal is located; a first determining module configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter; the first sending module is configured to send audio data to be recognized to a server under the condition that the target terminal is determined to be the data uploading terminal, so that the server performs voice recognition on the audio data to be recognized, the audio data to be recognized is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes: the first audio data collected in the current time period comprises preset awakening words; or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to indicate the target terminal to perform audio quality analysis on the first audio data.
Optionally, the apparatus further comprises: a second sending module, configured to send a second quality evaluation indication message to the other terminal if the first audio data includes the preset wake-up word, where the second quality evaluation indication message is used to indicate the other terminal to perform audio quality analysis on second audio data; the first receiving module is configured to receive the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
Optionally, the second quality evaluation parameter includes at least one quality evaluation parameter, and the first determining module is configured to take the target terminal as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or, under the condition that the first quality evaluation parameter is less than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
Optionally, the first determining module is configured to calculate a difference between the target quality evaluation parameter and the first quality evaluation parameter if a target quality evaluation parameter greater than the first quality evaluation parameter exists in the second quality evaluation parameter; and if the difference value of the preset number is smaller than or equal to a preset difference value threshold value, taking the target terminal as the data uploading terminal.
Optionally, the apparatus further comprises: a second receiving module configured to receive the wake-up data sent by the other terminal; a second determining module configured to determine whether to wake up the target terminal according to the first audio data and the wake-up data; and taking the target terminal as the data uploading terminal under the condition of determining to awaken the target terminal.
Optionally, the apparatus further comprises: the echo cancellation module is configured to perform echo cancellation on interference audio data in the first audio data to obtain target audio data; the second obtaining module is configured to obtain the first quality assessment parameter of the target audio data.
Optionally, the apparatus further comprises: the third receiving module is configured to receive the external echo audio data sent by the other terminal and a timestamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a timestamp of each frame of the locally cached echo audio data; the echo cancellation module is configured to search for corresponding audio data from the first audio data according to the obtained timestamp of each frame of the echo audio data, so as to obtain aligned audio data; performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the local cache echo audio data.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an apparatus for speech recognition, applied to a server, the apparatus including: the fourth receiving module is configured to receive audio data to be identified, which is sent by at least one terminal in the terminal group network; the audio quality analysis module is configured to perform audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter; a third determining module configured to determine target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; a speech recognition module configured to perform speech recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the fourth receiving module is configured to receive the audio data to be identified sent by at least one data uploading terminal in the terminal group network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the acquired first audio data by the target terminal under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the acquired second audio data by other terminals except the target terminal in the terminal group network; the audio quality analysis module is configured to perform audio quality analysis on the audio data to be identified sent by each data uploading terminal to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation is performed by the terminal; the audio quality analysis module is configured to perform audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus applied to a target terminal, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: acquiring first audio data acquired in the current time period; acquiring a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met; receiving a second quality evaluation parameter sent by other terminals except the target terminal in the terminal group network where the target terminal is located; determining whether the target terminal is a data uploading terminal or not according to the first quality evaluation parameter and the second quality evaluation parameter; and under the condition that the target terminal is determined to be the data uploading terminal, sending audio data to be identified to a server so that the server performs voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of the first aspect of the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus applied to a server, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: receiving audio data to be identified sent by at least one terminal in a terminal group network; performing audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; and performing voice recognition on the target recognition audio data.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of the second aspect of the present disclosure.
The technical solution provided by the embodiments of the present disclosure can have the following beneficial effects. The target terminal acquires the first audio data collected in the current time period; obtains a first quality evaluation parameter of the first audio data when a preset quality evaluation trigger condition is met; receives the second quality evaluation parameter sent by the other terminals in the terminal group network in which it is located; determines, according to the first quality evaluation parameter and the second quality evaluation parameter, whether it is the data uploading terminal; and, when it is, sends the audio data to be recognized, namely the audio collected in the target time period (the target time plus a preset period before it, the target time being the moment at which the terminal is determined to be the data uploading terminal), to the server for speech recognition. Because the target terminal analyzes the quality of the first audio data collected in the current time period, combines that analysis with the quality analysis results received from the other terminals to decide whether it is the data uploading terminal, and sends the audio data to be recognized to the server only when it is, the audio quality of the uploaded audio is guaranteed, which in turn improves the accuracy of speech recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a diagram illustrating a scenario for terminal networking in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a first method of speech recognition according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a second method of speech recognition according to an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a third method of speech recognition according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating a first speech recognition apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating a second speech recognition apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram illustrating a third speech recognition apparatus according to an exemplary embodiment;
FIG. 8 is a block diagram illustrating a fourth speech recognition apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram illustrating a fifth speech recognition apparatus according to an exemplary embodiment;
FIG. 10 is a block diagram illustrating a sixth speech recognition apparatus according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating a speech recognition apparatus according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating another speech recognition apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, an application scenario of the present disclosure is introduced. The present disclosure is mainly applied to voice wake-up in a distributed device-networking environment, for example a scenario in which a user wakes up any smart device in a terminal group network by voice and carries out human-computer interaction by speaking voice commands. The terminal group network may be a network of at least two smart devices. FIG. 1 is a scene diagram of terminal networking; as shown in FIG. 1, a plurality of smart devices (such as a smart speaker, a notebook computer, a television, a watch, and a mobile phone) are bound to the same Xiaomi account, and when these smart devices are all online they can form a network. When the user wants to wake up a certain terminal in the terminal group network, the user can do so by speaking a wake-up word, and the terminal group network selects, according to the nearby wake-up decision logic, the terminal device closest to the user to respond.
In the related art, after a smart device collects input audio, it feeds the audio into its wake-up engine so that the engine can carry out the nearby wake-up procedure; once the device is judged, by the nearby wake-up decision, to be the one to wake up, it starts the automatic speech recognition (ASR) audio upload function and uploads the input audio to the server, which performs speech recognition on it to realize human-computer interaction. The inventor found that if the location of the woken device does not coincide with the location where the user utters the audio command, or if the audio captured by the woken device's microphone is of lower quality than the audio captured by other, non-woken devices, the accuracy of the server's speech recognition is affected. In addition, to further improve recognition accuracy, the related art proceeds as follows: if the target terminal (i.e., any terminal in the terminal group network) is currently playing music, local echo cancellation is first performed on the audio collected by its microphone to remove the interference from that music, and noise reduction is then applied; if the target terminal is not playing music, echo cancellation is skipped and noise reduction is applied directly, after which the noise-reduced audio is passed to the wake-up engine for the nearby wake-up decision. However, if several smart devices in the same space are playing music at the same time (i.e., other terminals in the terminal group network are also playing music), the audio collected by the target terminal is also disturbed by the music played by those other terminals, and performing only local echo cancellation (i.e., cancelling only the audio played by the target terminal itself from the microphone signal) likewise degrades the accuracy of speech recognition.
To solve these problems, the present disclosure provides a speech recognition method, apparatus, and storage medium. The target terminal performs audio quality analysis on the first audio data collected in the current time period to obtain a first quality evaluation parameter, combines it with the second quality evaluation parameter (the quality analysis results received from the other terminals) to determine whether it is the data uploading terminal, and sends the audio data to be recognized to the server only when it is, thereby guaranteeing the audio quality of the uploaded audio and improving the accuracy of speech recognition. In addition, before the audio quality analysis is performed and before the audio data to be recognized is uploaded to the server, the present disclosure can apply global echo cancellation to the first audio data and to the audio data to be recognized, completely removing from the microphone signal both the received audio played by other terminals and the audio played by the target terminal itself, and then upload the globally echo-cancelled audio to the server, which further improves the accuracy of speech recognition.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a voice recognition method according to an exemplary embodiment, where the method is applied to a target terminal, and the target terminal may be any terminal in a terminal group network, as shown in fig. 2, and the method includes the following steps:
in step S201, first audio data acquired for a current time period is acquired.
In the present disclosure, the audio data may be collected by the target terminal in real time, and may include one or more of a wake-up audio uttered by the user (such as a wake-up word), an instruction audio (such as "what is the weather like today"), environmental noise, and audio played by any terminal in the terminal group network (such as music).
In one possible implementation, the audio data is collected in real time by a microphone on the target terminal and is written into the target terminal's buffer space with overwrite caching. For example, if the buffer space can hold 10 seconds of audio, then when the 11th second of audio is collected, the 1st second of audio cached earlier is deleted from the buffer space and the 11th second is stored in its place, so that the buffer then holds the audio from the 2nd to the 11th second.
The current time period may include a preset cache time period corresponding to the cache space of the target terminal, for example, the audio data may be set in the cache space to be cached for at most 10 seconds, and the current time period is a historical 10-second time period with the current time as an end time, so that the target terminal may obtain the first audio data acquired by the current time period from the cache space.
In addition, in order to facilitate subsequent echo cancellation of the collected audio data, the time stamp corresponding to each frame of audio data can be recorded after the audio data are collected, so that the collected audio data and the time stamp of each frame of audio data can be cached and recorded.
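The overwrite caching and per-frame timestamping described above can be pictured with a minimal sketch. Python is used here only for illustration; the 10-second buffer and the 10 ms frame interval follow the examples given in this description, while the helper names are assumptions and not part of the disclosure:

```python
from collections import deque
import time

FRAME_MS = 10                               # one frame every 10 ms, as in step S402 below
MAX_SECONDS = 10                            # buffer holds at most 10 s of audio
MAX_FRAMES = MAX_SECONDS * 1000 // FRAME_MS

# Each entry is (timestamp, frame_bytes); once MAX_FRAMES is reached,
# appending a new frame silently drops the oldest one (overwrite caching).
ring = deque(maxlen=MAX_FRAMES)

def on_frame_captured(frame_bytes: bytes) -> None:
    """Cache a newly captured frame together with its acquisition timestamp."""
    ring.append((time.time(), frame_bytes))

def frames_in_period(start_ts: float, end_ts: float) -> list:
    """Return the cached frames whose timestamps fall within [start_ts, end_ts]."""
    return [frame for ts, frame in ring if start_ts <= ts <= end_ts]
```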
In step S202, a first quality evaluation parameter of the first audio data is acquired in a case where a preset quality evaluation trigger condition is satisfied.
The first quality evaluation parameter may be a signal-to-noise ratio or an audio quality score, and the preset quality evaluation trigger condition may include: the first audio data collected in the current time period contains a preset wake-up word (such as "Xiao Ai Tongxue"); or a first quality evaluation indication message sent by another terminal is received, the first quality evaluation indication message instructing the target terminal to perform audio quality analysis on the first audio data.
Considering that the wake-up engines of different terminals in the terminal group network differ in performance, the engine with better performance will recognize the preset wake-up word first. To ensure both the timeliness and the accuracy of speech recognition, in the present disclosure the terminal that first recognizes the preset wake-up word triggers the other terminals in the terminal group network to perform audio quality analysis together. Thus, if another terminal in the terminal group network recognizes, before the target terminal does, that its collected second audio data contains the preset wake-up word, that terminal sends the first quality evaluation indication message to the target terminal; the target terminal then receives this message and, according to it, triggers audio quality analysis of the first audio data.
That is, in the present disclosure, the target terminal may perform audio quality analysis on the first audio data when determining that any one of the above conditions is satisfied.
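A compact way to read the two trigger conditions, as a sketch only; the wake_engine and inbox objects are hypothetical placeholders for the terminal's wake-up engine and its group-network messaging, not interfaces defined by the disclosure:

```python
def should_evaluate_quality(first_audio, wake_engine, inbox) -> bool:
    """True if either preset quality evaluation trigger condition holds."""
    # Condition 1: the locally collected first audio data contains the wake-up word.
    if wake_engine.contains_wake_word(first_audio):
        return True
    # Condition 2: another terminal sent a first quality evaluation indication message.
    return inbox.has_message("QUALITY_EVALUATION_INDICATION")
```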
In step S203, a second quality evaluation parameter sent by a terminal other than the target terminal in the terminal group network in which the target terminal is located is received.
The second quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained by performing audio quality analysis on the second audio data.
In the present disclosure, if the wake-up engine of the target terminal detects that the first audio data contains the preset wake-up word, this indicates that the target terminal is the first terminal in the terminal group network to recognize the preset wake-up word. In this case the target terminal triggers the other terminals in the group network to perform audio quality analysis on their collected second audio data to obtain the second quality evaluation parameter, which those terminals then send to the target terminal, so that the target terminal can receive it.
In step S204, it is determined whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter.
The data uploading terminal is the terminal that sends the audio data to be recognized to the server, and the audio data to be recognized may include instruction audio input by the user, such as "what is the weather like today" or "please play a children's song".
In step S205, in the case that the target terminal is determined to be the data uploading terminal, the audio data to be recognized is sent to the server, so that the server performs voice recognition on the audio data to be recognized.
The audio data to be recognized is the audio collected within a target time period, where the target time period includes a target time and a preset period before it, and the target time is the moment at which the target terminal is determined to be the data uploading terminal. For example, suppose the preset period is 5 seconds, the buffer space holds the most recent 10 seconds of collected audio, and the target time is t1; the target terminal can then cut, from the 10 seconds of buffered audio, the audio collected at t1 and during the 5 seconds before t1, and use it as the audio data to be recognized.
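Reusing the ring buffer sketched earlier, cutting out the audio data to be recognized could look like the following; the 5-second default matches the example above, and the function name is illustrative:

```python
def audio_to_be_recognized(ring, target_time: float, preset_period: float = 5.0) -> list:
    """Frames collected between (target_time - preset_period) and target_time,
    i.e. the target time period described above."""
    start = target_time - preset_period
    return [frame for ts, frame in ring if start <= ts <= target_time]
```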
By adopting the method, the audio quality analysis can be carried out on the first audio data acquired in the current time period at the target terminal side, whether the target terminal is a data uploading terminal is determined by combining the analysis result of the audio data received by other terminals, and the audio data to be identified is sent to the server under the condition that the target terminal is determined to be the data uploading terminal, so that the audio quality of the audio data to be identified uploaded by the data uploading terminal is ensured, and the accuracy of voice identification is improved.
Fig. 3 is a flow chart illustrating a method of speech recognition, which may be applied to a server, according to an example embodiment, the method comprising the steps of, as shown in fig. 3:
in step S301, audio data to be identified sent by at least one terminal in the terminal group network is received.
The audio data to be recognized may include instruction audio input by the user, such as "what is the weather like today" or "please play a children's song".
In step S302, audio quality analysis is performed on each piece of audio data to be identified, so as to obtain a third quality evaluation parameter.
The third quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained by performing audio quality analysis on the audio data to be identified.
In this step, the server may perform audio quality analysis on the audio data to be identified in either of the following two manners to obtain the third quality evaluation parameter:
In the first manner, signal-to-noise ratio analysis is performed on the audio data to be identified to obtain the third quality evaluation parameter.
In the second manner, audio quality analysis may be performed on the audio data to be identified through a pre-trained audio quality analysis model (e.g., a deep learning model) to obtain the third quality evaluation parameter.
It should be noted that, for the specific implementation steps of the two manners, reference may be made to descriptions in the related art, and details are not described herein.
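Since the implementation details are deferred to the related art, the following is only a rough sketch of the first manner: it estimates a signal-to-noise ratio by treating the quietest frames as noise and the loudest frames as speech. The 10% split and 160-sample frame length are assumptions made for the sketch:

```python
import numpy as np

def snr_quality_score(samples: np.ndarray, frame_len: int = 160) -> float:
    """Rough SNR estimate (in dB) used as a quality evaluation parameter."""
    n = len(samples) // frame_len
    if n == 0:
        return 0.0
    energies = np.sort(np.array([
        float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n)
    ]))
    k = max(1, n // 10)
    noise = float(np.mean(energies[:k])) + 1e-12    # quietest 10% of frames
    speech = float(np.mean(energies[-k:])) + 1e-12  # loudest 10% of frames
    return 10.0 * np.log10(speech / noise)
```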
In step S303, target identification audio data is determined from the audio data to be identified transmitted by at least one of the terminals according to the third quality evaluation parameter.
The target identification audio data may be audio data corresponding to a maximum quality evaluation parameter in the audio data to be identified, where the maximum quality evaluation parameter is a maximum quality evaluation parameter in the third quality evaluation parameters.
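Selecting the target recognition audio is then a simple maximum over the third quality evaluation parameters; a sketch, with the candidate structure assumed rather than specified here:

```python
def pick_target_recognition_audio(candidates: dict):
    """candidates maps terminal_id -> (third_quality_score, audio).
    Returns the terminal id and audio with the maximum quality score."""
    best_id = max(candidates, key=lambda tid: candidates[tid][0])
    return best_id, candidates[best_id][1]
```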
In step S304, speech recognition is performed on the target recognition audio data.
By adopting the method, the server can analyze the audio quality of the audio data to be recognized uploaded by the terminals in the terminal group network, and selects the target recognition audio data with the best audio quality to perform voice recognition, so that the accuracy of the voice recognition can be improved, and the use experience of a user can be improved.
FIG. 4 is a flow chart illustrating a method of speech recognition, as shown in FIG. 4, according to an exemplary embodiment, including the steps of:
in step S401, the target terminal collects audio data in real time.
The target terminal may be any terminal in the terminal group network, and the audio data may include one or more audio data of a wake-up audio (such as a wake-up word) emitted by a user, an instruction audio (such as "how is the weather today"), an environmental noise audio, and a play audio (such as music) of any terminal in the terminal group network.
In one possible implementation, the audio data is collected in real time by a microphone on the target terminal and is written into the target terminal's buffer space with overwrite caching. For example, if the buffer space can hold 10 seconds of audio, then when the 11th second of audio is collected, the 1st second of audio cached earlier is deleted from the buffer space and the 11th second is stored in its place, so that the buffer then holds the audio from the 2nd to the 11th second.
In addition, in order to facilitate echo cancellation of the collected audio data, the time stamp corresponding to each frame of audio data can be recorded after the audio data is collected, so that the collected audio data and the time stamp of each frame of audio data can be cached and recorded.
In step S402, the target terminal performs echo cancellation on the audio data collected in real time.
In the present disclosure, in order to improve accuracy of voice recognition, echo cancellation may be performed on audio data collected by a microphone, and in the related art, if a target terminal plays music, local echo cancellation may be performed on the audio data collected by the microphone, so as to eliminate interference of the music.
In one possible implementation, whenever any terminal in the terminal group network (including the target terminal) plays audio, it records a timestamp for the start of playback; for example, it can detect whether an echo signal is present and take the earliest time at which the echo signal is detected (i.e., the transition from no echo signal to echo signal) as the playback start time. It then records a timestamp for each frame of the played audio at a preset interval (e.g., one frame every 10 ms), sends each frame and its timestamp to the other terminals in the network, and also stores each frame and its timestamp locally. Accordingly, in this step the target terminal can receive the external echo audio data sent by the other terminals together with the timestamp of each frame of that external echo audio data, and/or obtain the echo audio data cached locally on the target terminal together with the timestamp of each frame of that locally cached echo audio data. During echo cancellation, the target terminal searches the first audio data collected in the current time period for the audio corresponding to the timestamp of each frame of echo audio data, obtaining aligned audio data, and then performs echo cancellation on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the locally cached echo audio data. Here the current time period may be the preset caching period corresponding to the target terminal's buffer space; for example, the buffer space may be preset to cache at most 10 seconds of audio, and the current time period is then the most recent 10-second period ending at the current time.
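The timestamp alignment and the cancellation filter are not spelled out above; the sketch below aligns microphone frames with echo (reference) frames by nearest timestamp and then applies a plain NLMS adaptive filter, which is one common echo cancellation technique and is chosen here purely for illustration. The tolerance, tap count, and step size are assumed values:

```python
import numpy as np

def align_by_timestamp(mic_frames, ref_frames, tolerance: float = 0.005):
    """mic_frames / ref_frames: lists of (timestamp, np.ndarray).
    For each echo (reference) frame, find the microphone frame whose
    timestamp is closest, keeping only matches within the tolerance."""
    pairs = []
    for ref_ts, ref in ref_frames:
        mic_ts, mic = min(mic_frames, key=lambda f: abs(f[0] - ref_ts))
        if abs(mic_ts - ref_ts) <= tolerance:
            pairs.append((mic, ref))
    return pairs

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.5, eps: float = 1e-6) -> np.ndarray:
    """Estimate the echo contributed by the reference signal with an NLMS
    adaptive filter and subtract it from the microphone signal."""
    n_samples = min(len(mic), len(ref))
    w = np.zeros(taps)
    out = np.array(mic[:n_samples], dtype=float)
    for n in range(taps, n_samples):
        x = ref[n - taps:n][::-1]           # most recent reference samples
        echo_est = float(w @ x)             # estimated echo sample
        e = mic[n] - echo_est               # echo-cancelled sample
        w += mu * e * x / (float(x @ x) + eps)
        out[n] = e
    return out
```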
It should also be noted that, to ensure that audio alignment based on per-frame timestamps is possible and that echo cancellation is therefore accurate, in the present disclosure the clocks of all terminals in the terminal group network and of the corresponding server need to be aligned before the per-frame timestamps are recorded; for example, time synchronization can be performed through NTP (Network Time Protocol).
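For reference, the standard NTP clock-offset estimate that such a synchronization step relies on is shown below; this formula is general NTP background rather than text from the present disclosure:

```python
def ntp_clock_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """t0: client send time, t1: server receive time,
    t2: server send time,  t3: client receive time (all in seconds).
    Returns the estimated offset of the client clock relative to the server."""
    return ((t1 - t0) + (t2 - t3)) / 2.0
```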
In addition, in order to further improve the accuracy of speech recognition, the present disclosure may further perform noise reduction processing on the audio data after echo cancellation, and the specific noise reduction method may refer to descriptions in the related art and is not described herein again.
In the related art, after local echo cancellation and noise reduction are applied to the collected audio, only the wake-up terminal selected by the nearby wake-up logic in the terminal group network sends the collected instruction audio to the server. The inventor found, however, that if the position where the user utters the wake-up audio does not coincide with the position of the wake-up terminal that captures the command audio, or if the audio collected by the wake-up terminal's microphone is of lower quality than the audio collected by other, non-woken terminals, then the audio sent to the server by the wake-up terminal is of lower quality, which harms the accuracy of the server's speech recognition. To solve this technical problem, in this embodiment the target terminal performs audio quality analysis on the audio collected in the current time period (step S403) to obtain the first quality evaluation parameter, and receives (steps S404 to S405) the second quality evaluation parameter obtained by the other terminals in the terminal group network from analysing their own collected audio. The target terminal then determines (step S406), according to the first quality evaluation parameter and the second quality evaluation parameter, whether it is the data uploading terminal, and, when it is, sends the audio data to be recognized to the server, thereby guaranteeing the quality of the uploaded audio and improving the accuracy of the server's speech recognition.
In step S403, when a preset quality evaluation triggering condition is satisfied, the target terminal obtains target audio data obtained by performing echo cancellation on interference audio data in the first audio data, and obtains a first quality evaluation parameter of the target audio data.
The first audio data is the audio collected in the current time period, and the first quality evaluation parameter may be a signal-to-noise ratio or an audio quality score. The preset quality evaluation trigger condition may include: the first audio data collected in the current time period contains a preset wake-up word (such as "Xiao Ai Tongxue"); or a first quality evaluation indication message sent by another terminal is received, the first quality evaluation indication message instructing the target terminal to perform audio quality analysis on the first audio data.
Considering that the wake-up engines of different terminals in the terminal group network differ in performance, the engine with better performance will recognize the preset wake-up word first. To ensure both the timeliness and the accuracy of speech recognition, in the present disclosure the terminal that first recognizes the preset wake-up word triggers the other terminals in the terminal group network to perform audio quality analysis together. Thus, if another terminal in the terminal group network recognizes, before the target terminal does, that its collected second audio data contains the preset wake-up word, that terminal sends the first quality evaluation indication message to the target terminal; the target terminal then receives this message and, according to it, triggers audio quality analysis of the first audio data.
That is, in the present disclosure, when it is determined that any one of the above conditions is satisfied, the target terminal may perform audio quality analysis on the target audio data.
In addition, in this step, the first quality evaluation parameter may be obtained by performing audio quality analysis on the target audio data in either of the following two manners:
In the first manner, signal-to-noise ratio analysis is performed on the target audio data to obtain the first quality evaluation parameter.
In the second manner, audio quality analysis may be performed on the target audio data through a pre-trained audio quality analysis model (e.g., a deep learning model) to obtain the first quality evaluation parameter.
It should be noted that, for the specific implementation steps of the two manners, reference may be made to descriptions in the related art, and details are not described herein.
In step S404, if the first audio data includes the preset wake-up word, the target terminal sends a second quality evaluation indication message to the other terminal, where the second quality evaluation indication message is used to indicate the other terminal to perform audio quality analysis on the second audio data.
In this step, if the target terminal recognizes that the first audio data includes the preset wake-up word, it indicates that the target terminal is the terminal in the terminal group network that recognizes the preset wake-up word first, and in this case, the target terminal may trigger other terminals in the group network to perform audio quality analysis on the collected second audio data to obtain the second quality evaluation parameter.
In step S405, the target terminal receives the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
The second quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained by performing audio quality analysis on the second audio data.
In step S406, the target terminal determines whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter.
The second quality evaluation parameter includes at least one quality evaluation parameter; in general, the second quality evaluation parameters correspond one to one to the other terminals in the terminal group network, each of the other terminals contributing its own second quality evaluation parameter. The data uploading terminal is the terminal that sends the audio data to be recognized to the server, and the audio data to be recognized may include instruction audio input by the user, such as "what is the weather like today" or "please play a children's song".
In this step, the target terminal is used as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold (the preset evaluation threshold being a minimum threshold that characterizes acceptable audio quality). Otherwise, when the first quality evaluation parameter is below the preset evaluation threshold, the target terminal determines whether every second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and if so, the target terminal is likewise used as the data uploading terminal.
In addition, if a target quality evaluation parameter which is larger than the first quality evaluation parameter exists in the second quality evaluation parameter, calculating a difference value between the target quality evaluation parameter and the first quality evaluation parameter; and if the difference value of the preset number is smaller than or equal to a preset difference value threshold value, taking the target terminal as the data uploading terminal.
For ease of description, suppose the current terminal group network contains three terminals A, B, and C, with terminal A being the target terminal. Let the first quality evaluation parameter obtained by terminal A be Xa, the second quality evaluation parameters obtained by terminals B and C be Xb and Xc, and the preset evaluation threshold be N. If Xa >= N, terminal A (the target terminal) is determined to be the data uploading terminal. If Xa < N, it is further determined whether both Xb and Xc are smaller than Xa, i.e., whether the audio collected by terminal A has the best quality among the audio collected by A, B, and C; if Xb and Xc are both smaller than Xa, terminal A is also used as the data uploading terminal. If at least one of Xb and Xc is greater than or equal to Xa, the differences between those target quality evaluation parameters (the values among Xb and Xc that are greater than or equal to Xa) and Xa are calculated; if a preset number of these differences (in this example the preset number may be set to 1 or 2) are less than or equal to a preset difference threshold, which can be understood as meaning that the quality of the audio collected by the target terminal is not much worse than the best audio, terminal A is still used as the data uploading terminal.
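Putting the rules of step S406 and the A/B/C example together, one plausible reading of the decision is sketched below; the parameter names and the interpretation of "a preset number of differences" are assumptions:

```python
def is_data_upload_terminal(xa: float, peer_scores: list,
                            eval_threshold: float,
                            diff_threshold: float,
                            preset_count: int = 1) -> bool:
    """Decide whether the target terminal (score xa) should upload audio,
    given the peers' second quality evaluation parameters."""
    if xa >= eval_threshold:                       # quality already good enough
        return True
    if all(score <= xa for score in peer_scores):  # best audio in the group
        return True
    # Some peers scored higher: upload anyway if enough of them are only
    # marginally better (difference within the preset difference threshold).
    diffs = [score - xa for score in peer_scores if score > xa]
    return sum(1 for d in diffs if d <= diff_threshold) >= preset_count
```

For instance, with illustrative values Xa = 10, Xb = 11, Xc = 9, N = 12, a difference threshold of 2 and a preset count of 1, terminal A would still upload, since the only better-scoring peer (B) is within the threshold.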
In addition, in the case where it is determined, based on the quality evaluation parameters obtained by the audio quality analysis, that the target terminal is not the data uploading terminal, the present disclosure may also determine whether the target terminal is the data uploading terminal based on the wake-up terminal selected by the nearby wake-up decision, by performing steps S407 to S408.
In step S407, the target terminal receives the wake-up data sent by the other terminals, and determines whether to wake up the target terminal according to the first audio data and the wake-up data.
The wake-up data may include wake-up audio feature data, such as MFCC (Mel-Frequency Cepstral Coefficient) feature values.
In this step, whether to wake up the target terminal may be determined according to the first audio data and the wake-up data based on a distributed wake-up decision manner, and specific implementation steps may refer to descriptions in related technologies and are not described herein again.
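As a purely illustrative aid (the disclosure itself defers the decision to the related art), the sketch below assumes one possible distributed scheme: each terminal derives a score from the MFCC features of its own wake-up audio, terminals exchange those features, and a terminal wakes only if its own score is not beaten by any peer. The use of librosa and the energy-based scoring rule are assumptions for the example, not part of the patent.

import numpy as np
import librosa


def wake_score(audio, sample_rate=16000):
    """Toy wake-up score: mean squared magnitude of the MFCC features."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
    return float(np.mean(mfcc ** 2))


def should_wake(first_audio, peer_mfcc_features, sample_rate=16000):
    """Wake this terminal only if no peer reported stronger wake-up features."""
    own = wake_score(first_audio, sample_rate)
    peer_scores = [float(np.mean(np.asarray(f) ** 2)) for f in peer_mfcc_features]
    return all(own >= s for s in peer_scores)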
In step S408, if the target terminal determines, according to the first quality evaluation parameter and the second quality evaluation parameter, that it is not the data uploading terminal, the target terminal is still taken as the data uploading terminal in the case where it is determined that the target terminal should be woken up.
In step S409, in the case where the target terminal is determined to be the data uploading terminal, the target terminal sends the audio data to be recognized to the server so that the server performs speech recognition on the audio data to be recognized.
The audio data to be recognized may include instruction audio data input by a user, such as instruction audio data of "what is the weather like today" or "please play a children's song", and the like.
It should be noted that the target terminal may obtain, from the buffer space, the audio data collected in a target time period as the audio data to be recognized, where the target time period includes a target time and a preset time period before the target time, and the target time may be the time at which the target terminal is determined to be the data uploading terminal.
For example, the preset time period may be set to 5 seconds. Assuming that the most recent 10 seconds of collected audio data are stored in the buffer space and the target time is time t1, the target terminal may intercept, from the 10 seconds of buffered audio data, the audio data collected during the 5 seconds up to and including time t1 as the audio data to be recognized; a buffer-slicing sketch is given below.
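The sketch below illustrates this buffer slicing. The 16 kHz sample rate, the in-memory NumPy buffer and the variable names are assumptions made for the example; the description only specifies the 5-second preset time period and the 10-second buffer.

import numpy as np

SAMPLE_RATE = 16000          # assumed sample rate for the illustration
PRESET_WINDOW_SECONDS = 5    # preset time period before the target time


def extract_to_be_recognized(buffer: np.ndarray, buffer_end_time: float,
                             target_time: float) -> np.ndarray:
    """Return the audio collected during the 5 seconds up to the target time.

    buffer holds the most recently collected samples (e.g. the last 10 s),
    and buffer_end_time is the capture time of its final sample.
    """
    # How far (in samples) the target time lies before the end of the buffer.
    end_offset = int(round((buffer_end_time - target_time) * SAMPLE_RATE))
    end_index = len(buffer) - end_offset
    start_index = max(0, end_index - PRESET_WINDOW_SECONDS * SAMPLE_RATE)
    return buffer[start_index:end_index]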
It should be further noted that, as mentioned above, after the target terminal collects the audio data it may first perform echo cancellation on the collected audio data; therefore, the audio data to be recognized may be the audio data obtained after echo cancellation (both local echo cancellation and external echo cancellation), so that the target terminal sends echo-cancelled audio data to the server, which improves the accuracy of the server's speech recognition.
In step S410, the server performs audio quality analysis on the audio data to be identified sent by each data uploading terminal to obtain the third quality evaluation parameter.
The third quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained by performing audio quality analysis on the audio data to be identified.
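For illustration only, the sketch below shows one assumed way to turn a clip into such a signal-to-noise-ratio style score: treat the quietest frames as a noise-floor estimate and compare the overall energy against it. The frame length and the 10% noise-floor heuristic are choices made for the example, not values from the disclosure.

import numpy as np


def estimate_snr_db(audio: np.ndarray, frame_len: int = 512) -> float:
    """Rough SNR estimate (in dB) from the frame energies of a mono signal."""
    if len(audio) < frame_len:
        return 0.0
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    energies = np.sort(np.array([float(np.mean(f ** 2)) for f in frames]))
    noise = float(np.mean(energies[: max(1, len(energies) // 10)]))  # quietest ~10%
    signal = float(np.mean(energies))
    return 10.0 * np.log10((signal + 1e-12) / (noise + 1e-12))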
In step S411, the server determines target identification audio data from the audio data to be identified sent by at least one data uploading terminal according to the third quality evaluation parameter.
By way of example, assume that, after each terminal in the terminal group network performs steps S401-S408, the three terminals A, B and C are all determined to be data uploading terminals, so the server receives the audio data to be recognized sent by each of the three terminals. In this case, in order to further improve the accuracy of speech recognition, the server may perform quality analysis again on the audio data to be recognized sent by the three terminals to obtain the third quality evaluation parameters, determine, based on the third quality evaluation parameters, the target recognition audio data with the best audio quality from the audio data to be recognized sent by the three terminals A, B and C, and then perform speech recognition on that target recognition audio data, which can remarkably improve the accuracy of speech recognition. A sketch of this server-side selection is given below.
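The following hedged sketch covers steps S410-S411 and feeds step S412. The scorer callable and the "recognize" stand-in are assumptions for the example; the patent does not name a particular quality metric or speech recognition backend.

from typing import Callable, Dict, Tuple

import numpy as np


def pick_target_recognition_audio(
    uploads: Dict[str, np.ndarray],
    quality_score: Callable[[np.ndarray], float],
) -> Tuple[str, np.ndarray]:
    """uploads maps data-uploading-terminal id -> audio; returns the best clip."""
    third_params = {tid: quality_score(audio) for tid, audio in uploads.items()}
    best_tid = max(third_params, key=third_params.get)
    return best_tid, uploads[best_tid]


# Example (step S412 would then run speech recognition on the selected clip):
# best_tid, target_audio = pick_target_recognition_audio(
#     {"A": audio_a, "B": audio_b, "C": audio_c}, quality_score=estimate_snr_db)
# text = recognize(target_audio)   # "recognize" is a hypothetical ASR call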
In step S412, the server performs speech recognition on the target recognition audio data.
By adopting the method, audio quality analysis can be performed on the target terminal side on the first audio data collected in the current time period to obtain the first quality evaluation parameter, and whether the target terminal is a data uploading terminal is determined by combining the received second quality evaluation parameters, which are the quality analysis results of the audio data collected by the other terminals; the audio data to be recognized is sent to the server only in the case where the target terminal is determined to be the data uploading terminal, so that the audio quality of the audio data to be recognized uploaded by the data uploading terminal is ensured and the accuracy of speech recognition is improved. In addition, before the audio quality analysis is performed and before the audio data to be recognized is uploaded to the server, the present disclosure may perform global echo cancellation on the first audio data and the audio data to be recognized, that is, the received audio data played by other terminals and the audio data played by the target terminal itself are cancelled from the audio data collected by the microphone, and the audio data to be recognized after global echo cancellation is uploaded to the server, thereby further improving the accuracy of speech recognition.
Fig. 5 is a block diagram illustrating an apparatus for speech recognition according to an exemplary embodiment, applied to a target terminal, as shown in fig. 5, the apparatus including:
a first obtaining module 501 configured to obtain first audio data collected at a current time period;
a second obtaining module 502, configured to obtain a first quality evaluation parameter of the first audio data if a preset quality evaluation triggering condition is met;
a first receiving module 503, configured to receive a second quality evaluation parameter sent by a terminal other than the target terminal in the terminal group network where the target terminal is located;
a first determining module 504 configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter;
a first sending module 505, configured to send to-be-recognized audio data to a server so that the server performs voice recognition on the to-be-recognized audio data when the target terminal is determined to be the data uploading terminal, where the to-be-recognized audio data is audio data collected within a target time period, the target time period includes a target time and a preset time period before the target time, and the target time is a time when the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes:
the first audio data collected in the current time period comprises preset awakening words; or,
and receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
Optionally, fig. 6 is a block diagram of an apparatus for speech recognition according to the embodiment shown in fig. 5, and as shown in fig. 6, the apparatus further includes:
a second sending module 506, configured to send a second quality evaluation indication message to the other terminal if the first audio data includes the preset wakeup word, where the second quality evaluation indication message is used to indicate the other terminal to perform audio quality analysis on second audio data;
the first receiving module 503 is configured to receive the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
Optionally, the second quality-assessment parameter comprises at least one quality-assessment parameter,
the first determining module 504 is configured to take the target terminal as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or, when the first quality evaluation parameter is less than the preset evaluation threshold, to determine whether each of the second quality evaluation parameters is less than or equal to the first quality evaluation parameter, and to take the target terminal as the data uploading terminal if each of the second quality evaluation parameters is less than or equal to the first quality evaluation parameter.
Optionally, the first determining module 504 is configured to calculate a difference between the target quality evaluation parameter and the first quality evaluation parameter if a target quality evaluation parameter greater than the first quality evaluation parameter exists among the second quality evaluation parameters, and to take the target terminal as the data uploading terminal if a preset number of the differences are less than or equal to a preset difference threshold.
Optionally, fig. 7 is a block diagram of an apparatus for speech recognition according to the embodiment shown in fig. 5, and as shown in fig. 7, the apparatus further includes:
a second receiving module 507 configured to receive the wake-up data sent by the other terminal;
a second determining module 508 configured to determine whether to wake up the target terminal according to the first audio data and the wake-up data; and taking the target terminal as the data uploading terminal under the condition of determining to awaken the target terminal.
Optionally, fig. 8 is a block diagram of an apparatus for speech recognition according to the embodiment shown in fig. 5, and as shown in fig. 8, the apparatus further includes:
an echo cancellation module 509 configured to perform echo cancellation on the interference audio data in the first audio data to obtain target audio data;
the second obtaining module 502 is configured to obtain the first quality-assessment parameter of the target audio data.
Optionally, fig. 9 is a block diagram of an apparatus for speech recognition according to the embodiment shown in fig. 8, and as shown in fig. 9, the apparatus further includes:
a third receiving module 510, configured to receive external echo audio data sent by the other terminal and a timestamp corresponding to each frame of external echo audio data; and/or, obtaining echo audio data locally cached by the target terminal and a timestamp of each frame of the locally cached echo audio data;
the echo cancellation module 509 is configured to search corresponding audio data from the first audio data according to the timestamp of each frame of acquired echo audio data, so as to obtain aligned audio data; performing echo cancellation on the first audio data according to the aligned audio data and echo audio data, where the echo audio data includes the external echo audio data and/or the local cache echo audio data.
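A minimal sketch of the timestamp alignment and cancellation performed by this module is shown below, under the assumption that audio arrives as per-frame dictionaries keyed by timestamp. The simple scaled subtraction is only a placeholder for a real acoustic echo cancellation algorithm, and all names are illustrative rather than taken from the disclosure.

import numpy as np


def align_frames(mic_frames: dict, echo_frames: dict) -> list:
    """Pair up microphone and echo frames that carry the same timestamp."""
    common = sorted(set(mic_frames) & set(echo_frames))
    return [(mic_frames[ts], echo_frames[ts]) for ts in common]


def cancel_echo(mic_frames: dict, echo_frames: dict, leak: float = 0.9) -> np.ndarray:
    """Subtract a scaled copy of the aligned echo audio from the microphone audio."""
    cleaned = [mic - leak * echo for mic, echo in align_frames(mic_frames, echo_frames)]
    return np.concatenate(cleaned) if cleaned else np.array([])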
By adopting the device, audio quality analysis can be performed on the target terminal side on the first audio data collected in the current time period to obtain the first quality evaluation parameter, and whether the target terminal is a data uploading terminal is determined by combining the received second quality evaluation parameters, which are the quality analysis results of the audio data collected by the other terminals; the audio data to be recognized is sent to the server only in the case where the target terminal is determined to be the data uploading terminal, so that the audio quality of the audio data to be recognized uploaded by the data uploading terminal is ensured and the accuracy of speech recognition is improved. In addition, before the audio quality analysis is performed and before the audio data to be recognized is uploaded to the server, the present disclosure may perform global echo cancellation on the first audio data and the audio data to be recognized, that is, the received audio data played by other terminals and the audio data played by the target terminal itself are cancelled from the audio data collected by the microphone, and the audio data to be recognized after global echo cancellation is uploaded to the server, thereby further improving the accuracy of speech recognition.
Fig. 10 is a block diagram illustrating an apparatus for speech recognition according to an exemplary embodiment of the present disclosure, applied to a server, as shown in fig. 10, the apparatus including:
a fourth receiving module 1001 configured to receive audio data to be identified sent by at least one terminal in the terminal group network;
the audio quality analysis module 1002 is configured to perform audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter;
a third determining module 1003 configured to determine target identification audio data from the audio data to be identified sent by at least one of the terminals according to the third quality evaluation parameter;
a speech recognition module 1004 configured to perform speech recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the fourth receiving module 1001 is configured to receive the audio data to be identified sent by at least one data uploading terminal in the terminal group network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the acquired first audio data by the target terminal under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the acquired second audio data by other terminals except the target terminal in the terminal group network;
the audio quality analysis module 1002 is configured to perform audio quality analysis on the audio data to be identified sent by each data uploading terminal, so as to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation is performed by the terminal;
the audio quality analysis module 1002 is configured to perform audio quality analysis on each of the echo cancellation audio data to obtain the third quality evaluation parameter.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
By adopting the device, the server can analyze the audio quality of the audio data to be recognized uploaded by the terminals in the terminal group network, and selects the target recognition audio data with the best audio quality to perform voice recognition, so that the accuracy of the voice recognition can be improved, and the use experience of a user can be improved.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of speech recognition provided by the present disclosure.
FIG. 11 is a block diagram illustrating an apparatus 1100 for speech recognition according to an example embodiment. For example, the apparatus 1100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 11, apparatus 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114, and a communication component 1116.
The processing component 1102 generally controls the overall operation of the device 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 1102 may include one or more processors 1120 to execute instructions to perform all or a portion of the steps of the method of speech recognition described above. Further, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operations at the apparatus 1100. Examples of such data include instructions for any application or method operating on device 1100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1104 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 1106 provide power to the various components of device 1100. The power components 1106 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 1100.
The multimedia component 1108 includes a screen that provides an output interface between the device 1100 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1100 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1100 is in operating modes, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio assembly 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1114 includes one or more sensors for providing status assessments of various aspects of the apparatus 1100. For example, the sensor assembly 1114 may detect an open/closed state of the apparatus 1100 and the relative positioning of components, such as the display and keypad of the apparatus 1100; the sensor assembly 1114 may also detect a change in position of the apparatus 1100 or a component of the apparatus 1100, the presence or absence of user contact with the apparatus 1100, the orientation or acceleration/deceleration of the apparatus 1100, and a change in temperature of the apparatus 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the apparatus 1100 and other devices. The apparatus 1100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1116 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 also includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described voice recognition methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 1104 comprising instructions, executable by the processor 1120 of the device 1100 to perform the speech recognition method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned speech recognition method when executed by the programmable apparatus.
Fig. 12 is a block diagram illustrating an apparatus 1200 for speech recognition according to an example embodiment. For example, the apparatus 1200 may be provided as a server. Referring to fig. 12, the apparatus 1200 includes a processing component 1222 that further includes one or more processors, and memory resources, represented by memory 1232, for storing instructions, such as application programs, that are executable by the processing component 1222. The application programs stored in memory 1232 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1222 is configured to execute instructions to perform the speech recognition methods described above.
The apparatus 1200 may also include a power supply component 1226 configured to perform power management of the apparatus 1200, a wired or wireless network interface 1250 configured to connect the apparatus 1200 to a network, and an input/output (I/O) interface 1258. The apparatus 1200 may operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (26)

1. A voice recognition method is applied to a target terminal and comprises the following steps:
acquiring first audio data acquired in the current time period;
acquiring a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met;
receiving a second quality evaluation parameter sent by other terminals except the target terminal in the terminal group network where the target terminal is located;
determining whether the target terminal is a data uploading terminal or not according to the first quality evaluation parameter and the second quality evaluation parameter;
and under the condition that the target terminal is determined to be the data uploading terminal, sending audio data to be identified to a server so that the server performs voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
2. The method of claim 1, wherein the preset quality assessment triggering condition comprises:
the first audio data collected in the current time period comprises preset awakening words; or,
and receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
3. The method according to claim 2, wherein before said receiving the second quality evaluation parameter transmitted by the other terminal except the target terminal in the terminal group network where the target terminal is located, the method further comprises:
if the first audio data comprises the preset awakening word, sending a second quality evaluation indication message to the other terminal, wherein the second quality evaluation indication message is used for indicating the other terminal to perform audio quality analysis on the second audio data;
the receiving of the second quality evaluation parameter sent by the other terminals except the target terminal in the terminal group network where the target terminal is located includes:
and receiving the second quality evaluation parameter sent by the other terminals according to the second quality evaluation indication message.
4. The method of claim 1, wherein the second quality evaluation parameter comprises at least one quality evaluation parameter, and wherein the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter comprises:
taking the target terminal as the data uploading terminal under the condition that the first quality evaluation parameter is greater than or equal to a preset evaluation threshold value; or,
and under the condition that the first quality evaluation parameter is smaller than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
5. The method of claim 4, wherein the determining whether the target terminal is a data upload terminal according to the first quality evaluation parameter and the second quality evaluation parameter further comprises:
if a target quality evaluation parameter which is larger than the first quality evaluation parameter exists in the second quality evaluation parameter, calculating a difference value between the target quality evaluation parameter and the first quality evaluation parameter;
and if the difference value of the preset number is smaller than or equal to a preset difference value threshold value, taking the target terminal as the data uploading terminal.
6. The method of claim 1, further comprising:
receiving awakening data sent by other terminals;
determining whether to awaken the target terminal according to the first audio data and the awakening data;
and taking the target terminal as the data uploading terminal under the condition of determining to awaken the target terminal.
7. The method according to any of claims 1 to 6, wherein prior to said obtaining a first quality-assessment parameter of said first audio data, the method further comprises:
performing echo cancellation on interference audio data in the first audio data to obtain target audio data;
the obtaining of the first quality-assessment parameter of the first audio data includes:
and acquiring the first quality evaluation parameter of the target audio data.
8. The method of claim 7, wherein prior to the echo canceling interfering audio data in the first audio data, the method further comprises:
receiving external echo audio data sent by other terminals and a timestamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a timestamp of each frame of the locally cached echo audio data;
the performing echo cancellation on the interfering audio data in the first audio data comprises:
searching corresponding audio data from the first audio data according to the acquired time stamp of each frame of echo audio data to obtain aligned audio data;
performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the local cache echo audio data.
9. A method of speech recognition, applied to a server, the method comprising:
receiving audio data to be identified sent by at least one terminal in a terminal group network;
performing audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter;
determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter;
and performing voice recognition on the target recognition audio data.
10. The method of claim 9, wherein the terminal comprises a data uploading terminal, and wherein the receiving the audio data to be recognized sent by at least one terminal in the terminal group network comprises:
receiving the audio data to be identified sent by at least one data uploading terminal in the terminal group network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the acquired first audio data by the target terminal under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the acquired second audio data by other terminals except the target terminal in the terminal group network;
the audio quality analysis of each audio data to be identified to obtain a third quality evaluation parameter includes:
and performing audio quality analysis on the audio data to be identified sent by each data uploading terminal to obtain the third quality evaluation parameter.
11. The method according to claim 9 or 10, wherein the audio data to be identified comprises echo cancellation audio data obtained by performing echo cancellation by the terminal;
the audio quality analysis of each audio data to be identified to obtain a third quality evaluation parameter includes:
and performing audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
12. A speech recognition device, applied to a target terminal, includes:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is configured to acquire first audio data acquired in a current time period;
the second acquisition module is configured to acquire a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met;
a first receiving module configured to receive a second quality evaluation parameter sent by a terminal other than the target terminal in a terminal group network in which the target terminal is located;
a first determining module configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter;
the first sending module is configured to send audio data to be recognized to a server under the condition that the target terminal is determined to be the data uploading terminal, so that the server performs voice recognition on the audio data to be recognized, the audio data to be recognized is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
13. The apparatus of claim 12, wherein the preset quality assessment triggering condition comprises:
the first audio data collected in the current time period comprises preset awakening words; or,
and receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
14. The apparatus of claim 13, further comprising:
a second sending module, configured to send a second quality evaluation indication message to the other terminal if the first audio data includes the preset wake-up word, where the second quality evaluation indication message is used to indicate the other terminal to perform audio quality analysis on second audio data;
the first receiving module is configured to receive the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
15. The apparatus of claim 12, wherein the second quality-assessment parameter comprises at least one quality-assessment parameter,
the first determining module is configured to take the target terminal as the data uploading terminal under the condition that the first quality evaluation parameter is greater than or equal to a preset evaluation threshold value; or, under the condition that the first quality evaluation parameter is less than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
16. The apparatus according to claim 15, wherein the first determining module is configured to calculate a difference between the target quality-evaluation parameter and the first quality-evaluation parameter if there is a target quality-evaluation parameter greater than the first quality-evaluation parameter in the second quality-evaluation parameters; and if the difference value of the preset number is smaller than or equal to a preset difference value threshold value, taking the target terminal as the data uploading terminal.
17. The apparatus of claim 12, further comprising:
a second receiving module configured to receive the wake-up data sent by the other terminal;
a second determining module configured to determine whether to wake up the target terminal according to the first audio data and the wake-up data; and taking the target terminal as the data uploading terminal under the condition of determining to awaken the target terminal.
18. The apparatus of any one of claims 12 to 17, further comprising:
the echo cancellation module is configured to perform echo cancellation on interference audio data in the first audio data to obtain target audio data;
the second obtaining module is configured to obtain the first quality assessment parameter of the target audio data.
19. The apparatus of claim 18, further comprising:
the third receiving module is configured to receive the external echo audio data sent by the other terminal and a timestamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a timestamp of each frame of the locally cached echo audio data;
the echo cancellation module is configured to search for corresponding audio data from the first audio data according to the obtained timestamp of each frame of the echo audio data, so as to obtain aligned audio data; performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the local cache echo audio data.
20. An apparatus for speech recognition, applied to a server, the apparatus comprising:
the fourth receiving module is configured to receive audio data to be identified, which is sent by at least one terminal in the terminal group network;
the audio quality analysis module is configured to perform audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter;
a third determining module configured to determine target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter;
a speech recognition module configured to perform speech recognition on the target recognition audio data.
21. The apparatus of claim 20, wherein the terminal comprises a data uploading terminal, and the fourth receiving module is configured to receive the audio data to be identified sent by at least one of the data uploading terminals in the terminal group network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the acquired first audio data by the target terminal under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the acquired second audio data by other terminals except the target terminal in the terminal group network;
the audio quality analysis module is configured to perform audio quality analysis on the audio data to be identified sent by each data uploading terminal to obtain the third quality evaluation parameter.
22. The apparatus according to claim 20 or 21, wherein the audio data to be identified comprises echo cancellation audio data obtained by performing echo cancellation by the terminal;
the audio quality analysis module is configured to perform audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
23. A speech recognition device, applied to a target terminal, includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring first audio data acquired in the current time period; acquiring a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met; receiving a second quality evaluation parameter sent by other terminals except the target terminal in the terminal group network where the target terminal is located; determining whether the target terminal is a data uploading terminal or not according to the first quality evaluation parameter and the second quality evaluation parameter; and under the condition that the target terminal is determined to be the data uploading terminal, sending audio data to be identified to a server so that the server performs voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises a target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
24. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
25. A speech recognition device, applied to a server, includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: receiving audio data to be identified sent by at least one terminal in a terminal group network; performing audio quality analysis on each audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; and performing voice recognition on the target recognition audio data.
26. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 9 to 11.
CN202011282047.3A 2020-11-16 Speech recognition method, device and storage medium Active CN112489653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282047.3A CN112489653B (en) 2020-11-16 Speech recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112489653A true CN112489653A (en) 2021-03-12
CN112489653B CN112489653B (en) 2024-04-26


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015041986A (en) * 2013-08-23 2015-03-02 エヌ・ティ・ティ・コミュニケーションズ株式会社 Communication terminal, quality index retention server, quality index output system, quality index output method, quality index retention method and computer program
US20190005954A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Wake-on-voice method, terminal and storage medium
WO2020042993A1 (en) * 2018-08-29 2020-03-05 阿里巴巴集团控股有限公司 Voice control method, apparatus and system
CN109460891A (en) * 2018-09-25 2019-03-12 平安科技(深圳)有限公司 Data processing method, device and computer equipment based on satisfaction evaluation
CN111341317A (en) * 2020-02-19 2020-06-26 Oppo广东移动通信有限公司 Method and device for evaluating awakening audio data, electronic equipment and medium
CN111522592A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Intelligent terminal awakening method and device based on artificial intelligence
CN111722824A (en) * 2020-05-29 2020-09-29 北京小米松果电子有限公司 Voice control method, device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘文旭; 沈庭芝; 田卉; 刘朋樟: "Real-time speech quality evaluation system based on a sound card", Audio Engineering (电声技术), no. 11 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223538A (en) * 2021-04-01 2021-08-06 北京百度网讯科技有限公司 Voice wake-up method, device, system, equipment and storage medium
CN113223538B (en) * 2021-04-01 2022-05-03 北京百度网讯科技有限公司 Voice wake-up method, device, system, equipment and storage medium
CN113380248A (en) * 2021-06-11 2021-09-10 北京声智科技有限公司 Voice control method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109920418B (en) Method and device for adjusting awakening sensitivity
US10205817B2 (en) Method, device and storage medium for controlling screen state
CN104951335B (en) The processing method and processing device of application program installation kit
EP3933570A1 (en) Method and apparatus for controlling a voice assistant, and computer-readable storage medium
US20170034336A1 (en) Event prompting method and device
CN111696553B (en) Voice processing method, device and readable medium
CN110619873A (en) Audio processing method, device and storage medium
CN109087650B (en) Voice wake-up method and device
CN111722824A (en) Voice control method, device and computer storage medium
CN111540350B (en) Control method, device and storage medium of intelligent voice control equipment
CN110634488B (en) Information processing method, device and system and storage medium
CN109522058B (en) Wake-up method, device, terminal and storage medium
CN112509596A (en) Wake-up control method and device, storage medium and terminal
CN112133302A (en) Method, device and storage medium for pre-awakening terminal
CN111966412A (en) Method, device and storage medium for waking up terminal
CN112185388A (en) Speech recognition method, device, equipment and computer readable storage medium
CN111696550A (en) Voice processing method and device for voice processing
CN111127846A (en) Door-knocking reminding method, door-knocking reminding device and electronic equipment
CN112489653B (en) Speech recognition method, device and storage medium
CN112532789B (en) Ring tone processing method and device, terminal and storage medium
CN114283793A (en) Voice wake-up method, device, electronic equipment, medium and program product
CN112489653A (en) Speech recognition method, device and storage medium
CN112489650A (en) Wake-up control method and device, storage medium and terminal
CN110928589A (en) Information processing method, device and storage medium
CN109088920B (en) Evaluation method, device and equipment of intelligent sound box and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant