CN112489653B - Speech recognition method, device and storage medium - Google Patents

Speech recognition method, device and storage medium

Info

Publication number
CN112489653B
CN112489653B (application CN202011282047.3A)
Authority
CN
China
Prior art keywords
audio data
terminal
quality evaluation
evaluation parameter
target
Prior art date
Legal status
Active
Application number
CN202011282047.3A
Other languages
Chinese (zh)
Other versions
CN112489653A (en)
Inventor
程思
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202011282047.3A
Publication of CN112489653A
Application granted
Publication of CN112489653B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 2021/02082 - Noise filtering the noise being echo, reverberation of the speech
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a speech recognition method, a speech recognition device, and a storage medium. A target terminal acquires first audio data collected in a current time period; when a preset quality evaluation triggering condition is met, it obtains a first quality evaluation parameter of the first audio data; it receives second quality evaluation parameters sent by terminals other than the target terminal in the terminal network where the target terminal is located; it determines whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameters; and, when the target terminal is the data uploading terminal, it sends audio data to be identified to a server so that the server performs speech recognition on the audio data to be identified. The audio data to be identified is the audio data collected in a target time period, the target time period includes a target time and a preset time period before the target time, and the target time is the time at which the target terminal is determined to be the data uploading terminal.

Description

Speech recognition method, device and storage medium
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method, an apparatus, and a storage medium for speech recognition.
Background
In recent years, speech recognition technology has advanced significantly, and human-machine interaction through voice has attracted wide attention. A large number of voice-interaction-based intelligent devices have emerged, such as smart speakers, smart air conditioners, and voice assistants, and a user can wake such a device by speaking a wake-up word. In a practical application scenario, multiple intelligent devices may exist in the same space; after the user inputs wake-up audio, the intelligent device closest to the user can be selected from the multiple devices through a distributed decision to respond.
In the related art, after an intelligent device collects input audio, the audio is fed to a wake-up engine, which performs a nearby-wake-up decision; once the wake-up engine judges that the current device should be woken up, the device starts its automatic speech recognition (ASR) audio uploading function and uploads the input audio to a server, so that the server performs speech recognition on the input audio and human-machine interaction is achieved. The inventors have found that if the location of the woken device is inconsistent with the location where the user inputs the audio instruction, or if the audio quality collected by the microphone of the woken device is worse than that collected by other devices that were not woken up, the accuracy of the server's speech recognition may be affected.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, apparatus, and storage medium for speech recognition.
According to a first aspect of embodiments of the present disclosure, a speech recognition method applied to a target terminal is provided, including: acquiring first audio data collected in a current time period; acquiring a first quality evaluation parameter of the first audio data when a preset quality evaluation triggering condition is met; receiving a second quality evaluation parameter sent by other terminals, except the target terminal, in a terminal network where the target terminal is located; determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter; and, when the target terminal is the data uploading terminal, sending audio data to be identified to a server so that the server performs speech recognition on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period includes a target time and a preset time period before the target time, and the target time is the time at which the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes: the first audio data acquired in the current time period comprise preset wake-up words; or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to instruct the target terminal to perform audio quality analysis on the first audio data.
Optionally, before the receiving the second quality evaluation parameters sent by other terminals except the target terminal in the terminal network where the target terminal is located, the method further includes: if the first audio data comprises the preset wake-up word, sending a second quality evaluation indication message to the other terminal, wherein the second quality evaluation indication message is used for indicating the other terminal to perform audio quality analysis on the second audio data; the receiving the second quality evaluation parameters sent by other terminals except the target terminal in the terminal network where the target terminal is located includes: and receiving the second quality evaluation parameters sent by the other terminals according to the second quality evaluation indication message.
Optionally, the second quality evaluation parameter includes at least one quality evaluation parameter, and determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter includes: taking the target terminal as the data uploading terminal if the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or, if the first quality evaluation parameter is less than the preset evaluation threshold, determining whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and taking the target terminal as the data uploading terminal if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter.
Optionally, the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter further includes: if the second quality evaluation parameter has a target quality evaluation parameter which is larger than the first quality evaluation parameter, calculating a difference value between the target quality evaluation parameter and the first quality evaluation parameter; and if the preset number of the differences is smaller than or equal to a preset difference threshold, taking the target terminal as the data uploading terminal.
Optionally, the method further comprises: receiving wake-up data sent by the other terminals; determining whether to wake up the target terminal according to the first audio data and the wake-up data; and under the condition that the target terminal is determined to be awakened, the target terminal is used as the data uploading terminal.
Optionally, before the acquiring the first quality evaluation parameter of the first audio data, the method further comprises: echo cancellation is carried out on the interference audio data in the first audio data to obtain target audio data; the obtaining the first quality evaluation parameter of the first audio data includes: the first quality evaluation parameter of the target audio data is acquired.
Optionally, before the echo cancellation of the interfering audio data in the first audio data, the method further comprises: receiving external echo audio data sent by the other terminals and a time stamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a time stamp of the locally cached echo audio data of each frame; the echo cancellation of the interfering audio data in the first audio data includes: searching corresponding audio data from the first audio data according to the obtained time stamp of each frame of the echo audio data to obtain aligned audio data; and performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, wherein the echo audio data comprises the external echo audio data and/or the locally cached echo audio data.
According to a second aspect of embodiments of the present disclosure, there is provided a method of speech recognition, applied to a server, the method comprising: receiving audio data to be identified sent by at least one terminal in a terminal networking; carrying out audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; and carrying out voice recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the audio data to be identified sent by at least one terminal in the receiving terminal network includes: receiving the audio data to be identified sent by at least one data uploading terminal in the terminal networking; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, wherein the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the collected first audio data under the condition that the target terminal meets a preset trigger condition, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the collected second audio data by other terminals except the target terminal in the terminal networking; the audio quality analysis is carried out on each piece of audio data to be identified, and the obtaining of the third quality evaluation parameter comprises the following steps: and carrying out audio quality analysis on the audio data to be identified, which are sent by each data uploading terminal, so as to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation by the terminal; the audio quality analysis is carried out on each piece of audio data to be identified, and the obtaining of the third quality evaluation parameter comprises the following steps: and carrying out audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
According to a third aspect of the embodiments of the present disclosure, there is provided a voice recognition apparatus applied to a target terminal, including: the first acquisition module is configured to acquire first audio data acquired in the current time period; the second acquisition module is configured to acquire a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met; the first receiving module is configured to receive second quality evaluation parameters sent by other terminals except the target terminal in a terminal network where the target terminal is located; the first determining module is configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter; the first sending module is configured to send audio data to be identified to a server under the condition that the target terminal is determined to be the data uploading terminal, so that the server can conduct voice recognition on the audio data to be identified, the audio data to be identified are audio data collected in a target time period, the target time period comprises target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes: the first audio data acquired in the current time period comprise preset wake-up words; or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to instruct the target terminal to perform audio quality analysis on the first audio data.
Optionally, the apparatus further comprises: the second sending module is configured to send a second quality evaluation indication message to the other terminal if the first audio data comprises the preset wake-up word, wherein the second quality evaluation indication message is used for indicating the other terminal to perform audio quality analysis on second audio data; the first receiving module is configured to receive the second quality evaluation parameters sent by the other terminals according to the second quality evaluation indication message.
Optionally, the second quality evaluation parameter includes at least one quality evaluation parameter, and the first determining module is configured to take the target terminal as the data uploading terminal if the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or, if the first quality evaluation parameter is less than the preset evaluation threshold, to determine whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and to take the target terminal as the data uploading terminal if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter.
Optionally, the first determining module is configured to calculate a difference value between the target quality evaluation parameter and the first quality evaluation parameter if there is a target quality evaluation parameter greater than the first quality evaluation parameter in the second quality evaluation parameters; and if the preset number of the differences is smaller than or equal to a preset difference threshold, taking the target terminal as the data uploading terminal.
Optionally, the apparatus further comprises: the second receiving module is configured to receive wake-up data sent by the other terminals; a second determining module configured to determine whether to wake up the target terminal according to the first audio data and the wake-up data; and under the condition that the target terminal is determined to be awakened, the target terminal is used as the data uploading terminal.
Optionally, the apparatus further comprises: the echo cancellation module is configured to perform echo cancellation on the interference audio data in the first audio data to obtain target audio data; the second acquisition module is configured to acquire the first quality evaluation parameter of the target audio data.
Optionally, the apparatus further comprises: the third receiving module is configured to receive external echo audio data sent by the other terminals and time stamps corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a time stamp of the locally cached echo audio data of each frame; the echo cancellation module is configured to search corresponding audio data from the first audio data according to the obtained time stamp of the echo audio data of each frame to obtain aligned audio data; and performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, wherein the echo audio data comprises the external echo audio data and/or the locally cached echo audio data.
According to a fourth aspect of embodiments of the present disclosure, there is provided an apparatus for speech recognition, applied to a server, the apparatus comprising: the fourth receiving module is configured to receive audio data to be identified, which is sent by at least one terminal in the terminal networking; the audio quality analysis module is configured to perform audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter; a third determining module configured to determine target recognition audio data from the audio data to be recognized transmitted by at least one of the terminals according to the third quality evaluation parameter; and the voice recognition module is configured to perform voice recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the fourth receiving module is configured to receive the audio data to be identified sent by at least one data uploading terminal in the terminal network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, wherein the first quality evaluation parameter is a quality evaluation parameter obtained by performing audio quality analysis on the collected first audio data under the condition that the target terminal meets a preset trigger condition, and the second quality evaluation parameter is a parameter obtained by performing audio quality analysis on the collected second audio data by other terminals except the target terminal in the terminal networking; the audio quality analysis module is configured to perform audio quality analysis on the audio data to be identified, which is sent by each data uploading terminal, so as to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation by the terminal; the audio quality analysis module is configured to perform audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
According to a fifth aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus applied to a target terminal, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: acquiring first audio data acquired in a current time period; under the condition that a preset quality evaluation triggering condition is met, acquiring a first quality evaluation parameter of the first audio data; receiving a second quality evaluation parameter sent by other terminals except the target terminal in a terminal network where the target terminal is located; determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter; and under the condition that the target terminal is the data uploading terminal, sending audio data to be identified to a server so that the server can conduct voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which when executed by a processor perform the steps of the method of the first aspect of the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus, applied to a server, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: receiving audio data to be identified sent by at least one terminal in a terminal networking; carrying out audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; and carrying out voice recognition on the target recognition audio data.
According to an eighth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of the second aspect of the present disclosure.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects. The target terminal acquires first audio data collected in a current time period; when a preset quality evaluation triggering condition is met, it obtains a first quality evaluation parameter of the first audio data; it receives second quality evaluation parameters sent by other terminals, except the target terminal, in the terminal network where the target terminal is located; it determines whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameters; and, when the target terminal is determined to be the data uploading terminal, it sends audio data to be identified to a server so that the server performs speech recognition on the audio data to be identified, where the audio data to be identified is the audio data collected in a target time period, the target time period includes a target time and a preset time period before the target time, and the target time is the time at which the target terminal is determined to be the data uploading terminal. In this way, audio quality analysis can be performed on the target terminal side on the first audio data collected in the current time period, and whether the target terminal is the data uploading terminal is determined by combining the received analysis results of the other terminals on their collected audio data. The audio data to be identified is sent to the server only when the target terminal is determined to be the data uploading terminal, which ensures the audio quality of the uploaded audio data and thus improves the accuracy of speech recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a scenario illustrating a terminal networking according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a first speech recognition method according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a second speech recognition method according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a third speech recognition method according to an exemplary embodiment;
FIG. 5 is a block diagram of an apparatus for a first type of speech recognition, according to an example embodiment;
FIG. 6 is a block diagram of an apparatus for second speech recognition, shown according to an exemplary embodiment;
FIG. 7 is a block diagram of an apparatus for third speech recognition, shown according to an exemplary embodiment;
FIG. 8 is a block diagram of an apparatus for fourth speech recognition, shown according to an exemplary embodiment;
FIG. 9 is a block diagram of an apparatus for fifth speech recognition, shown according to an exemplary embodiment;
FIG. 10 is a block diagram of an apparatus for sixth speech recognition, shown according to an exemplary embodiment;
FIG. 11 is a block diagram of a speech recognition device, according to an example embodiment;
fig. 12 is a block diagram illustrating another speech recognition device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
First, an application scenario of the present disclosure is introduced. The present disclosure is mainly applied to a voice wake-up scenario in a distributed device networking environment, for example, a user waking up any intelligent device in a terminal network by voice and performing human-machine interaction by inputting voice information. The terminal network may be a network composed of at least two intelligent devices. For example, fig. 1 is a schematic diagram of a terminal network; as shown in fig. 1, multiple intelligent devices (such as a smart speaker, a notebook computer, a television, a watch, and a mobile phone) are bound under the same Xiaomi account and can form a network when all of them are online.
In the related art, after an intelligent device collects input audio, the audio is fed to a wake-up engine, which performs a nearby-wake-up decision; once the wake-up engine judges that the current device should be woken up, the device starts its automatic speech recognition (ASR) audio uploading function and uploads the input audio to a server, so that the server performs speech recognition on the input audio and human-machine interaction is achieved. The inventor discovered that if the location of the woken device is inconsistent with the location where the user inputs the audio instruction, or if the audio quality collected by the microphone of the woken device is worse than that collected by other devices that were not woken up, the accuracy of the server's speech recognition may be affected. In addition, to further improve the accuracy of speech recognition, in the related art, if the target terminal (i.e., any terminal in the terminal network) is currently playing music, the audio data collected by its microphone is first subjected to local echo cancellation to eliminate the interference of that music, and then noise reduction is performed; if the target terminal is not currently playing music, no echo cancellation is required, noise reduction is performed directly, and the noise-reduced audio data is transmitted to the wake-up engine for the nearby-wake-up decision. However, if multiple intelligent devices exist in the same space and several of them are playing music (i.e., other terminals in the terminal network are also playing music), the audio data collected by the target terminal may be interfered with by the music played by the other terminals; if only local echo cancellation is performed (i.e., only the audio played by the target terminal itself is removed from the audio data collected by the microphone), the accuracy of speech recognition may still be affected.
In order to solve the above problems, the present disclosure provides a speech recognition method, apparatus, and storage medium. On the target terminal side, audio quality analysis is performed on the first audio data collected in the current time period to obtain a first quality evaluation parameter, and whether the target terminal is the data uploading terminal is determined by combining the quality analysis results received from the other terminals, i.e., the second quality evaluation parameters; when the target terminal is determined to be the data uploading terminal, the audio data to be identified is sent to the server, thereby ensuring the audio quality of the uploaded audio data and improving the accuracy of speech recognition. In addition, before audio quality analysis is performed on the collected audio data and before the audio data to be identified is uploaded to the server, global echo cancellation is performed on the first audio data and on the audio data to be identified: the received audio played by the other terminals, as well as the audio played by the target terminal itself, is cancelled from the audio data collected by the microphone, and the audio data to be identified after global echo cancellation is uploaded to the server, which further improves the accuracy of speech recognition.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a speech recognition method according to an exemplary embodiment. The method is applied to a target terminal, which may be any terminal in a terminal network. As shown in fig. 2, the method includes the following steps:
In step S201, first audio data acquired in the current period of time is acquired.
In the present disclosure, audio data may be collected in real time by the target terminal, and the audio data may include one or more of wake-up audio (e.g., a wake-up word) issued by a user, instruction audio (e.g., "how is the weather today?"), and the like.
In one possible implementation, the audio data may be collected in real time through a microphone on the target terminal and written into a buffer space of the target terminal in an overwriting manner. For example, assume the buffer space of the target terminal can hold 10 seconds of audio data; after the 11th second of audio data is collected in real time, the oldest 1 second of audio data is deleted from the buffer space and the 11th second is stored, so that the buffer space then holds the audio data collected from the 2nd second to the 11th second.
The current time period may be a preset buffer duration corresponding to the buffer space of the target terminal; for example, if the buffer space is preset to hold at most 10 seconds of audio data, the current time period is the most recent 10 seconds ending at the current time, so that the target terminal can acquire the first audio data collected in the current time period from the buffer space.
In addition, in order to facilitate subsequent echo cancellation of the collected audio data, the present disclosure may record the time stamp corresponding to each frame of audio data at collection time, so that the collected audio data and the time stamp of each frame can be buffered together.
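To make the overwrite caching and per-frame time stamps concrete, here is a minimal Python sketch; the 10 ms frame length, the AudioRingBuffer name, and the use of wall-clock time stamps are assumptions made for illustration and are not prescribed by the patent:

import collections
import time

FRAME_MS = 10          # assumed frame length: 10 ms of audio per frame
BUFFER_SECONDS = 10    # assumed buffer capacity: 10 s, matching the example above

class AudioRingBuffer:
    """Keeps only the most recent BUFFER_SECONDS of audio, one entry per frame."""

    def __init__(self):
        max_frames = BUFFER_SECONDS * 1000 // FRAME_MS
        # Each entry is (timestamp_seconds, frame_bytes); the oldest frame is
        # dropped automatically once the buffer is full (overwrite caching).
        self.frames = collections.deque(maxlen=max_frames)

    def append(self, frame_bytes: bytes) -> None:
        # Record the time stamp of each frame at collection time.
        self.frames.append((time.time(), frame_bytes))

    def slice_since(self, start_time: float) -> list:
        # Return all buffered frames collected at or after start_time.
        return [(ts, f) for ts, f in self.frames if ts >= start_time]

The slice_since helper is reused further below when the audio data to be identified is cut out of the buffer.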
In step S202, in the case where a preset quality evaluation trigger condition is satisfied, a first quality evaluation parameter of the first audio data is acquired.
The first quality evaluation parameter may be a signal-to-noise ratio or an audio quality score, and the preset quality evaluation triggering condition may include: the first audio data collected in the current time period comprises a preset wake-up word (such as 'little colleague'); or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to instruct the target terminal to perform audio quality analysis on the first audio data.
Considering that the wake-up engines of different terminals in the terminal network differ in performance, a wake-up engine with better performance may recognize the preset wake-up word first. Therefore, to ensure the timeliness and accuracy of speech recognition, in the present disclosure the terminal that first recognizes the preset wake-up word can trigger the other terminals in the terminal network to perform audio quality analysis together. If another terminal in the terminal network recognizes the wake-up word in its collected second audio data before the target terminal does, that terminal sends the first quality evaluation indication message to the target terminal; the target terminal then receives the first quality evaluation indication message and is triggered to perform audio quality analysis on the first audio data accordingly.
That is, the target terminal in the present disclosure may perform audio quality analysis on the first audio data when it is determined that any of the above conditions is satisfied.
In step S203, a second quality evaluation parameter transmitted by other terminals than the target terminal in the terminal network where the target terminal is located is received.
The second quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained after performing audio quality analysis on the second audio data.
In the present disclosure, if the wake-up engine of the target terminal detects that the first audio data includes the preset wake-up word, the target terminal is the terminal in the terminal network that first recognized the preset wake-up word. In this case, the target terminal can trigger the other terminals in the terminal network to perform audio quality analysis on their collected second audio data to obtain second quality evaluation parameters, and the other terminals send these second quality evaluation parameters back to the target terminal, so that the target terminal receives them.
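The exchange described above could look roughly like the following sketch; the message classes and the broadcast/collect_replies transport helpers are hypothetical placeholders for whatever messaging the terminal network actually uses:

from dataclasses import dataclass

@dataclass
class QualityEvalIndication:
    """Asks a peer terminal to run audio quality analysis on its buffered audio."""
    sender_id: str

@dataclass
class QualityEvalReply:
    """Carries one peer's second quality evaluation parameter (e.g. an SNR value)."""
    sender_id: str
    quality: float

def request_peer_quality(network, self_id: str, timeout_s: float = 0.5) -> list:
    """Broadcast an indication message and collect the peers' quality parameters.

    `network` is assumed to expose broadcast() and collect_replies(); both stand
    in for whatever LAN messaging the devices actually use.
    """
    network.broadcast(QualityEvalIndication(sender_id=self_id))
    replies = network.collect_replies(QualityEvalReply, timeout=timeout_s)
    return [reply.quality for reply in replies]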
In step S204, it is determined whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter.
The data uploading terminal may be the terminal that sends the audio data to be identified to the server, and the audio data to be identified may include instruction audio data input by the user, such as "how is the weather today?" or "please play a children's song".
In step S205, in the case where the target terminal is determined to be the data uploading terminal, audio data to be recognized is transmitted to the server so that the server performs voice recognition on the audio data to be recognized.
The audio data to be identified is the audio data collected in a target time period, where the target time period includes a target time and a preset time period before the target time, and the target time is the time at which the target terminal is determined to be the data uploading terminal. For example, if the preset time period is set to 5 seconds, the most recent 10 seconds of collected audio data are stored in the buffer space, and the target time is t1, the target terminal can cut, from the 10 seconds of buffered audio data, the audio data collected between 5 seconds before t1 and t1 as the audio data to be identified.
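Continuing the ring-buffer sketch above, cutting the audio data to be identified out of the buffer around the target time might be expressed as follows; the 5-second window mirrors the example and the helper reuses the assumed AudioRingBuffer interface:

PRESET_WINDOW_S = 5.0  # assumed "preset time period before the target time"

def extract_audio_to_upload(buffer, target_time: float) -> list:
    """Return the frames collected in [target_time - PRESET_WINDOW_S, target_time]."""
    window = buffer.slice_since(target_time - PRESET_WINDOW_S)
    return [(ts, frame) for ts, frame in window if ts <= target_time]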
With this method, audio quality analysis can be performed on the target terminal side on the first audio data collected in the current time period, and whether the target terminal is the data uploading terminal is determined by combining the analysis results received from the other terminals. The audio data to be identified is sent to the server only when the target terminal is determined to be the data uploading terminal, which ensures the audio quality of the uploaded audio data and thus improves the accuracy of speech recognition.
FIG. 3 is a flowchart illustrating a speech recognition method according to an exemplary embodiment. The method may be applied to a server and, as shown in FIG. 3, includes the following steps:
In step S301, audio data to be identified, which is transmitted by at least one terminal within the terminal network, is received.
The audio data to be identified may include instruction audio data input by a user, such as "how is the weather today?", "please play a song", and the like.
In step S302, an audio quality analysis is performed on each of the audio data to be identified, so as to obtain a third quality evaluation parameter.
The third quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained after performing audio quality analysis on the audio data to be identified.
In this step, the server may perform audio quality analysis on the audio data to be identified in either of the following two ways to obtain the third quality evaluation parameter:
In the first mode, signal-to-noise ratio analysis is performed on the audio data to be identified to obtain the third quality evaluation parameter.
In the second mode, audio quality analysis is performed on the audio data to be identified through a pre-trained audio quality analysis model (such as a deep learning model) to obtain the third quality evaluation parameter.
It should be noted that, the specific implementation steps of the two ways may refer to descriptions in the related art, and are not repeated herein.
In step S303, target recognition audio data is determined from the audio data to be recognized transmitted by at least one of the terminals according to the third quality evaluation parameter.
The target recognition audio data may be the audio data to be identified whose quality evaluation parameter is the largest, i.e., the audio data corresponding to the maximum value among the third quality evaluation parameters.
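A rough sketch of steps S301 to S303 follows; estimate_quality stands in for whichever scoring method the server uses (signal-to-noise ratio or a trained model), and the candidate dictionary layout is an assumption made only for illustration:

def pick_target_audio(candidates: dict, estimate_quality) -> tuple:
    """candidates maps terminal_id -> audio data to be identified.

    Scores every upload with estimate_quality (the third quality evaluation
    parameter) and returns the id, audio, and score of the best one, which is
    then handed to the speech recognizer.
    """
    scored = [(estimate_quality(audio), tid, audio) for tid, audio in candidates.items()]
    best_score, best_tid, best_audio = max(scored, key=lambda item: item[0])
    return best_tid, best_audio, best_score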
In step S304, speech recognition is performed on the target recognition audio data.
With this method, the server can analyze the audio quality of the audio data to be identified uploaded by multiple terminals in the terminal network and select the target recognition audio data with the best audio quality for speech recognition, which improves both the accuracy of speech recognition and the user experience.
FIG. 4 is a flowchart illustrating a method of speech recognition, as shown in FIG. 4, according to an exemplary embodiment, the method comprising the steps of:
In step S401, the target terminal collects audio data in real time.
The target terminal may be any terminal in the terminal network, and the audio data may include one or more of wake-up audio (such as a wake-up word) issued by the user, instruction audio (such as "how is the weather today?"), and the like.
In one possible implementation, the audio data may be collected in real time through a microphone on the target terminal and written into a buffer space of the target terminal in an overwriting manner. For example, assume the buffer space of the target terminal can hold 10 seconds of audio data; after the 11th second of audio data is collected in real time, the oldest 1 second of audio data is deleted from the buffer space and the 11th second is stored, so that the buffer space then holds the audio data collected from the 2nd second to the 11th second.
In addition, in order to facilitate echo cancellation of the collected audio data, the present disclosure may record the time stamp corresponding to each frame of audio data at collection time, so that the collected audio data and the time stamp of each frame can be buffered together.
In step S402, the target terminal performs echo cancellation on the audio data acquired in real time.
In the present disclosure, echo cancellation may be performed on the audio data collected by the microphone in order to improve the accuracy of speech recognition. In the related art, if the target terminal is playing music, local echo cancellation is performed on the audio data collected by its microphone to cancel the interference of that music. The present disclosure further considers that other terminals in the terminal network, besides the target terminal, may also be playing music; in that case, if only the interference of the music played by the target terminal is eliminated, the music played by the other terminals may still degrade the quality of the audio to be identified. Therefore, to further improve the accuracy of speech recognition, echo cancellation may be performed on both the locally cached audio data (such as music played by the target terminal) and the external echo audio data (such as music played by other terminals) contained in the audio data collected by the microphone of the target terminal.
In one possible implementation, any terminal in the terminal network (including the target terminal) may record a time stamp of the start of audio playback while playing audio; for example, the start of playback can be determined by detecting whether an echo signal exists, and the start time stamp is recorded at the earliest moment the echo signal is detected (i.e., the transition from no echo signal to an echo signal being present). The terminal then records a time stamp for each frame of played audio data (for example, every 10 ms of data is one frame) at a preset interval, sends each frame and its corresponding time stamp to the other terminals in the network, and also stores each frame and its time stamp locally. Thus, in this step, the target terminal may receive the external echo audio data sent by the other terminals together with the time stamp corresponding to each frame of external echo audio data, and/or acquire the locally cached echo audio data of the target terminal together with the time stamp of each frame of locally cached echo audio data. During echo cancellation, the corresponding audio data is looked up in the first audio data collected in the current time period according to the time stamp of each frame of echo audio data, so as to obtain aligned audio data; echo cancellation is then performed on the first audio data according to the aligned audio data and the echo audio data, where the echo audio data includes the external echo audio data and/or the locally cached echo audio data. The current time period may be a preset buffer duration corresponding to the buffer space of the target terminal; for example, if the buffer space is preset to hold at most 10 seconds of audio data, the current time period is the most recent 10 seconds ending at the current time.
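The sketch below illustrates only the time-stamp alignment part of this step; the actual echo canceller is hidden behind the placeholder `cancel` callable (in practice an adaptive filter such as NLMS might be used, but the patent does not prescribe one), and the frame format reuses the illustrative ring-buffer sketch above:

from typing import Callable, Iterable, List, Tuple

Frame = Tuple[float, bytes]  # (timestamp_seconds, frame_bytes), as in the buffer sketch

def align_and_cancel(mic_frames: Iterable[Frame],
                     echo_frames: Iterable[Frame],
                     cancel: Callable[[bytes, bytes], bytes],
                     tolerance_s: float = 0.005) -> List[Frame]:
    """For each echo reference frame, find the microphone frame whose time stamp
    is closest and run the (placeholder) canceller on that aligned pair."""
    mic_by_time = dict(mic_frames)
    mic_times = sorted(mic_by_time)
    out: List[Frame] = []
    for echo_ts, echo_frame in echo_frames:
        # Nearest-timestamp lookup; a real system might resample or interpolate.
        nearest = min(mic_times, key=lambda t: abs(t - echo_ts), default=None)
        if nearest is None or abs(nearest - echo_ts) > tolerance_s:
            continue  # no aligned microphone frame within tolerance
        out.append((nearest, cancel(mic_by_time[nearest], echo_frame)))
    return out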
It should further be noted that, to ensure that the audio data can be aligned accurately based on the time stamp of each frame, and thus that echo cancellation is accurate, the clocks of the terminals in the terminal network and of the corresponding server need to be synchronized before the per-frame time stamps are recorded; for example, time synchronization may be performed through NTP (Network Time Protocol).
In addition, in order to further improve the accuracy of voice recognition, the present disclosure may further perform noise reduction processing on the audio data after echo cancellation, and the specific noise reduction method may refer to descriptions in the related art, which is not described herein.
In the related art, after local echo cancellation and noise reduction are performed on the collected audio data, only the wake-up terminal selected by the nearby-wake-up logic in the terminal network sends the collected instruction audio to the server. However, the inventor found that if the position where the user utters the wake-up audio is inconsistent with the position where the instruction audio is uttered, or if the audio quality collected by the microphone of the wake-up terminal is worse than that of other terminals that were not woken up, the audio data sent to the server by the wake-up terminal is of poor quality. To solve this technical problem, in this embodiment, audio quality analysis is performed on the audio data collected by the target terminal in the current time period by executing step S403 to obtain a first quality evaluation parameter, and the second quality evaluation parameters obtained after audio quality analysis of the audio data collected by the other terminals in the terminal network are received by executing steps S404-S405. The target terminal can then determine whether it is the data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameters, and upload the audio data to the server only when it is determined to be the data uploading terminal, thereby ensuring the quality of the uploaded audio data and further improving the accuracy of speech recognition.
In step S403, under the condition that the preset quality evaluation triggering condition is satisfied, the target terminal obtains target audio data obtained after echo cancellation is performed on the interference audio data in the first audio data, and obtains a first quality evaluation parameter of the target audio data.
The first audio data is audio data collected in the current time period, the first quality evaluation parameter may be a signal-to-noise ratio or an audio quality score, and the preset quality evaluation triggering condition may include: the first audio data collected in the current time period comprises a preset wake-up word (such as 'little colleague'); or receiving a first quality evaluation indication message sent by the other terminal, where the first quality evaluation indication message is used to instruct the target terminal to perform audio quality analysis on the first audio data.
Considering that the wake-up engines of different terminals in the terminal network differ in performance, a wake-up engine with better performance may recognize the preset wake-up word first. Therefore, to ensure the timeliness and accuracy of speech recognition, in the present disclosure the terminal that first recognizes the preset wake-up word can trigger the other terminals in the terminal network to perform audio quality analysis together. If another terminal in the terminal network recognizes the wake-up word in its collected second audio data before the target terminal does, that terminal sends the first quality evaluation indication message to the target terminal; the target terminal then receives the first quality evaluation indication message and is triggered to perform audio quality analysis on the first audio data accordingly.
That is, the target terminal in the present disclosure may perform audio quality analysis on the target audio data when it is determined that any of the above conditions is satisfied.
In addition, in this step, the audio quality analysis may be performed on the target audio data in the following two manners, to obtain the first quality evaluation parameter:
In the first mode, signal-to-noise ratio analysis is performed on the target audio data to obtain the first quality evaluation parameter (a simple sketch of such an estimate is given after the note below).
In the second mode, audio quality analysis is performed on the target audio data through a pre-trained audio quality analysis model (such as a deep learning model) to obtain the first quality evaluation parameter.
It should be noted that, the specific implementation steps of the two ways may refer to descriptions in the related art, and are not repeated herein.
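As an illustration of the first mode only, a very simple frame-energy SNR estimate is sketched below; the patent does not specify how the signal-to-noise ratio is computed, so the noise-floor heuristic here (treating the quietest frames as noise) is purely an assumption:

import numpy as np

def estimate_snr_db(samples: np.ndarray, frame_len: int = 160,
                    noise_fraction: float = 0.1) -> float:
    """Rough SNR estimate: the quietest noise_fraction of frames is taken as noise."""
    n_frames = max(1, len(samples) // frame_len)
    frames = samples[: n_frames * frame_len].reshape(n_frames, -1).astype(np.float64)
    energies = np.sort((frames ** 2).mean(axis=1))
    n_noise = max(1, int(n_frames * noise_fraction))
    noise_power = energies[:n_noise].mean() + 1e-12   # avoid division by zero
    signal_power = energies.mean() + 1e-12
    return 10.0 * np.log10(signal_power / noise_power)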
In step S404, if the first audio data includes the preset wake-up word, the target terminal sends a second quality evaluation indication message to other terminals, where the second quality evaluation indication message is used to instruct the other terminals to perform audio quality analysis on the second audio data.
In this step, if the target terminal identifies that the first audio data includes the preset wake-up word, it indicates that the target terminal is the terminal in the terminal network that first identifies the preset wake-up word, in this case, the target terminal may trigger other terminals in the network to perform audio quality analysis on the acquired second audio data, so as to obtain the second quality evaluation parameter.
In step S405, the target terminal receives the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
The second quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained after performing an audio quality analysis on the second audio data.
In step S406, the target terminal determines whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter.
The second quality evaluation parameter includes at least one quality evaluation parameter; in general, the second quality evaluation parameters correspond to the other terminals in the terminal network, with each other terminal producing one second quality evaluation parameter. The data uploading terminal is the terminal that sends the audio data to be identified to the server, and the audio data to be identified may include instruction audio data input by the user, such as "how is the weather today?", "please play a children's song", and the like.
In this step, the target terminal may be taken as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold (the preset evaluation threshold may be a minimum threshold used to characterize whether the audio quality is good or bad); or, if the first quality evaluation parameter is less than the preset evaluation threshold, it is determined whether each second quality evaluation parameter is less than or equal to the first quality evaluation parameter, and the target terminal is taken as the data uploading terminal if each second quality evaluation parameter is less than or equal to the first quality evaluation parameter.
In addition, if there is a target quality evaluation parameter among the second quality evaluation parameters that is larger than the first quality evaluation parameter, the difference between each such target quality evaluation parameter and the first quality evaluation parameter is calculated; if a preset number of these differences are less than or equal to a preset difference threshold, the target terminal is taken as the data uploading terminal.
For example, assume that the current terminal network includes terminals A, B, and C, and that terminal A is the target terminal. For convenience of description, the first quality evaluation parameter acquired by terminal A may be denoted as Xa, the second quality evaluation parameter acquired by terminal B as Xb, the second quality evaluation parameter acquired by terminal C as Xc, and the preset evaluation threshold as N. If Xa is determined to be greater than or equal to N, it may be determined that terminal A (i.e., the target terminal) is the data uploading terminal. If Xa is smaller than N, it may be further determined whether Xb and Xc are both smaller than Xa, that is, whether terminal A is the terminal whose collected audio has the best quality among the audio collected by terminals A, B, and C; if Xb and Xc are both smaller than Xa, terminal A may be used as the data uploading terminal. If at least one of Xb and Xc is greater than or equal to Xa, the difference between Xa and each target quality evaluation parameter (i.e., each of Xb and Xc that is greater than or equal to Xa) may be calculated; if a preset number of differences (in this example, the preset number may be set to 1 or 2) are each less than or equal to a preset difference threshold (in this case, it may be understood that the first quality evaluation parameter of the audio data collected by the target terminal does not differ much from the target quality evaluation parameter), terminal A may be used as the data uploading terminal. This example is merely illustrative.
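For illustration only, the terminal-selection logic of step S406 and the example above can be sketched as follows; the function and parameter names (is_data_uploading_terminal, preset_threshold, diff_threshold, preset_count) are assumptions introduced here, not terms of the present disclosure.

```python
def is_data_uploading_terminal(xa, others, preset_threshold, diff_threshold, preset_count):
    # Case 1: the target terminal's own score clears the preset evaluation threshold.
    if xa >= preset_threshold:
        return True
    # Case 2: no other terminal scored better than the target terminal.
    if all(x <= xa for x in others):
        return True
    # Case 3: some terminals scored higher (target quality evaluation parameters),
    # but a preset number of the differences stay within the preset difference threshold.
    diffs = [x - xa for x in others if x > xa]
    return len([d for d in diffs if d <= diff_threshold]) >= preset_count

# Terminals A, B, C from the example with hypothetical scores: Xa = 0.8, Xb = 0.6, Xc = 0.7, N = 0.75.
print(is_data_uploading_terminal(0.8, [0.6, 0.7], 0.75, 0.1, 1))  # True via case 1
```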
In addition, in the case where it is determined, based on the quality evaluation parameters obtained through audio quality analysis, that the target terminal is not the data uploading terminal, the present disclosure may further determine whether the target terminal is the data uploading terminal based on the wake-up terminal determined by a nearby wake-up decision, by executing steps S407-S408.
In step S407, the target terminal receives the wake-up data sent by the other terminals, and determines whether to wake up the target terminal according to the first audio data and the wake-up data.
The wake-up data may include wake-up audio feature data, such as MFCC (Mel-Frequency Cepstral Coefficient) feature values.
In this step, whether to wake up the target terminal may be determined according to the first audio data and the wake-up data based on the distributed wake-up decision mode, and the specific implementation steps may refer to descriptions in the related art, which are not described herein.
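As one possible way of producing such wake-up audio feature data, the sketch below extracts MFCC features with the librosa library; the choice of library, the 16 kHz sample rate, and the use of a per-utterance mean vector are assumptions for illustration and are not prescribed by the present disclosure.

```python
import librosa

def wake_features(wav_path: str, n_mfcc: int = 13):
    """Return a compact MFCC feature vector that terminals could exchange as wake-up data."""
    signal, sr = librosa.load(wav_path, sr=16000)             # mono audio resampled to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                                   # average over time frames
```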
In step S408, if the target terminal has determined, according to the first quality evaluation parameter and the second quality evaluation parameter, that it is not the data uploading terminal, but determines that the target terminal is to be woken up, the target terminal is taken as the data uploading terminal.
In step S409, in the case where the target terminal is determined to be the data uploading terminal, the target terminal sends the audio data to be recognized to the server, so that the server performs voice recognition on the audio data to be recognized.
The audio data to be identified may include instruction audio data input by a user, such as "how is the weather today", "please play a song", and the like.
It should be noted that the target terminal may acquire, from the buffer space, the audio data collected in the target time period as the audio data to be identified, where the target time period includes a target time and a preset time period before the target time, and the target time may be the time when the target terminal is determined to be the data uploading terminal.
For example, the preset time period may be set to 5 seconds. Assuming that the most recently collected 10 seconds of audio data are stored in the buffer space and that the target time is time t1, the target terminal may intercept, from the 10 seconds of audio data buffered in the buffer space, the audio data collected between 5 seconds before time t1 and time t1 as the audio data to be identified.
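A minimal sketch of this buffering behavior, assuming 16 kHz mono samples and the 10-second buffer / 5-second preset time period of the example; the sample rate, names, and timing bookkeeping are illustrative assumptions.

```python
from collections import deque

SR = 16000                                   # assumed sample rate
ring = deque(maxlen=SR * 10)                 # rolling buffer holding the latest 10 s of samples

def on_new_samples(samples):
    ring.extend(samples)                     # older samples fall out automatically

def audio_to_be_identified(buffer_end_time: float, t1: float, preset_seconds: int = 5):
    """Cut the window [t1 - preset_seconds, t1] out of the rolling buffer.

    buffer_end_time is the capture time of the newest buffered sample; t1 is the
    moment the terminal was determined to be the data uploading terminal.
    """
    buffered = list(ring)
    newer_than_t1 = int(max(buffer_end_time - t1, 0.0) * SR)
    end = len(buffered) - newer_than_t1
    start = max(end - preset_seconds * SR, 0)
    return buffered[start:end]
```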
It should be noted that, after collecting the audio data, the target terminal may perform echo cancellation (both local echo cancellation and external echo cancellation) on the collected audio data. The audio data to be identified may therefore be the echo-cancelled audio data, and the target terminal may send the echo-cancelled audio data to the server, thereby improving the accuracy of the server's voice recognition.
In step S410, the server performs audio quality analysis on the audio data to be identified sent by each data uploading terminal, to obtain the third quality evaluation parameter.
The third quality evaluation parameter may include a signal-to-noise ratio or an audio quality score obtained after performing audio quality analysis on the audio data to be identified.
In step S411, the server determines target recognition audio data from the audio data to be recognized transmitted by at least one of the data uploading terminals according to the third quality evaluation parameter.
For example, assume that after each terminal in the terminal network performs steps S401-S408, terminals A, B, and C are all determined to be data uploading terminals. The server may then receive the audio data to be identified sent by these three terminals. To further improve the accuracy of voice recognition, the server may perform quality analysis again on the audio data to be identified sent by the three terminals to obtain the third quality evaluation parameters. Based on the third quality evaluation parameters, the target recognition audio data with the best audio quality is determined from the audio data to be identified sent by the three terminals, and voice recognition is then performed on that target recognition audio data, which can significantly improve the accuracy of voice recognition.
In step S412, the server performs speech recognition on the target recognition audio data.
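The server-side flow of steps S410-S412 can be sketched as below; quality_score and recognize stand in for the audio quality analysis and speech recognition back ends, which the present disclosure does not specify.

```python
def handle_uploads(uploads, quality_score, recognize):
    """uploads maps a data uploading terminal's id to the audio data it sent."""
    # Third quality evaluation parameter for each uploaded clip.
    scored = {terminal_id: quality_score(audio) for terminal_id, audio in uploads.items()}
    best_terminal = max(scored, key=scored.get)       # clip with the best audio quality
    return recognize(uploads[best_terminal])          # recognize only the best clip
```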
By adopting the method, audio quality analysis can be performed on the first audio data collected in the current time period on the target terminal side to obtain the first quality evaluation parameter, and whether the target terminal is a data uploading terminal is determined in combination with the quality analysis results received from the other terminals, namely the second quality evaluation parameters. The audio data to be identified is sent to the server only when the target terminal is the data uploading terminal, so the audio quality of the audio data to be identified uploaded by the data uploading terminal is ensured and the accuracy of voice recognition is improved. In addition, before audio quality analysis is performed on the collected audio data and before the audio data to be identified is uploaded to the server, global echo cancellation is performed on the first audio data and the audio data to be identified: the received audio data played by other terminals and the audio data played by the target terminal are all cancelled from the audio data collected by the microphone, and the audio data to be identified after global echo cancellation is uploaded to the server, which further improves the accuracy of voice recognition.
Fig. 5 is a block diagram of an apparatus for voice recognition according to an exemplary embodiment, applied to a target terminal; as shown in fig. 5, the apparatus includes:
a first obtaining module 501 configured to obtain first audio data collected in a current time period;
a second obtaining module 502, configured to obtain a first quality evaluation parameter of the first audio data if a preset quality evaluation trigger condition is satisfied;
a first receiving module 503 configured to receive a second quality evaluation parameter sent by other terminals except the target terminal in the terminal network where the target terminal is located;
a first determining module 504 configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter;
the first sending module 505 is configured to send, when it is determined that the target terminal is the data uploading terminal, audio data to be identified to a server, so that the server performs voice recognition on the audio data to be identified, where the audio data to be identified is audio data collected in a target time period, and the target time period includes a target time and a preset time period before the target time, and the target time is a time when the target terminal is determined to be the data uploading terminal.
Optionally, the preset quality evaluation triggering condition includes:
the first audio data collected in the current time period comprises a preset wake-up word; or alternatively
And receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
Optionally, fig. 6 is a block diagram of a speech recognition apparatus according to the embodiment shown in fig. 5, and as shown in fig. 6, the apparatus further includes:
a second sending module 506, configured to send a second quality evaluation indication message to the other terminal if the first audio data includes the preset wake-up word, where the second quality evaluation indication message is used to instruct the other terminal to perform audio quality analysis on the second audio data;
the first receiving module 503 is configured to receive the second quality evaluation parameter sent by the other terminal according to the second quality evaluation indication message.
Optionally, the second quality evaluation parameter comprises at least one quality evaluation parameter,
The first determining module 504 is configured to take the target terminal as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or if the first quality evaluation parameter is smaller than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
Optionally, the first determining module 504 is configured to calculate a difference between the target quality evaluation parameter and the first quality evaluation parameter if there is a target quality evaluation parameter greater than the first quality evaluation parameter in the second quality evaluation parameter; and if the preset number of the differences is smaller than or equal to a preset difference threshold, taking the target terminal as the data uploading terminal.
Optionally, fig. 7 is a block diagram of a speech recognition apparatus according to the embodiment shown in fig. 5, and as shown in fig. 7, the apparatus further includes:
a second receiving module 507 configured to receive wake-up data sent by the other terminal;
A second determining module 508 configured to determine whether to wake the target terminal based on the first audio data and the wake-up data; and under the condition that the target terminal is determined to be awakened, the target terminal is used as the data uploading terminal.
Optionally, fig. 8 is a block diagram of a voice recognition apparatus according to the embodiment shown in fig. 5, and as shown in fig. 8, the apparatus further includes:
an echo cancellation module 509 configured to perform echo cancellation on the interference audio data in the first audio data to obtain target audio data;
The second obtaining module 502 is configured to obtain the first quality evaluation parameter of the target audio data.
Optionally, fig. 9 is a block diagram of a voice recognition apparatus according to the embodiment shown in fig. 8, and as shown in fig. 9, the apparatus further includes:
A third receiving module 510, configured to receive external echo audio data sent by the other terminal and a timestamp corresponding to each frame of external echo audio data; and/or acquiring the locally cached echo audio data of the target terminal and a time stamp of the locally cached echo audio data of each frame;
The echo cancellation module 509 is configured to find corresponding audio data from the first audio data according to the obtained time stamp of each frame of echo audio data, so as to obtain aligned audio data; and performing echo cancellation on the first audio data according to the aligned audio data and echo audio data, wherein the echo audio data comprises the external echo audio data and/or the locally cached echo audio data.
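As an illustration of what such timestamp alignment and cancellation might look like, the sketch below pairs each echo (reference) frame with the microphone frame whose timestamp is nearest and subtracts a least-squares-scaled copy. This simple per-frame scheme is an assumption for illustration; an actual implementation would more likely use an adaptive filter, which the disclosure does not detail.

```python
import numpy as np

def align_frames(mic_frames, mic_ts, echo_frames, echo_ts):
    """Pair each echo frame with the microphone frame whose timestamp is nearest."""
    mic_ts = np.asarray(mic_ts)
    pairs = []
    for frame, ts in zip(echo_frames, echo_ts):
        idx = int(np.argmin(np.abs(mic_ts - ts)))
        pairs.append((np.asarray(mic_frames[idx], dtype=np.float64),
                      np.asarray(frame, dtype=np.float64)))
    return pairs

def cancel_echo(aligned_pairs):
    """Subtract a least-squares-scaled echo estimate from each aligned microphone frame."""
    cleaned = []
    for mic, echo in aligned_pairs:
        gain = np.dot(mic, echo) / (np.dot(echo, echo) + 1e-12)  # scale that best explains the echo in the mic frame
        cleaned.append(mic - gain * echo)
    return cleaned
```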
By adopting the device, audio quality analysis can be performed on the first audio data collected in the current time period on the target terminal side to obtain the first quality evaluation parameter, and whether the target terminal is a data uploading terminal is determined in combination with the quality analysis results received from the other terminals, namely the second quality evaluation parameters. The audio data to be identified is sent to the server only when the target terminal is the data uploading terminal, so the audio quality of the audio data to be identified uploaded by the data uploading terminal is ensured and the accuracy of voice recognition is improved. In addition, before audio quality analysis is performed on the collected audio data and before the audio data to be identified is uploaded to the server, global echo cancellation is performed on the first audio data and the audio data to be identified: the received audio data played by other terminals and the audio data played by the target terminal are all cancelled from the audio data collected by the microphone, and the audio data to be identified after global echo cancellation is uploaded to the server, which further improves the accuracy of voice recognition.
Fig. 10 is a block diagram of an apparatus for voice recognition, applied to a server, according to an exemplary embodiment of the present disclosure, as shown in fig. 10, the apparatus comprising:
a fourth receiving module 1001 configured to receive audio data to be identified sent by at least one terminal in the terminal network;
The audio quality analysis module 1002 is configured to perform audio quality analysis on each piece of audio data to be identified, so as to obtain a third quality evaluation parameter;
A third determining module 1003 configured to determine target recognition audio data from the audio data to be recognized transmitted by at least one of the terminals according to the third quality evaluation parameter;
The voice recognition module 1004 is configured to perform voice recognition on the target recognition audio data.
Optionally, the terminal includes a data uploading terminal, and the fourth receiving module 1001 is configured to receive the audio data to be identified sent by at least one data uploading terminal in the terminal network; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, wherein the first quality evaluation parameter is a quality evaluation parameter obtained by the target terminal after performing audio quality analysis on the collected first audio data under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained after other terminals in the terminal network, except the target terminal, perform audio quality analysis on the collected second audio data;
The audio quality analysis module 1002 is configured to perform audio quality analysis on the audio data to be identified sent by each data uploading terminal, so as to obtain the third quality evaluation parameter.
Optionally, the audio data to be identified includes echo cancellation audio data obtained after echo cancellation by the terminal;
The audio quality analysis module 1002 is configured to perform audio quality analysis on each of the echo cancellation audio data, to obtain the third quality evaluation parameter.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be repeated herein.
By adopting the device, the server can analyze the audio quality of the audio data to be identified uploaded by a plurality of terminals in the terminal networking, and select the target identification audio data with the best audio quality for voice identification, so that the accuracy of voice identification can be improved, and the use experience of a user can be improved.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of speech recognition provided by the present disclosure.
Fig. 11 is a block diagram illustrating an apparatus 1100 for speech recognition according to an example embodiment. For example, apparatus 1100 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 11, apparatus 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114, and a communication component 1116.
The processing component 1102 generally controls overall operation of the apparatus 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1102 may include one or more processors 1120 to execute instructions to perform all or part of the steps of the method of speech recognition described above. Further, the processing component 1102 can include one or more modules that facilitate interactions between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
Memory 1104 is configured to store various types of data to support operations at apparatus 1100. Examples of such data include instructions for any application or method operating on the device 1100, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1104 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 1106 provides power to the various components of the device 1100. The power components 1106 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 1100.
Multimedia component 1108 includes a screen that provides an output interface between the device 1100 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, multimedia component 1108 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 1100 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1110 is configured to output and/or input an audio signal. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the device 1100 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio component 1110 further comprises a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1114 includes one or more sensors for providing status assessment of various aspects of the apparatus 1100. For example, the sensor assembly 1114 may detect the on/off state of the device 1100 and the relative positioning of components, such as the display and keypad of the device 1100. The sensor assembly 1114 may also detect a change in position of the device 1100 or a component of the device 1100, the presence or absence of user contact with the device 1100, the orientation or acceleration/deceleration of the device 1100, and a change in temperature of the device 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate communication between the apparatus 1100 and other devices in a wired or wireless manner. The device 1100 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1116 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 1116 further includes a Near Field Communication (NFC) module to facilitate short range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described voice recognition method.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory 1104 including instructions executable by the processor 1120 of the apparatus 1100 to perform the above-described speech recognition method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described speech recognition method when executed by the programmable apparatus.
Fig. 12 is a block diagram illustrating an apparatus 1200 for speech recognition according to an example embodiment. For example, apparatus 1200 may be provided as a server. Referring to fig. 12, apparatus 1200 includes a processing component 1222 that further includes one or more processors, and memory resources represented by memory 1232 for storing instructions, such as applications, executable by processing component 1222. The application programs stored in memory 1232 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1222 is configured to execute instructions to perform the speech recognition method described above.
The apparatus 1200 may also include a power component 1226 configured to perform power management of the apparatus 1200, a wired or wireless network interface 1250 configured to connect the apparatus 1200 to a network, and an input/output (I/O) interface 1258. The apparatus 1200 may operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A voice recognition method, applied to a target terminal, comprising:
acquiring first audio data acquired in a current time period;
Under the condition that a preset quality evaluation triggering condition is met, acquiring a first quality evaluation parameter of the first audio data;
receiving a second quality evaluation parameter sent by other terminals except the target terminal in a terminal network where the target terminal is located;
determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter;
Under the condition that the target terminal is the data uploading terminal, sending audio data to be identified to a server so that the server can conduct voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal;
the method further comprises the steps of:
If the first audio data comprises a preset wake-up word, sending a second quality evaluation indication message to the other terminals, wherein the second quality evaluation indication message is used for indicating the other terminals to perform audio quality analysis on second audio data acquired by the other terminals;
The receiving the second quality evaluation parameters sent by other terminals except the target terminal in the terminal network where the target terminal is located includes: receiving the second quality evaluation parameters sent by the other terminals according to the second quality evaluation indication message; the second quality evaluation parameter is a parameter obtained after the other terminals perform audio quality analysis on the second audio data;
The second quality evaluation parameter includes at least one quality evaluation parameter, and the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter includes:
Under the condition that the first quality evaluation parameter is larger than or equal to a preset evaluation threshold, the target terminal is used as the data uploading terminal; or alternatively
And under the condition that the first quality evaluation parameters are smaller than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, and taking the target terminal as the data uploading terminal if each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter.
2. The method of claim 1, wherein the preset quality assessment triggering condition comprises:
The first audio data acquired in the current time period comprise preset wake-up words; or alternatively
And receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
3. The method of claim 1, wherein the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter further comprises:
If the second quality evaluation parameter has a target quality evaluation parameter which is larger than the first quality evaluation parameter, calculating a difference value between the target quality evaluation parameter and the first quality evaluation parameter;
and if the preset number of the differences is smaller than or equal to a preset difference threshold, taking the target terminal as the data uploading terminal.
4. The method according to claim 1, wherein the method further comprises:
receiving wake-up data sent by the other terminals;
determining whether to wake up the target terminal according to the first audio data and the wake-up data;
And under the condition that the target terminal is determined to be awakened, the target terminal is used as the data uploading terminal.
5. The method according to any one of claims 1 to 4, wherein prior to said obtaining the first quality evaluation parameter of the first audio data, the method further comprises:
echo cancellation is carried out on the interference audio data in the first audio data to obtain target audio data;
The obtaining the first quality evaluation parameter of the first audio data includes:
the first quality evaluation parameter of the target audio data is acquired.
6. The method of claim 5, wherein prior to said echo cancelling interfering audio data in said first audio data, said method further comprises:
Receiving external echo audio data sent by the other terminals and a time stamp corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a time stamp of the locally cached echo audio data of each frame;
The echo cancellation of the interfering audio data in the first audio data includes:
Searching corresponding audio data from the first audio data according to the obtained time stamp of each frame of the echo audio data to obtain aligned audio data;
And performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, wherein the echo audio data comprises the external echo audio data and/or the locally cached echo audio data.
7. A method of speech recognition, applied to a server, the method comprising:
receiving audio data to be identified sent by at least one terminal in a terminal networking;
carrying out audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter;
determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter;
performing voice recognition on the target recognition audio data;
the terminal comprises a data uploading terminal, and the receiving of audio data to be identified sent by at least one terminal in the terminal networking comprises:
Receiving the audio data to be identified sent by at least one data uploading terminal in the terminal networking; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by the target terminal after performing audio quality analysis on the collected first audio data under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained after other terminals in the terminal networking, except the target terminal, perform audio quality analysis on the collected second audio data.
8. The method of claim 7, wherein
The audio quality analysis is carried out on each piece of audio data to be identified, and the obtaining of the third quality evaluation parameter comprises the following steps:
and carrying out audio quality analysis on the audio data to be identified, which are sent by each data uploading terminal, so as to obtain the third quality evaluation parameter.
9. The method according to claim 7 or 8, wherein the audio data to be identified comprises echo cancelled audio data obtained after echo cancellation by the terminal;
the audio quality analysis is carried out on each piece of audio data to be identified, and the obtaining of the third quality evaluation parameter comprises the following steps:
and carrying out audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
10. A voice recognition apparatus, applied to a target terminal, comprising:
The first acquisition module is configured to acquire first audio data acquired in the current time period;
the second acquisition module is configured to acquire a first quality evaluation parameter of the first audio data under the condition that a preset quality evaluation triggering condition is met;
The first receiving module is configured to receive second quality evaluation parameters sent by other terminals except the target terminal in a terminal network where the target terminal is located;
The first determining module is configured to determine whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter;
The first sending module is configured to send audio data to be identified to a server under the condition that the target terminal is determined to be the data uploading terminal, so that the server carries out voice recognition on the audio data to be identified, the audio data to be identified is audio data collected in a target time period, the target time period comprises target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal;
The apparatus further comprises:
The second sending module is configured to send a second quality evaluation indication message to the other terminal if the first audio data comprises a preset wake-up word, wherein the second quality evaluation indication message is used for indicating the other terminal to perform audio quality analysis on second audio data acquired by the other terminal;
The first receiving module is configured to receive the second quality evaluation parameters sent by the other terminals according to the second quality evaluation indication message; the second quality evaluation parameter is a parameter obtained after the other terminals perform audio quality analysis on the second audio data;
The second quality evaluation parameter comprises at least one quality evaluation parameter, and the first determining module is configured to take the target terminal as the data uploading terminal when the first quality evaluation parameter is greater than or equal to a preset evaluation threshold; or if the first quality evaluation parameter is smaller than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, and if each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, taking the target terminal as the data uploading terminal.
11. The apparatus of claim 10, wherein the preset quality assessment triggering condition comprises:
The first audio data acquired in the current time period comprise preset wake-up words; or alternatively
And receiving a first quality evaluation indication message sent by the other terminal, wherein the first quality evaluation indication message is used for indicating the target terminal to perform audio quality analysis on the first audio data.
12. The apparatus of claim 10, wherein the first determination module is configured to calculate a difference between the target quality evaluation parameter and the first quality evaluation parameter if there is a target quality evaluation parameter in the second quality evaluation parameter that is greater than the first quality evaluation parameter; and if the preset number of the differences is smaller than or equal to a preset difference threshold, taking the target terminal as the data uploading terminal.
13. The apparatus of claim 10, wherein the apparatus further comprises:
The second receiving module is configured to receive wake-up data sent by the other terminals;
A second determining module configured to determine whether to wake up the target terminal according to the first audio data and the wake-up data; and under the condition that the target terminal is determined to be awakened, the target terminal is used as the data uploading terminal.
14. The apparatus according to any one of claims 10 to 13, further comprising:
The echo cancellation module is configured to perform echo cancellation on the interference audio data in the first audio data to obtain target audio data;
The second acquisition module is configured to acquire the first quality evaluation parameter of the target audio data.
15. The apparatus of claim 14, wherein the apparatus further comprises:
the third receiving module is configured to receive external echo audio data sent by the other terminals and time stamps corresponding to each frame of external echo audio data; and/or acquiring echo audio data locally cached by the target terminal and a time stamp of the locally cached echo audio data of each frame;
The echo cancellation module is configured to search corresponding audio data from the first audio data according to the obtained time stamp of the echo audio data of each frame to obtain aligned audio data; and performing echo cancellation on the first audio data according to the aligned audio data and the echo audio data, wherein the echo audio data comprises the external echo audio data and/or the locally cached echo audio data.
16. An apparatus for speech recognition, applied to a server, the apparatus comprising:
The fourth receiving module is configured to receive audio data to be identified, which is sent by at least one terminal in the terminal networking;
the audio quality analysis module is configured to perform audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter;
a third determining module configured to determine target recognition audio data from the audio data to be recognized transmitted by at least one of the terminals according to the third quality evaluation parameter;
a voice recognition module configured to perform voice recognition on the target recognition audio data;
The terminal comprises a data uploading terminal, and the fourth receiving module is configured to receive the audio data to be identified, which is sent by at least one data uploading terminal in the terminal networking; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by the target terminal after performing audio quality analysis on the collected first audio data under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained after other terminals in the terminal networking, except the target terminal, perform audio quality analysis on the collected second audio data.
17. The apparatus of claim 16, wherein the audio quality analysis module is configured to perform audio quality analysis on the audio data to be identified sent by each data uploading terminal, to obtain the third quality evaluation parameter.
18. The apparatus according to claim 16 or 17, wherein the audio data to be identified comprises echo cancelled audio data obtained after echo cancellation by the terminal;
the audio quality analysis module is configured to perform audio quality analysis on each echo cancellation audio data to obtain the third quality evaluation parameter.
19. A voice recognition apparatus, applied to a target terminal, comprising:
A processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to:
Acquiring first audio data acquired in a current time period; under the condition that a preset quality evaluation triggering condition is met, acquiring a first quality evaluation parameter of the first audio data; receiving a second quality evaluation parameter sent by other terminals except the target terminal in a terminal network where the target terminal is located; determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter; under the condition that the target terminal is the data uploading terminal, sending audio data to be identified to a server so that the server can conduct voice identification on the audio data to be identified, wherein the audio data to be identified is audio data collected in a target time period, the target time period comprises target time and a preset time period before the target time, and the target time is the time when the target terminal is determined to be the data uploading terminal;
The processor is further configured to:
If the first audio data comprises a preset wake-up word, sending a second quality evaluation indication message to the other terminals, wherein the second quality evaluation indication message is used for indicating the other terminals to perform audio quality analysis on second audio data acquired by the other terminals;
Receiving the second quality evaluation parameters sent by the other terminals according to the second quality evaluation indication message; the second quality evaluation parameter is a parameter obtained after the other terminals perform audio quality analysis on the second audio data;
The second quality evaluation parameter includes at least one quality evaluation parameter, and the determining whether the target terminal is a data uploading terminal according to the first quality evaluation parameter and the second quality evaluation parameter includes:
Under the condition that the first quality evaluation parameter is larger than or equal to a preset evaluation threshold, the target terminal is used as the data uploading terminal; or alternatively
And under the condition that the first quality evaluation parameters are smaller than or equal to the preset evaluation threshold, determining whether each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter, and taking the target terminal as the data uploading terminal if each second quality evaluation parameter is smaller than or equal to the first quality evaluation parameter.
20. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 6.
21. A speech recognition device, for use with a server, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to: receiving audio data to be identified sent by at least one terminal in a terminal networking; carrying out audio quality analysis on each piece of audio data to be identified to obtain a third quality evaluation parameter; determining target identification audio data from the audio data to be identified sent by at least one terminal according to the third quality evaluation parameter; performing voice recognition on the target recognition audio data; the terminal comprises a data uploading terminal, and the receiving of audio data to be identified sent by at least one terminal in the terminal networking comprises:
Receiving the audio data to be identified sent by at least one data uploading terminal in the terminal networking; the data uploading terminal is a terminal determined by a target terminal according to a first quality evaluation parameter and a second quality evaluation parameter, the first quality evaluation parameter is a quality evaluation parameter obtained by the target terminal after performing audio quality analysis on the collected first audio data under the condition that a preset trigger condition is met, and the second quality evaluation parameter is a parameter obtained after other terminals in the terminal networking, except the target terminal, perform audio quality analysis on the collected second audio data.
22. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 7 to 9.
CN202011282047.3A 2020-11-16 2020-11-16 Speech recognition method, device and storage medium Active CN112489653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011282047.3A CN112489653B (en) 2020-11-16 2020-11-16 Speech recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011282047.3A CN112489653B (en) 2020-11-16 2020-11-16 Speech recognition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN112489653A CN112489653A (en) 2021-03-12
CN112489653B true CN112489653B (en) 2024-04-26

Family

ID=74931253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011282047.3A Active CN112489653B (en) 2020-11-16 2020-11-16 Speech recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112489653B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223538B (en) * 2021-04-01 2022-05-03 北京百度网讯科技有限公司 Voice wake-up method, device, system, equipment and storage medium
CN113380248B (en) * 2021-06-11 2024-06-25 北京声智科技有限公司 Voice control method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107134279B (en) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 Voice awakening method, device, terminal and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015041986A (en) * 2013-08-23 2015-03-02 エヌ・ティ・ティ・コミュニケーションズ株式会社 Communication terminal, quality index retention server, quality index output system, quality index output method, quality index retention method and computer program
WO2020042993A1 (en) * 2018-08-29 2020-03-05 阿里巴巴集团控股有限公司 Voice control method, apparatus and system
CN109460891A (en) * 2018-09-25 2019-03-12 平安科技(深圳)有限公司 Data processing method, device and computer equipment based on satisfaction evaluation
CN111341317A (en) * 2020-02-19 2020-06-26 Oppo广东移动通信有限公司 Method and device for evaluating awakening audio data, electronic equipment and medium
CN111522592A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Intelligent terminal awakening method and device based on artificial intelligence
CN111722824A (en) * 2020-05-29 2020-09-29 北京小米松果电子有限公司 Voice control method, device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Real-time speech quality evaluation system based on a sound card (基于声卡的实时语音质量评价系统); 刘文旭; 沈庭芝; 田卉; 刘朋樟; 电声技术 (Audio Engineering) (11); full text *

Also Published As

Publication number Publication date
CN112489653A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
WO2021031308A1 (en) Audio processing method and device, and storage medium
CN111696553B (en) Voice processing method, device and readable medium
CN106128478B (en) Voice broadcast method and device
CN111968635B (en) Speech recognition method, device and storage medium
CN109087650B (en) Voice wake-up method and device
EP3657497A1 (en) Method and device for selecting target beam data from a plurality of beams
CN112489653B (en) Speech recognition method, device and storage medium
CN112185388B (en) Speech recognition method, device, equipment and computer readable storage medium
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN112133302B (en) Method, device and storage medium for pre-waking up terminal
CN111009239A (en) Echo cancellation method, echo cancellation device and electronic equipment
CN111007462A (en) Positioning method, positioning device, positioning equipment and electronic equipment
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN111580773A (en) Information processing method, device and storage medium
CN112509596B (en) Wakeup control method, wakeup control device, storage medium and terminal
US20170034347A1 (en) Method and device for state notification and computer-readable storage medium
CN110213062B (en) Method and device for processing message
CN113936697A (en) Voice processing method and device for voice processing
CN106598445A (en) Method and device for outputting communication message
CN112489650B (en) Wakeup control method, wakeup control device, storage medium and terminal
CN112866480B (en) Information processing method, information processing device, electronic equipment and storage medium
CN110928589B (en) Information processing method, device and storage medium
CN112201236A (en) Terminal awakening method and device and computer readable storage medium
CN111968680B (en) Voice processing method, device and storage medium
CN112732098A (en) Input method and related device

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant