CN115132197A - Data processing method, data processing apparatus, electronic device, program product, and medium - Google Patents

Data processing method, data processing apparatus, electronic device, program product, and medium

Info

Publication number
CN115132197A
CN115132197A
Authority
CN
China
Prior art keywords
command word
time window
voice
voice data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210597334.6A
Other languages
Chinese (zh)
Other versions
CN115132197B (en)
Inventor
陈杰
苏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210597334.6A priority Critical patent/CN115132197B/en
Publication of CN115132197A publication Critical patent/CN115132197A/en
Application granted granted Critical
Publication of CN115132197B publication Critical patent/CN115132197B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/027 Syllables being the recognition units
    • G10L2015/0638 Interactive procedures
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application discloses a data processing method, a data processing apparatus, an electronic device, a program product, and a medium, which can be applied to the technical field of data processing. The method comprises the following steps: determining whether the voice data of a target time window hits a command word according to the audio features corresponding to the voice data of K voice frames in the target time window; when the voice data of the target time window hits a command word, determining a verification time window associated with the current voice frame; determining a first confidence corresponding to the voice data in the verification time window and each command word, and determining an associated feature corresponding to the verification time window; and determining the hit result command word based on the first confidence corresponding to each command word and the associated features. By adopting the method and the apparatus, the accuracy of command word detection on voice data is improved. The embodiment of the application can also be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic, driving assistance, and intelligent household appliances.

Description

Data processing method, data processing apparatus, electronic device, program product, and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, a program product, and a medium.
Background
Currently, voice detection technology is widely applied, and many intelligent devices (such as vehicle-mounted systems, intelligent sound boxes, intelligent home appliances, and the like) provide a voice detection function: such devices can receive instructions issued in voice form, detect the instructions from the received voice data, and execute the corresponding operations. However, the inventors have found in practice that when instructions carried in voice data are detected, the accuracy of command word detection is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, electronic equipment, a program product and a medium, which are beneficial to improving the accuracy of command word detection of voice data.
In one aspect, an embodiment of the present application discloses a data processing method, where the method includes:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining whether the voice data of the target time window hits a command word in a command word set according to audio characteristics corresponding to the voice data of the K voice frames respectively, wherein the command word set comprises at least one command word;
when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame;
determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window respectively, and determining the associated characteristics corresponding to the verification time window based on the voice data of the voice frames in the verification time window;
and determining the result command words hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the associated characteristics.
In one aspect, an embodiment of the present application discloses a data processing apparatus, where the apparatus includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for determining a target time window corresponding to a current voice frame and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, and K is a positive integer;
the processing unit is used for determining whether the voice data of the target time window hits a command word in a command word set according to the audio characteristics corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
the processing unit is further configured to determine a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the command word set;
the processing unit is further configured to determine, according to audio features respectively corresponding to the voice data of the multiple voice frames in the verification time window, first confidence degrees respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determine, based on the voice data in the verification time window, an associated feature corresponding to the verification time window;
the processing unit is further configured to determine a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristic.
In one aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to perform the following steps:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining whether the voice data of the target time window hits a command word in a command word set according to audio characteristics corresponding to the voice data of the K voice frames respectively, wherein the command word set comprises at least one command word;
when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame;
determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window respectively, and determining an associated characteristic corresponding to the verification time window based on the voice data in the verification time window;
and determining the result command words hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the associated characteristics.
In one aspect, an embodiment of the present application provides a computer-readable storage medium, in which computer program instructions are stored, and when the computer program instructions are executed by a processor, the computer program instructions are configured to perform the following steps:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining whether the voice data of the target time window hits a command word in a command word set according to audio characteristics corresponding to the voice data of the K voice frames respectively, wherein the command word set comprises at least one command word;
when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame;
determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window respectively, and determining an associated characteristic corresponding to the verification time window based on the voice data in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristics.
In one aspect, the present application provides a computer program product or a computer program, which includes computer instructions that, when executed by a processor, can implement the method provided by the above-mentioned aspect.
The embodiment of the present application provides a data processing scheme that implements command word detection based on primary detection and secondary verification. For example, whether the voice data of a target time window hits a command word in the command word set may be determined according to the audio features corresponding to the voice data of the K voice frames in the target time window. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, a first confidence corresponding to the voice data in the verification time window and each command word in the command word set is determined, and an associated feature corresponding to the verification time window is determined; a result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence corresponding to each command word and the associated feature. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed. In this way, after the voice data is preliminarily determined, based on the target time window, to hit a command word (primary detection), a new verification time window is determined to verify again whether the voice data contains a command word (secondary verification), and the associated feature is added during the secondary verification, so that whether the verification time window hits a command word can be determined based on more information, which improves the accuracy of command word detection on the voice data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an effect of a target time window provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of another data processing method provided in the embodiments of the present application;
FIG. 5 is a schematic flow chart diagram illustrating a further data processing method according to an embodiment of the present application;
FIG. 6 is a block diagram of a primary detection network according to an embodiment of the present disclosure;
fig. 7 is a schematic block diagram of a data processing method according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of another data processing method provided in an embodiment of the present application;
FIG. 9 is a block diagram of another data processing method provided in the embodiments of the present application;
fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the present application provides a data processing scheme that implements command word detection based on primary detection and secondary verification. For example, whether the voice data of a target time window hits a command word in the command word set may be determined according to the audio features corresponding to the voice data of the K voice frames in the target time window. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, a first confidence corresponding to the voice data in the verification time window and each command word in the command word set is determined, and an associated feature corresponding to the verification time window is determined; a result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence corresponding to each command word and the associated feature. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed. In this way, after the voice data is preliminarily determined, based on the target time window, to hit a command word (primary detection), a new verification time window is determined to verify again whether the voice data contains a command word (secondary verification), and the associated feature is added during the secondary verification, so that whether the verification time window hits a command word can be determined based on more information, which improves the accuracy of command word detection on the voice data.
In a possible implementation manner, the embodiment of the present application can be applied to a data processing system, please refer to fig. 1, and fig. 1 is a schematic structural diagram of a data processing system provided in the embodiment of the present application. As shown in FIG. 1, the data processing system may include a voice-initiated object and a data processing device. The voice initiating object may be used to send voice data to the data processing device, and the voice initiating object may be a user or a device that needs to request the data processing device to respond, and is not limited herein. The data processing device may execute the data processing scheme, and may perform corresponding operations based on the received voice data, for example, the data processing device may be an in-vehicle system, a smart speaker, a smart appliance, or the like. That is to say, after the voice data is output by the voice initiating object, the data processing device may receive the voice data, and then the data processing device may detect a command word in the voice data based on the data processing scheme, and then execute an operation corresponding to the detected command word. It is to be understood that, before the data processing apparatus detects the voice data, a command word set may be preset, where the command word set includes at least one command word, and each command word may be associated with a corresponding operation, for example, the command word "turn on air conditioner" is associated with an operation of turning on air conditioner, and when the data processing apparatus detects the voice data including the command word "turn on air conditioner", the data processing apparatus may perform the operation of turning on air conditioner. According to the data processing scheme, after the command word hit by the voice data is preliminarily determined based on the target time window, a new verification time window is determined to carry out secondary verification on whether the voice data contains the command word or not, so that the accuracy of the data processing equipment in the data processing system for detecting the command word of the voice data can be improved, and a user can conveniently and accurately instruct the data processing equipment to execute corresponding operation through voice.
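The association between preset command words and the operations they trigger can be pictured with a minimal sketch; the command words, operations, and function names below are illustrative assumptions and are not taken from the patent.

```python
# Minimal sketch (illustrative): a preset command word set in which each
# command word is associated with the operation it triggers on the device.
COMMAND_WORDS = {
    "turn on air conditioner": lambda: print("air conditioner: on"),
    "turn on heating": lambda: print("heating: on"),
    "play song": lambda: print("playback: started"),
}

def execute_if_hit(result_command_word):
    """Execute the operation associated with a detected result command word, if any."""
    action = COMMAND_WORDS.get(result_command_word)
    if action is not None:
        action()
```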
It should be noted that, before collecting user-related data and in the process of collecting user-related data, the present application may display a prompt interface or a popup window, or output voice prompt information, to prompt the user that the related data is currently being collected. The related steps of obtaining user-related data are only started after a confirmation operation on the prompt interface or the popup window is obtained from the user; otherwise (that is, when no confirmation operation on the prompt interface or the popup window is obtained from the user), the related steps of obtaining user-related data are ended, that is, the user-related data is not obtained. In other words, all user data collected in the present application is collected with the approval and authorization of the user, and the collection, use, and processing of related user data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
In one possible implementation, the embodiments of the present application may be applied in the field of Artificial Intelligence (AI). Artificial intelligence is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
In a possible implementation manner, the embodiments of the present application can also be applied in the field of speech technology, such as detecting the command word hit in voice data as described above. The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising human-computer interaction modes.
The technical solution of the present application can be applied to an electronic device, such as the data processing device described above. The electronic device may be a terminal, a server, or another device for performing data processing, and the present application is not limited thereto. Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, a cloud database, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, an aircraft, an intelligent sound box, and the like.
It is to be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
Based on the above description, the embodiments of the present application provide a data processing method. Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S201, determining a target time window corresponding to the current voice frame, and acquiring audio features corresponding to voice data of K voice frames in the target time window respectively.
The current speech frame may be any speech frame in the acquired speech data. It is understood that the obtained voice data may be real-time voice, and for real-time continuously input voice data, the current voice frame may be a latest voice frame in the continuously input voice data. The obtained voice data may also be non-real-time voice, for example, for a whole pre-generated voice data, each voice frame may also be determined as the current voice frame in sequence according to the sequence of each voice frame in the voice data.
A speech frame may include several sampling points, that is, the speech data of a number of successive sampling points forms the speech data of one speech frame. It will be appreciated that the time difference between adjacent sampling points is the same. Two adjacent speech frames may share some sampling points or may contain completely different sampling points, which is not limited herein. For example, in a segment of 10s speech data, one sample is taken every 10ms, and 20 consecutive samples are determined as one speech frame: the 1st to 20th samples are determined as one speech frame, the 21st to 40th samples are determined as one speech frame, and so on, so that a plurality of speech frames are obtained. For another example, in order to avoid excessive variation of the audio data between two adjacent speech frames, there are overlapping sampling points between two adjacent speech frames: in the 10s speech data, the 1st to 20th sampling points are determined as a speech frame, the 15th to 35th sampling points are determined as a speech frame, the 30th to 40th sampling points are determined as a speech frame, and so on, so that a plurality of speech frames are obtained.
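As a rough illustration of the framing described above, the following sketch splits a sample sequence into fixed-length frames with an optional overlap; the frame length and hop size are illustrative assumptions.

```python
import numpy as np

def split_into_frames(samples, frame_len=20, hop=15):
    """Split a 1-D sample array into frames of `frame_len` samples.
    With hop < frame_len, adjacent frames share sampling points (overlap);
    with hop == frame_len, adjacent frames contain completely different points."""
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, hop)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```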
The target time window corresponding to the current speech frame may be a time window using the current speech frame as a reference speech frame. Optionally, the target time window corresponding to the current speech frame may include the current speech frame. The target time window may include a plurality of speech frames, for example, the target time window may include K speech frames, where K is a positive integer, that is, K may be the number of all speech frames in the target time window. Optionally, the K speech frames may also be selected from all speech frames in the target time window, that is, K may be less than or equal to the number of all speech frames in the target time window, for example, after the target time window is determined, the energy of each speech frame in the target time window is calculated, and then the speech frames with the energy lower than a certain threshold are removed, so as to obtain the K speech frames, thereby filtering out some speech frames with small sound, and reducing the calculation amount in the subsequent processing process. A reference speech frame of a target time window indicates that the time window is divided based on the reference speech frame, for example, the reference speech frame may be the first speech frame, the last speech frame or the speech frame at the center of a time window, which is not limited herein. The description of the first speech frame and the last speech frame is characterized according to the time sequence, wherein the first speech frame represents the speech frame with the earliest input time in the time window, and the last speech frame represents the speech frame with the latest input time in the time window. Then, the target time window corresponding to the current speech frame may be a time window in which the current speech frame is used as the first speech frame, or may be a time window in which the current speech frame is used as the last speech frame, or may be a time window in which the current speech frame is used as a speech frame at the center position, which is not limited herein. K may be preset, or may be determined based on the length of the obtained speech, or may be determined based on the length of the command word in the command word set, such as the maximum length or the average length, and is not limited here.
Optionally, the target time window corresponding to the current speech frame may not include the current speech frame. For example, when the reference speech frame is the first speech frame of a time window, the next speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the first speech frame of the target time window is the next speech frame of the current speech frame; for another example, when the reference speech frame is the last speech frame of a time window, the previous speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the last speech frame of the target time window is the previous speech frame of the current speech frame, and so on, which is not described herein again.
In the present application, the determination of the target time window and the verification time window is mainly described by taking the case where the current speech frame is used as the last speech frame (i.e., the reference speech frame) of the corresponding target time window as an example. For example, suppose the continuously input speech data includes n speech frames 1, 2, 3, ..., n, the current speech frame is the 200th speech frame, the reference speech frame is the last speech frame of the time window, and the size of the target time window is 100 speech frames (that is, the target time window corresponding to the current speech frame includes 100 speech frames, i.e., K is 100). Then the time window of size 100 that takes the 200th speech frame as its last speech frame may be determined as the target time window corresponding to the 200th speech frame; that is, the 100 speech frames up to and including the 200th speech frame (the 101st to 200th speech frames) are determined as the speech frames in the target time window corresponding to the 200th speech frame.
As another example, the target time window is described with reference to an illustration; please refer to fig. 3, which is a schematic diagram of an effect of the target time window provided in an embodiment of the present application. As shown in (1) in fig. 3, each speech frame in the received speech data may be represented as one of the square blocks. If the gray square block shown as 301 in fig. 3 is determined as the current speech frame and the preset size of the target time window is 8 speech frames, the 8 speech frames up to and including 301 may be determined as the target time window corresponding to 301 (as shown by 302 in fig. 3). As voice data continues to be input, if no command word is hit based on the time window shown by 302, a new current voice frame may be determined by sliding the window; for example, when the window is slid by 1 frame, the voice frame following the voice frame shown by 301 may be determined as the new current voice frame (as shown by 303 in (2) of fig. 3), so that the 8 voice frames up to and including 303 may be determined as the target time window corresponding to 303 (as shown by 304 in fig. 3), and so on, thereby detecting command words in the continuously input voice data.
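A minimal sketch of selecting the target time window for a current frame and sliding it as new frames arrive might look as follows; the frame indices are 0-based and illustrative, and the helper name is an assumption.

```python
def target_window_indices(current_idx, K=8):
    """Indices of the K frames ending at (and including) the current frame, i.e. the
    target time window with the current frame as its last (reference) frame."""
    start = max(0, current_idx - K + 1)
    return list(range(start, current_idx + 1))

# The window slides by one frame per new current frame:
# target_window_indices(7) -> [0, 1, 2, 3, 4, 5, 6, 7]
# target_window_indices(8) -> [1, 2, 3, 4, 5, 6, 7, 8]
```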
When the audio features corresponding to the speech data of the K speech frames in the target time window are obtained, a corresponding audio feature may be determined for the speech data of each speech frame. In one possible implementation, the audio feature may be an FBank feature (an audio feature of speech data). Specifically, since the voice data of a voice frame is a time-domain signal, the FBank feature corresponding to that voice frame may be obtained by converting the time-domain signal into a frequency-domain signal through a Fourier transform and then determining the corresponding FBank feature based on the computed frequency-domain signal, which is not described here again. It is understood that the audio features may also be features determined in other ways, such as MFCC features (another audio feature of speech data), which is not limited herein.
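A rough sketch of the FBank computation described above (time-domain frame, Fourier transform, power spectrum, log mel filterbank) is shown below; the sampling rate, FFT size, filter count, and the use of librosa's mel filterbank are assumptions made for illustration only.

```python
import numpy as np
import librosa  # used only to build a mel filterbank; any equivalent would do

def fbank_feature(frame, sr=16000, n_fft=512, n_mels=40):
    """FBank sketch: windowed frame -> FFT -> power spectrum -> log mel filterbank energies."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ spectrum + 1e-10)  # one FBank vector per speech frame
```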
S202, determining whether the voice data of the target time window hits the command word in the command word set or not according to the audio characteristics corresponding to the voice data of the K voice frames.
Wherein the command word set includes at least one (one or more) command word, as described above. The voice data of the target time window is short for the voice data of the K voice frames in the target time window, for example, the command word hit by the voice data of the target time window in the command word set may be a command word hit by the voice data of the K voice frames of the target time window in the command word set; the command word hit in the command word set by the voice data of the target time window may also be briefly described as the command word hit in the command word set by the target time window.
In one possible implementation, step S202 may include the following steps: (1) determining a second confidence corresponding to the voice data of the target time window and each command word in the command word set according to the audio features corresponding to the voice data of the K voice frames; (2) if a command word whose second confidence is greater than or equal to a first threshold exists in the command word set, determining that the voice data of the target time window hits a command word in the command word set; (3) if no command word whose second confidence is greater than or equal to the first threshold exists in the command word set, determining that the voice data of the target time window does not hit a command word in the command word set. The second confidence may characterize the probability that the voice data of the target time window is each command word, and each command word may have a corresponding second confidence. The first threshold may be a preset threshold; in order to improve the detection accuracy of command words, a reasonable first threshold may be set to determine whether the voice data of the target time window hits a command word in the command word set. Optionally, for better performance, different first thresholds may be set for command words of different lengths, thereby balancing the detection rate and the false detection rate for command words of different command lengths. It is understood that there may be a plurality of second confidences greater than or equal to the first threshold, and each command word whose second confidence is greater than or equal to the first threshold may be a command word hit by the voice data of the target time window. For convenience of description, a command word hit in the target time window is referred to as a primary command word in the present application.
For example, if the command word set includes command word 1, command word 2, command word 3, and command word 4, a second confidence coefficient corresponding to each command word is obtained according to the audio features of K voice frames in the target time window, where the second confidence coefficient corresponding to command word 1 is 0.3, the second confidence coefficient corresponding to command word 2 is 0.75, the second confidence coefficient corresponding to command word 3 is 0.45, and the second confidence coefficient corresponding to command word 4 is 0.66, and if the first threshold is 0.6, there are command words in the command word set whose second confidence coefficients are greater than or equal to the first threshold, that is, command word 2 and command word 4.
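The primary detection decision can be sketched as below; the threshold values and the per-length threshold table are assumptions illustrating how different first thresholds could be set for command words of different lengths.

```python
def primary_detection(second_confidences, thresholds_by_length=None, default_threshold=0.6):
    """Return the command words whose second confidence reaches the first threshold.
    `second_confidences` maps command word -> second confidence for the target time window;
    `thresholds_by_length` optionally maps command-word length -> a length-specific threshold."""
    thresholds_by_length = thresholds_by_length or {}
    hits = []
    for word, confidence in second_confidences.items():
        threshold = thresholds_by_length.get(len(word), default_threshold)
        if confidence >= threshold:
            hits.append(word)  # primary command word
    return hits  # empty list means the target time window misses every command word

# With the example confidences above and a single threshold of 0.6, command word 2
# (0.75) and command word 4 (0.66) are returned as the hit primary command words.
```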
S203, when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame.
The verification time window may be a time window for performing secondary verification on the command word, and the verification time window may include a plurality of speech frames. The verification time window and the target time window may include overlapping speech frames; the speech frames they include may be partly or completely the same, which is not limited herein. The range of the verification time window associated with the current speech frame should cover, as far as possible, the speech frames of the speech data that are involved by the command word hit in the target time window.
In one possible implementation, to determine the verification time window associated with the current speech frame, a first number of speech frames preceding the current speech frame may be determined, and the verification time window is then determined based on that first number of speech frames preceding the current speech frame. The first number may be determined in several ways. Specifically, the first number may be a preset number; the first number may also be determined based on the command length (length for short) of the command word hit in the target time window; the first number may also be determined according to the earliest position at which the primary command word occurs in the target time window, which is not limited herein. It is to be understood that the present application takes the case where the current speech frame is the last frame of the target time window as an example, so the verification time window is determined here according to the first number of speech frames preceding the current speech frame; if the current speech frame is the first speech frame of the target time window, the verification time window may be determined in another manner, for example according to a first number of speech frames following the current speech frame, which is not limited herein.
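One plausible sketch of deriving the verification time window from the current frame and a first number is shown below; whether the current frame itself is included, and how the first number is chosen, follow the non-limiting options in the paragraph above, and the per-syllable frame count is an illustrative assumption.

```python
def verification_window_indices(current_idx, first_number):
    """Indices of the verification time window: the `first_number` speech frames ending at
    the current frame (one reading of 'a first number of speech frames preceding the
    current speech frame', with the current frame as the last frame of the window)."""
    start = max(0, current_idx - first_number + 1)
    return list(range(start, current_idx + 1))

def pick_first_number(primary_command_word, frames_per_syllable=25, preset=None):
    """The first number may be preset, or derived from the length of the hit primary
    command word; `frames_per_syllable` is an illustrative assumption."""
    if preset is not None:
        return preset
    return frames_per_syllable * len(primary_command_word)
```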
In a possible implementation manner, when the voice data of the target time window does not hit any command word, the subsequent operations are not performed; instead, the target time window corresponding to a new current voice frame is determined, and it is then detected whether the audio data of the new target time window hits a command word, and so on, so that the audio data of the target time window corresponding to each voice frame is checked for a hit command word. Moreover, when it is detected that the target time window does not hit any command word, the subsequent secondary verification step is simply not executed, which improves data processing efficiency.
S204, according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window, determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively, and determining the associated characteristics corresponding to the verification time window based on the voice data in the verification time window.
Here, each command word herein refers to each command word in the above-mentioned command word set. The first confidence level may characterize a likelihood that the voice data of the verification time window is each command word, and each command word may have a corresponding first confidence level.
The audio features corresponding to the speech data of the multiple speech frames in the verification time window may be FBank features, and a corresponding audio feature may be determined for the speech data of each speech frame.
In a possible implementation manner, when the electronic device receives continuously input voice data, the audio feature of each voice frame can be extracted and cached in a storage area; after the verification time window is determined, the audio features corresponding to the voice frames in the verification time window can be read directly from the storage area, so that data processing efficiency is improved without repeatedly calculating the audio features of the voice frames. It can be understood that the number of audio features cached in the storage area can be determined according to the number of voice frames in the maximum verification time window, which ensures that, for a verification time window determined based on any primary command word, the audio features of the voice frames in that window can be quickly acquired from the storage area. The maximum verification time window may be a verification time window determined based on the command length of the longest command word in the command word set. It will also be appreciated that, to avoid buffering too much data, each time a new voice frame is input the audio feature of the earliest-input voice frame can be deleted, thereby avoiding waste of storage space.
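The feature cache described above behaves like a bounded buffer sized to the maximum verification time window; a minimal sketch follows, in which the class and method names are assumptions.

```python
from collections import deque

class FeatureCache:
    """Caches one audio feature per voice frame, holding at most as many features as
    the maximum verification time window so that verification never recomputes them."""
    def __init__(self, max_window_frames):
        self._buffer = deque(maxlen=max_window_frames)

    def push(self, frame_feature):
        # When full, the feature of the earliest-input frame is dropped automatically.
        self._buffer.append(frame_feature)

    def last(self, n):
        """Features of the n most recent frames, e.g. the frames of a verification window."""
        return list(self._buffer)[-n:]
```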
The associated feature may refer to a related feature of the voice data in the verification time window, which is different from the audio feature corresponding to each voice frame.
In one possible embodiment, the associated features include at least one (one or more) of the following: a first average energy of the voice data in the verification time window, an effective voice proportion of the voice data in the verification time window, a signal-to-noise ratio of the voice data in the verification time window, and a number of voice frames in the verification time window. It is understood that other characteristics, such as the command word length of the command word hit in the target time window, etc., may also be included in the association characteristic, which is not limited herein.
Specifically, determining the associated features corresponding to the verification time window based on the voice data in the verification time window may include the following steps:
Determining a first average energy of the speech data of the verification time window based on the energy of the speech data of each speech frame in the verification time window. Here, the energy of the speech data of each speech frame in the verification time window may be determined first, and then the first average energy may be determined based on the energy of the speech data of each speech frame; for example, the first average energy of the speech data of the verification time window may be determined by the following formulas (formula 1 and formula 2):
p = \sum_{n=1}^{N} x(n)^2    (formula 1)
wherein p represents the energy of the speech data of any speech frame in the verification time window, N represents the number of sampling points in one speech frame, and x(n) represents the amplitude value of the nth sampling point in one speech frame, so that the energy of the speech data of each speech frame can be calculated according to formula 1.
P = \frac{1}{T} \sum_{t=1}^{T} p(t)    (formula 2)
Where P represents the first average energy of the speech data of the verification time window, T represents the number of speech frames within the verification time window, and p(t) represents the energy of the t-th speech frame in the verification time window, which can be calculated by formula 1 above.
The term \sum_{t=1}^{T} p(t) represents the sum of the energies of the speech frames in the verification time window. The first average energy of the voice data of the verification time window can thus be calculated by formula 2.
And secondly, determining the effective speech ratio of the voice data of the verification time window according to the number of valid speech frames in the verification time window, wherein the valid speech frames are the speech frames whose energy is greater than or equal to the first average energy. The effective speech ratio is the proportion of valid speech frames among all speech frames in the verification time window. For example, the effective speech ratio can be determined by the following formula (formula 3):
R = \frac{r}{T}    (formula 3)
wherein R represents the effective speech ratio of the speech data of the verification time window, r represents the number of valid speech frames in the verification time window, and T represents the number of speech frames in the verification time window, whereby the effective speech ratio can be calculated by formula 3.
And thirdly, determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy of the valid speech frames in the verification time window and the first average energy. The signal-to-noise ratio can be obtained by dividing the second average energy of the valid speech frames in the verification time window by the first average energy, and can specifically be determined by the following formula (formula 4):
E\text{-}SNR = \frac{M}{P}    (formula 4)
wherein, E-SNR represents the signal-to-noise ratio of the voice data of the verification time window, P represents the first average energy of the voice data of the verification time window, and M represents the second average energy of the valid voice frame in the verification time window, so that the signal-to-noise ratio can be calculated by formula 4.
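Formulas 1 to 4 above translate directly into the following sketch of the associated features; taking the frame energy as the sum of squared sample amplitudes matches the reconstruction of formula 1, and that normalization detail, like the dictionary keys returned, is an assumption.

```python
import numpy as np

def associated_features(frames):
    """Associated features of the verification time window, following formulas 1-4.
    `frames` is a list of 1-D sample arrays, one per speech frame in the window."""
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])  # formula 1, per frame
    P = energies.mean()                                  # formula 2: first average energy
    valid = energies >= P                                # valid speech frames
    R = valid.sum() / len(frames)                        # formula 3: effective speech ratio
    M = energies[valid].mean() if valid.any() else 0.0   # second average energy of valid frames
    e_snr = M / P if P > 0 else 0.0                      # formula 4: E-SNR = M / P
    return {"first_average_energy": P, "effective_speech_ratio": R,
            "energy_snr": e_snr, "num_frames": len(frames)}
```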
In a possible implementation manner, when determining the first confidence corresponding to the voice data of the verification time window and each command word, a first confidence corresponding to a garbage class may also be determined for the voice data of the verification time window; the first confidence of the garbage class represents the possibility that the voice data of the verification time window is not a command word. This is equivalent to saying that, when determining the command word hit in the verification time window, the classification categories are each command word plus the garbage class.
S205, determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree and the association characteristic corresponding to each command word.
The result command word refers to the command word hit by the voice data of the verification time window, and it belongs to the command word set. It can be understood that the premise for determining the result command word hit by the voice data of the verification time window in the command word set is that the voice data of the verification time window does hit a command word in the command word set; if the voice data of the verification time window does not hit any command word in the command word set, no result command word can be determined. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed. The voice data of the verification time window is short for the voice data of the voice frames in the verification time window; for example, a command word hit by the voice data of the verification time window in the command word set may refer to a command word hit, in the command word set, by the voice data of the multiple voice frames of the verification time window, and the result command word hit in the command word set by the voice data of the verification time window may also be briefly described as the result command word hit in the command word set by the verification time window. It can be understood that the result command word is determined according to the first confidence corresponding to each command word with the associated features additionally introduced, so that more information is available when determining the result command word; this information acts as an effective supplement to the first confidences and thereby improves the detection accuracy of command words. For example, by introducing the signal-to-noise ratio of the voice data of the verification time window, the command word hit by the voice data can be determined more accurately under different signal-to-noise ratios; by introducing the first average energy of the voice data of the verification time window, the command word hit by the voice data can be determined more accurately under different average energies; and by introducing the effective speech ratio of the voice data of the verification time window, the command word hit by the voice data can be determined more accurately under different effective speech ratios. Moreover, since the result command word is determined based on the first confidences and the associated features of the verification time window, this amounts to a secondary verification of whether the continuously input voice data hits a command word, so that the detection result for the voice data of the verification time window is used as the final detection result; if a result command word hit in the command word set is detected, the operation indicated by the result command word is executed. For example, if the result command word "turn on heating" is detected as hit by the voice data of the verification time window, the operation of turning on heating may be performed.
In one possible implementation, if the voice data of the verification time window does not hit a result command word in the command word set, no operation may be performed. The target time window of a new current voice frame can then be determined, and the above steps are repeated up to determining, based on the audio features of the voice frames in the verification time window associated with the new current voice frame, whether the voice data of that verification time window hits a command word in the command word set, and so on, so that the time window corresponding to each voice frame is detected.
In a possible embodiment, when the result command word hit in the verification time window is detected, the method may further include using the result command word for other purposes, such as training other models with the extracted command word or storing the extracted command word, which is not limited herein.
In one possible implementation, the command word may further include time information, place information, and the like, so that the operation corresponding to the detected result command word may be performed at the time indicated by the time information and at the place indicated by the place information. For example, when the detected result command word is "turn on the air conditioner at 10 o'clock", where 10 o'clock is the time information of the command word, the operation of turning on the air conditioner may be performed at 10 o'clock. Alternatively, in one possible embodiment, time information, place information, and the like in the voice may also be acquired, so that the operation corresponding to the result command word may be performed at the time indicated by the detected time information and at the place indicated by the detected place information.
The following describes, by way of example, how command word detection on voice data can be implemented; please refer to fig. 4, which is a flowchart of another data processing method provided by an embodiment of the present application. First, voice data can be received and a target time window corresponding to the current voice frame in the received voice data is determined (step S401); then whether the target time window hits a command word in the command word set is determined (step S402), which can specifically be done through the audio features of the voice data of each voice frame in the target time window. If the target time window does not hit a command word in the command word set, no operation is executed and the target time window of a new current speech frame is determined (step S403). If the target time window hits a command word in the command word set, secondary verification is performed: a verification time window associated with the current voice frame is determined (step S404), and it is further determined whether the verification time window hits a command word in the command word set (step S405); if the verification time window does not hit a command word in the command word set, no operation is performed and the target time window of a new current voice frame is determined (step S406), and if the verification time window hits a command word in the command word set, the operation indicated by the hit command word is performed (step S407). Secondary verification is thus achieved by determining a verification time window, which improves the accuracy of command word detection in the speech data.
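Putting the two stages together, the flow of fig. 4 can be sketched as follows. The `score_window` callable (e.g. a neural-network scorer returning a confidence per command word), the second-stage decision rule, and all thresholds are assumptions rather than the patent's actual models; a plain threshold on the best first confidence combined with the associated features stands in for the model-based secondary decision, and `associated_features` refers to the sketch after formula 4 above.

```python
def process_current_frame(score_window, frames, current_idx, K, first_number,
                          first_threshold=0.6, second_threshold=0.6):
    """Two-stage sketch: primary detection on the target time window, then secondary
    verification on the verification time window, then return of the result command word."""
    target = list(range(max(0, current_idx - K + 1), current_idx + 1))
    second_conf = score_window(target)                     # second confidences (step S402)
    if max(second_conf.values()) < first_threshold:
        return None                                        # miss: wait for the next frame (S403)
    verify = list(range(max(0, current_idx - first_number + 1), current_idx + 1))
    first_conf = score_window(verify)                      # first confidences (S404/S405)
    assoc = associated_features([frames[i] for i in verify])
    best_word, best_conf = max(first_conf.items(), key=lambda kv: kv[1])
    if best_conf >= second_threshold and assoc["effective_speech_ratio"] > 0:
        return best_word                                   # result command word to execute (S407)
    return None                                            # secondary verification misses (S406)
```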
In a possible scenario, the method and the device can be applied to detecting whether the received voice data hits the command word or not when the electronic device is already awakened. That is, after the electronic device has been awakened by the voice initiating object through the awakening word, the hit command word is detected based on the received voice data.
In a possible scenario, the present application may also be applied to a scenario that the electronic device does not need to be woken up, that is, the electronic device directly determines whether a command word is hit according to the received voice data without waking up by a wake-up word, which is equivalent to waking up the electronic device and executing an operation indicated by the command word when it is detected that the received voice data hits the command word in the command word set. The command words in the command word set are preset, the electronic equipment is triggered to execute corresponding operations only when the voice data contains the command words, and the accuracy of command word detection is high, so that the voice initiating object can quickly instruct the electronic equipment to execute the corresponding operations through voice instructions without waking the equipment first and then issuing the instructions. It can be understood that, in order to reduce the error recognition rate of the command words, some less-common words may be set in the command words in the preset command word set, or less-common word groups are added in the command words to reduce the error recognition rate of the command words, so that the interactive experience can be greatly improved.
The embodiment of the application provides a data processing scheme, which can realize command word detection based on primary detection (verification) and secondary detection (verification). For example, whether the voice data of the target time window hits a command word in the command word set may be determined according to the audio features corresponding to the voice data of the K voice frames in the target time window. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, so that a first confidence corresponding to the voice data of the verification time window and each command word in the command word set is determined, and an associated feature corresponding to the verification time window is determined; a result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence corresponding to each command word and the associated feature. Optionally, after the result command word is determined, the operation indicated by the result command word may be further performed. Therefore, after the command word is preliminarily determined through primary detection, that is, after the voice data of the target time window is preliminarily determined to hit a command word, secondary detection is performed: a new verification time window is determined to verify again whether the voice data contains the command word, and the associated feature is added during the secondary verification, so that whether the verification time window hits a command word can be determined based on more information, and the accuracy of command word detection on the voice data can be improved.
Referring to fig. 5, fig. 5 is a schematic flowchart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S501, determining a target time window corresponding to the current voice frame, and acquiring audio features corresponding to voice data of K voice frames in the target time window.
S502, determining whether the voice data of the target time window hits the command word in the command word set or not according to the audio characteristics corresponding to the voice data of the K voice frames.
In one possible implementation, any command word in the command word set may have one or more syllables. A syllable is the most natural phonetic unit perceived by hearing and is formed by combining one or several phonemes according to certain rules. In Mandarin, except for individual cases, one Chinese character is one syllable; for example, the command word "turn on the air conditioner" includes 4 syllables. Each command word in the command word set has a corresponding syllable identification sequence, which refers to the sequence of the syllable identifications of the syllables that the command word has, and a syllable identification can be used to characterize a syllable. In a possible embodiment, the syllable identification sequence of each command word may be determined by a pronunciation dictionary, which is a pre-processed dictionary and may include a mapping relationship between each word in a command word and the syllable identification of its syllable, so that the syllable identification of each syllable of each command word, that is, the syllables of the command word, may be determined according to the pronunciation dictionary. It will be appreciated that different words may have the same syllable; for example, the command words "play song" and "cancel heating" both include the syllable "qu".
In a possible implementation manner, determining whether the voice data of the target time window hits a command word in the command word set may be implemented by calculating the probability that the voice data of each voice frame of the target time window corresponds to each syllable, thereby determining a second confidence of each command word; whether a command word is hit may also be determined through a Keyword/Filler HMM Model (a wake-up word detection model); or other methods may be used to determine whether the voice data of the target time window hits a command word in the command word set, which is not limited herein.
In a possible implementation manner, as described above, the command word set includes at least one command word, each command word has a plurality of syllables, and if the determination of whether the command word hits is implemented by determining the second confidence of each command word by determining the probability that the speech data of each speech frame of the target time window corresponds to a syllable, the step S502 may include the following steps:
determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probability that each of the K voice frames corresponds to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on the plurality of syllables that each command word has, and different syllable output units correspond to different syllables. The syllable output unit set refers to a set of classification items into which the syllable corresponding to the voice data of each voice frame can be classified, and the set includes a plurality of output units. For example, if the syllable output unit set includes the syllable output units A, B and C, the voice data of each voice frame can be classified as A, B or C, so that the probability that each of the K voice frames corresponds to the syllable output units A, B and C respectively can be determined. The syllable output unit set determined based on the plurality of syllables that each command word has may specifically be determined based on the union of the syllable identifications of the plurality of syllables that each command word has, where each syllable identification in the union corresponds to one syllable output unit. In one embodiment, the syllable output unit set further includes a garbage syllable output unit, so that in the subsequent classification process a syllable that does not belong to any command word in the command word set can be classified into the garbage syllable output unit. For example, the command word set includes command word 1, command word 2 and command word 3; the syllable identifications of the syllables of command word 1 are s1, s2, s3 and s4, those of command word 2 are s1, s4, s5 and s6, and those of command word 3 are s7, s2, s3 and s1. The union of the syllable identifications is thus s1, s2, s3, s4, s5, s6 and s7, and the syllable output units corresponding to these syllables, together with the garbage syllable output unit, form the syllable output unit set (a brief code sketch of building this set is given after the steps below).
Determining a second confidence corresponding to the voice data of the target time window and each command word according to the probability that each of the K voice frames corresponds to each syllable output unit. The second confidence of any command word may be obtained by taking, for each syllable that the command word has, the maximum probability of the corresponding syllable output unit over the K voice frames, and then taking the product of these maximum probabilities; that is, the second confidence is determined according to the product of the maximum probabilities corresponding to the syllables that the command word has.
And if a command word with a second confidence greater than or equal to the first threshold exists in the command word set, determining the command word with the second confidence greater than or equal to the first threshold as the command word hit by the voice data of the target time window in the command word set. This step can refer to the above description, and is not described herein again.
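For illustration, building the syllable output unit set from a pronunciation dictionary can be sketched as follows; the dictionary entries and the "<garbage>" label are assumptions that mirror the example above.

```python
# Minimal sketch: syllable identification sequences and the syllable output unit set.
PRONUNCIATION_DICT = {
    "command word 1": ["s1", "s2", "s3", "s4"],
    "command word 2": ["s1", "s4", "s5", "s6"],
    "command word 3": ["s7", "s2", "s3", "s1"],
}

def build_syllable_output_units(command_words):
    """Union of the syllable identifications of all command words plus a garbage unit."""
    units = []
    for word in command_words:
        for syllable_id in PRONUNCIATION_DICT[word]:
            if syllable_id not in units:   # different output units correspond to different syllables
                units.append(syllable_id)
    units.append("<garbage>")              # garbage syllable output unit for non-command syllables
    return units

# build_syllable_output_units(["command word 1", "command word 2", "command word 3"])
# -> ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "<garbage>"]
```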
In a possible implementation manner, if any command word in the command set is represented as a target command word, determining a second confidence level of the speech data of the target time window corresponding to each command word according to the probability that K speech frames respectively correspond to each syllable output unit may specifically include the following steps:
determining the syllable output unit corresponding to each syllable of the target command word as a target syllable output unit, to obtain a plurality of target syllable output units corresponding to the target command word. A target syllable output unit is the syllable output unit corresponding to a syllable of the target command word, and the target syllable output units can be determined through the syllable identification sequence of the target command word: since each syllable output unit has a corresponding syllable, the target syllable output units can be determined from the plurality of syllable output units through the syllable identifications in the sequence. For example, if the target command word is "turn on heating", the syllable identifications of its syllables are determined from the pronunciation dictionary as s1, s2, s3 and s4 (which may be referred to as the syllable identification sequence of the target command word), and the syllable output units corresponding to s1, s2, s3 and s4 are identified from the syllable output unit set through this sequence, so that the syllable output units corresponding to s1, s2, s3 and s4 are used as the target syllable output units.
Secondly, determining the probability of the K voice frames corresponding to each target syllable output unit from the probability of the K voice frames corresponding to each syllable output unit respectively, and obtaining K candidate probabilities corresponding to each target syllable output unit respectively. The candidate probability is the probability of the target syllable output unit corresponding to any voice frame. For example, if the target syllable output unit has syllable output units corresponding to s1, s2, s3 and s4 (referred to as syllable output units s1, s2, s3 and s4 here), it is possible to determine the probabilities of s1 corresponding to K speech frames, the probabilities of s2 corresponding to K speech frames, the probabilities of s3 corresponding to K speech frames, and the probabilities of s4 corresponding to K speech frames, that is, the total number of the obtained candidate probabilities corresponds to K × 4.
And thirdly, determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities corresponding to that target syllable output unit, and determining the second confidence corresponding to the voice data of the target time window and the target command word according to the maximum candidate probabilities corresponding to the target syllable output units. Specifically, the second confidence corresponding to the voice data of the target time window and the target command word may be determined according to the product of the maximum candidate probabilities respectively corresponding to the target syllable output units; for example, the product of these maximum candidate probabilities may be directly determined as the second confidence, or the second confidence may be obtained through other mathematical calculations, which is not limited herein. For example, the probabilities of s1 for the K speech frames are {G1_1, G1_2, G1_3, ..., G1_K}, and the maximum among them is the probability G1_10 corresponding to the 10th speech frame of the target time window; the probabilities of s2 for the K speech frames are {G2_1, G2_2, G2_3, ..., G2_K}, and the maximum is the probability G2_25 corresponding to the 25th speech frame; the probabilities of s3 for the K speech frames are {G3_1, G3_2, G3_3, ..., G3_K}, and the maximum is the probability G3_34 corresponding to the 34th speech frame; the probabilities of s4 for the K speech frames are {G4_1, G4_2, G4_3, ..., G4_K}, and the maximum is the probability G4_39 corresponding to the 39th speech frame. The second confidence corresponding to the voice data of the target time window and the target command word can then be determined from the product of G1_10, G2_25, G3_34 and G4_39. It is understood that, by performing the above operation on each command word in the command word set, the second confidence corresponding to each command word can be determined.
In one possible embodiment, the second confidence of the voice data of the target time window corresponding to the target command word is determined according to the maximum candidate probability corresponding to each target syllable output unit, and may be calculated by the following formula (formula 5):

C = \left( \prod_{i=1}^{n-1} \max_{1 \le j \le K} p_{ij} \right)^{\frac{1}{n-1}}

where C represents the second confidence that the audio data of the target time window corresponds to the target command word; n-1 represents the number of target syllable output units corresponding to the target command word, and n represents the number of target syllable output units plus the garbage syllable output unit; i denotes the i-th target syllable output unit and j denotes the j-th speech frame of the target time window, so p_{ij} denotes the probability of the i-th target syllable output unit for the j-th speech frame, and \max_{1 \le j \le K} p_{ij} represents the maximum candidate probability of the i-th target syllable output unit over the K speech frames. The term \prod_{i=1}^{n-1} \max_{1 \le j \le K} p_{ij} represents the product of the maximum candidate probabilities respectively corresponding to the target syllable output units, so that the second confidence of the audio data of the target time window corresponding to the target command word can be obtained based on formula 5.
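As an illustration only, the confidence computation of formula 5 can be sketched in a few lines of Python; the array layout, the function name and the use of NumPy are assumptions, not part of the disclosed implementation.

```python
import numpy as np

def second_confidence(frame_posteriors: np.ndarray, target_unit_indices: list) -> float:
    """Sketch of formula 5: confidence of one command word over the target time window.

    frame_posteriors: array of shape (K, n) with the probability of each syllable
    output unit (including the garbage unit) for each of the K speech frames.
    target_unit_indices: indices of the n-1 target syllable output units of the
    target command word. Shapes and names are assumed for illustration.
    """
    # maximum candidate probability of each target syllable output unit over the K frames
    max_per_unit = frame_posteriors[:, target_unit_indices].max(axis=0)
    n_minus_1 = len(target_unit_indices)
    return float(np.prod(max_per_unit) ** (1.0 / n_minus_1))

# Example: posteriors = np.random.dirichlet(np.ones(8), size=40)   # K = 40 frames, 8 units
#          c = second_confidence(posteriors, [0, 1, 2, 3])
```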
In one possible implementation, whether the target time window hits a command word is determined by a trained primary detection network. In one implementation, the trained primary detection network may be divided into an acoustic model and a confidence generating module. The acoustic model is used to execute the above step of determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probability that each of the K voice frames corresponds to each syllable output unit in the syllable output unit set. The acoustic model usually employs a deep neural network, such as a DNN model, a CNN model or an LSTM model (all neural network models), which is not limited herein. The confidence generating module may be configured to perform the above step of determining, according to the probability that each of the K voice frames corresponds to each syllable output unit, the second confidence corresponding to the voice data of the target time window and each command word, and details are not repeated here. Optionally, the dimension of the result output by the primary detection network is the number of command words in the command word set, and each dimension corresponds to the second confidence of one command word.
For example, please refer to fig. 6, which is a schematic diagram of a framework of a primary detection network according to an embodiment of the present disclosure. As shown in fig. 6, the voice data of the K voice frames in the target time window may first be obtained (as shown in 601 in fig. 6), the audio features of each voice frame are then determined based on 601 (as shown in 602 in fig. 6), the audio features of each voice frame are input into the trained acoustic model in the primary detection network (as shown in 603 in fig. 6), and the results obtained from the acoustic model are input into the confidence generating module (as shown in 604 in fig. 6). The confidence generating module determines, in combination with the pronunciation dictionary (as shown in 605 in fig. 6), the target syllable output units corresponding to the syllables of each command word, and further determines the second confidence corresponding to each command word, such as the confidence of command word 1, the confidence of command word 2, ..., the confidence of command word m, so as to determine whether the audio data of the target time window hits a command word and which primary command word is hit. It is to be understood that if the second confidence of no command word is greater than or equal to the first threshold, the audio data of the target time window has no hit primary command word.
In a possible implementation manner, before the primary command word is determined through the trained primary detection network, the primary detection network needs to be trained, which may specifically include the following steps: firstly, first sample voice data is obtained, and the first sample voice data carries a syllable output unit label. The first sample voice data is used for training the primary detection network; it can be voice data containing a command word, namely positive sample data, or voice data not containing a command word, namely negative sample data, so that training on both positive and negative sample data yields a better training effect. The syllable output unit label labels the syllable output unit actually corresponding to each speech frame in the first sample voice data. It can be understood that, if a speech frame in the first sample voice data actually corresponds to a syllable of some command word in the command word set, the syllable output unit actually corresponding to the speech frame is the syllable output unit corresponding to that syllable; if the speech frame does not correspond to any syllable of the command words in the command word set, the syllable output unit actually corresponding to the speech frame is the garbage syllable output unit.
And calling an initial primary detection network to determine a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data. The initial primary detection network also includes an acoustic model, where the determined predicted syllable output unit can be determined by the acoustic model in the initial primary detection network, specifically, the probability that each speech frame corresponds to each syllable output unit in the syllable output unit set is determined according to the audio characteristics corresponding to the speech data of each speech frame in the first sample speech data, and the determined predicted syllable output unit is further determined based on the probability that each speech frame corresponds to each syllable output unit. The audio characteristics corresponding to the speech data of each speech frame in the first sample speech data are the same as the audio characteristics corresponding to each speech frame in the target time window, and are not described herein again.
And thirdly, training based on the predicted syllable output unit and the syllable output unit label corresponding to the voice data of each voice frame in the first sample voice data to obtain the trained primary detection network. In the training process, the network parameters of the initial primary detection network are adjusted to enable the predicted syllable output unit corresponding to each voice frame to be gradually close to the actual syllable output unit marked by the syllable output unit label, so that the trained primary detection network can accurately predict the probability of each voice frame corresponding to each syllable output unit. It is understood that the predicted syllable output unit is determined by the acoustic model in the primary detection network, that is, the primary detection network is trained to adjust the model parameters of the acoustic model in the primary detection network.
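For illustration, a minimal frame-level training step of the acoustic model might look as follows; the PyTorch framework, the layer sizes and the use of frame-level cross-entropy are assumptions, since the embodiment only requires that the predicted syllable output units be fitted to the syllable output unit labels.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Assumed small feed-forward acoustic model: one logit per syllable output unit."""
    def __init__(self, feat_dim: int, num_units: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_units),      # syllable output units, including the garbage unit
        )

    def forward(self, x):                   # x: (frames, feat_dim) audio features
        return self.net(x)

def train_step(model, optimizer, features, unit_labels):
    """One step: predict a syllable output unit per frame and fit the labels."""
    logits = model(features)                                   # (frames, num_units)
    loss = nn.functional.cross_entropy(logits, unit_labels)    # unit_labels: (frames,) class ids
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```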
In one possible implementation, if the determination of whether a command word is hit is implemented by a Keyword/Filler HMM Model, the primary detection network may be the Keyword/Filler HMM Model. In this case, the probabilities that the K speech frames correspond to each syllable output unit in the syllable output unit set are first determined according to the audio features corresponding to the speech data of the K speech frames, an optimal decoding path is then determined based on the probability corresponding to each syllable output unit, and whether a command word is hit is determined by whether the optimal decoding path passes through the HMM path (hidden Markov path) of the command word; alternatively, a confidence corresponding to each HMM path may be determined based on the probability corresponding to each syllable output unit to determine whether a command word is hit, which is not limited herein. An HMM path may be a command word HMM path or a filler HMM path, where each command word HMM path may be formed by connecting in series the HMM states corresponding to the plurality of syllables of a command word, and the filler HMM path is formed by a set of well-designed HMM states corresponding to non-command-word pronunciation units. The confidence of each HMM path can thus be determined based on the probability corresponding to each syllable output unit, thereby determining whether a command word is hit and which command word is hit.
S503, when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame.
In a possible implementation manner, the first number may be determined according to the length of the command word hit in the target time window; for example, the first number may be determined based on the command word length and a target preset value, so as to determine the verification time window according to the first number of speech frames before the current speech frame. The command word length refers to the number of syllables in the command word. For an ordinary Chinese command word, one word corresponds to one syllable; for example, the command word "turn on the air conditioner" includes four words and correspondingly 4 syllables, that is, the command word length is 4. Specifically, the verification time window may be determined according to the command word length of the primary command word and the target preset value, which may include the following steps:
determining the first number according to the command word length of the primary command word and the target preset value. The target preset value may be a preset value: because of factors such as speaking speed, the pronunciation of a word (a syllable) may span a plurality of speech frames, so the number of speech frames covered by the plurality of syllables of a command word is usually greater than or equal to the number of its syllables. The first number is therefore determined with the help of the target preset value, so that the size of the verification time window covers the speech frames involved in the primary command word as much as possible. In a possible implementation manner, the first number may be obtained by multiplying the command word length of the primary command word by the target preset value, so that the number of the voice frames included in the obtained verification time window is the first number. For example, if the command word length of the primary command word is 4 and the target preset value is 25, the first number may be 4 × 25 = 100, that is, 100 voice frames are included in the verification time window.
Determining a verification time window associated with the current voice frame according to a first number of voice frames before the current voice frame. The first number of voice frames before the current voice frame comprises the current voice frame, and the verification time window associated with the current voice frame is determined according to the first number of voice frames before the current voice frame, namely the current voice frame is used as the last frame of the verification time window. For example, the continuously input speech data includes 1 st, 2 nd, 3.... n speech frames, and if the current speech frame is the 120 th speech frame and the first number is 100 speech frames, the time window with the size of 100, which takes the 120 th speech frame as the last speech frame, may be determined as the verification time window associated with the 120 th speech frame, that is, 100 speech frames (20 th to 120 th speech frames) before the 120 th speech frame are determined as speech frames in the verification time window associated with the 120 th speech frame.
In a possible embodiment, as mentioned above, the first number may be a preset number, and the preset number should cover the primary command words as much as possible, and the preset number may be set based on the longest command word length in the command word set. Specifically, the preset number may be determined based on the longest command word length and the target preset value, and then the preset number is determined as the first number, and then the verification time window is determined according to the first number of voice frames before the current voice frame.
In a possible implementation, the first number may also be determined according to the earliest occurrence time of the primary command word in the target time window, and determining the verification time window may then specifically include the following steps. Firstly, a syllable output unit set is obtained, where the syllable output unit set is determined based on the plurality of syllables of each command word, and different syllable output units correspond to different syllables. Secondly, according to the audio features corresponding to the voice data of the K voice frames, the probability that each of the K voice frames corresponds to each syllable output unit in the syllable output unit set is determined. The related description of these two steps refers to the above description and is not repeated here. Thirdly, the syllable output units corresponding to the syllables hit by the voice data of the target time window are determined as verified syllable output units, and for each verified syllable output unit, the voice frame with the highest corresponding probability among the K voice frames is determined as a target voice frame. A target voice frame is equivalent to the voice frame in which a syllable of the primary command word is detected among the K voice frames, so that the occurrence time of the primary command word can be determined. Finally, the verification time window associated with the current voice frame is determined according to the voice frames between the target voice frame and the current voice frame. Specifically, the verification time window may be determined according to the target voice frame having the largest number of voice frames between it and the current voice frame, that is, the target voice frame spaced farthest from the current voice frame, which represents the earliest occurrence time of the primary command word in the target time window. The first number is then the number of voice frames between the current voice frame and this farthest target voice frame, and the voice frames between them are determined as the voice frames in the verification time window. It is understood that the voice frames between the current voice frame and the target voice frame include the current voice frame and the target voice frame themselves. In this way, a more accurate verification time window can be determined, and the accuracy of command word detection on the voice data in the verification time window is higher. For example, the continuously input voice data includes the 1st, 2nd, 3rd, ..., nth voice frames; if the current voice frame is the 120th voice frame and the target voice frame farthest from it is the 20th voice frame, the 20th to 120th voice frames are determined as the voice frames in the verification time window associated with the 120th voice frame.
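A minimal sketch of this earliest-occurrence strategy is given below; the array layout and function name are assumptions for illustration.

```python
import numpy as np

def verification_window_by_earliest_hit(frame_posteriors: np.ndarray,
                                        verified_unit_indices: list,
                                        current_frame_index: int) -> tuple:
    """Assumed sketch: size the verification window from the earliest detected syllable.

    frame_posteriors: (K, n) probabilities for the K frames of the target time window,
    whose last frame is the current voice frame.
    verified_unit_indices: syllable output units hit by the target time window.
    Returns (start_frame, end_frame) indices in the overall stream.
    """
    K = frame_posteriors.shape[0]
    # for each verified unit, the frame where its probability is highest (a target speech frame)
    best_frames = frame_posteriors[:, verified_unit_indices].argmax(axis=0)
    earliest_in_window = int(best_frames.min())       # target frame farthest from the current frame
    first_number = K - earliest_in_window              # frames between that frame and the current frame
    start_frame = current_frame_index - first_number + 1
    return start_frame, current_frame_index
```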
In a possible implementation manner, the command word set includes command words with different command word lengths, and there are cases where command words share a prefix or contain similar, easily confused words. For example, "turn on heating" and "turn on heating mode" are two command words with the same prefix but different indicated operations. In the actual processing, since the voice data is input frame by frame, when the current voice frame is the voice just after "turn on heating" has been input, the command word "turn on heating" is very likely to be hit based on the target time window corresponding to the current voice frame, although the command word actually intended is "turn on heating mode". A section of voice frames after "turn on heating" can therefore also be included in the verification time window, so that more accurate command word detection can be performed. Taking the case where the reference voice frame is the last frame of the target time window as an example, when the verification time window associated with the current voice frame is determined, a section of voice frames after the current voice frame may also be determined as voice frames in the verification time window, that is, a delay waiting strategy is introduced when determining the verification time window. Even if an early false recognition occurs when the command word is determined through the target time window, the determined verification time window covers a larger time range because of the delay waiting strategy, so the correct command word can be accurately recognized during the secondary verification based on the verification time window, thereby improving the command word recognition accuracy.
Specifically, determining the verification time window associated with the current speech frame may include: determining the verification time window associated with the current speech frame according to a first number of speech frames before the current speech frame and a second number of speech frames after the current speech frame. The first number of speech frames before the current speech frame includes the current speech frame, and the second number of speech frames after the current speech frame also includes the current speech frame, but the plurality of speech frames in the verification time window includes the current speech frame only once. The second number may be a preset value or an empirical value, or may be determined according to the command word lengths of the longest command word in the command word set and of the primary command word. Specifically, a length difference may be obtained by subtracting the command word length of the primary command word from the command word length of the longest command word, and the length difference is multiplied by the target preset value to obtain the second number. For example, if the command word length of the longest command word is 8 and the command word length of the primary command word is 5, the length difference is 8 - 5 = 3, and if the target preset value is 25, the second number may be 3 × 25 = 75. To illustrate how to determine the verification time window: the continuously input speech data includes the 1st, 2nd, 3rd, ..., nth speech frames; if the current speech frame is the 120th speech frame, the first number is 100 and the second number is 75, then the 100 speech frames before the 120th speech frame (the 20th to 120th speech frames) and the 75 speech frames after the 120th speech frame (the 120th to 195th speech frames) can be determined as the speech frames in the verification time window associated with the 120th speech frame, that is, the speech frames in the verification time window include the 20th to 195th speech frames.
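The two sizing rules described above can be sketched as follows; the function name, the default target preset value of 25 and the frame counting convention are assumptions for illustration.

```python
def verification_window_with_delay(current_frame_index: int,
                                   primary_len: int,
                                   longest_len: int,
                                   frames_per_syllable: int = 25) -> tuple:
    """Assumed sketch of the delay-waiting strategy.

    Returns (start_frame, end_frame): `first_number` frames ending at the current frame
    plus `second_number` frames after it.
    """
    first_number = primary_len * frames_per_syllable                    # e.g. 4 * 25 = 100
    second_number = (longest_len - primary_len) * frames_per_syllable   # e.g. (8 - 5) * 25 = 75
    start_frame = current_frame_index - first_number + 1
    end_frame = current_frame_index + second_number
    return start_frame, end_frame

# e.g. current frame 120 with first number 100 and second number 75 gives a window of
# 175 frames that ends 75 frames after the current frame.
```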
S504, according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window, determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively, and determining the correlation characteristics corresponding to the verification time window based on the voice data in the verification time window.
The related description of step S504 may refer to the related description of step S204, which is not described herein again.
And S505, determining a third confidence coefficient corresponding to the voice data of the verification time window and each command word based on the first confidence coefficient and the associated characteristic corresponding to each command word.
The third confidence may be a probability characterizing the possibility that the voice data of the verification time window corresponds to each command word, and each command word may have a corresponding third confidence. It is to be appreciated that the third confidence is equivalent to a calibration of the first confidence; due to the addition of the associated feature, the resulting third confidence can more accurately characterize the likelihood that the voice data of the verification time window is each command word, and determining the hit command word based on the third confidence is more accurate than determining it directly from the first confidence.
The third confidence corresponding to the voice data of the verification time window and each command word is determined based on the first confidence and the associated feature corresponding to each command word. Specifically, a verification feature may be obtained by splicing the first confidence corresponding to each command word with the associated feature, and the third confidence corresponding to each command word is then determined based on the verification feature. The verification feature is thus a feature obtained by splicing the first confidence corresponding to each command word with other information features, where the other information features may be the associated feature.
S506, if the command word set has the command word with the third confidence degree larger than or equal to the second threshold, determining the command word with the third confidence degree larger than or equal to the second threshold and the maximum third confidence degree as a result command word hit by the voice data of the verification time window in the command word set, and executing the operation indicated by the result command word.
The second threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable second threshold may be set to determine the result command word. It is to be understood that if there is no command word in the command word set whose third confidence is greater than or equal to the second threshold, it is determined that the voice data of the verification time window has no hit result command word in the command word set. Optionally, after the result command word is determined, the operation indicated by the result command word may be performed.
In a possible embodiment, if, when the first confidences are determined, a first confidence corresponding to the voice data of the verification time window and the garbage class is also determined, then, when the third confidences are determined, a third confidence corresponding to the voice data of the verification time window and the garbage class may also be determined. The maximum third confidence may then be determined among the third confidences other than the third confidence corresponding to the garbage class. If this maximum third confidence is greater than or equal to the second threshold, the command word corresponding to it is determined as the hit result command word; if it is less than the second threshold, the voice data of the verification time window is classified into the garbage class, that is, the voice data of the verification time window has no hit result command word in the command word set.
For example, the command word set includes command word 1, command word 2, command word 3 and command word 4, and a third confidence corresponding to each command word is obtained based on the first confidence and the associated feature corresponding to each command word: the third confidence corresponding to command word 1 is 0.3, that of command word 2 is 0.73, that of command word 3 is 0.42, that of command word 4 is 0.58, and that of the garbage class is 0.61. If the preset second threshold is 0.6, a command word with a third confidence greater than or equal to the second threshold exists in the command word set, namely command word 2, and command word 2 is the result command word hit by the voice data of the verification time window in the command word set; that is, the input voice data hits command word 2, so the operation indicated by command word 2 can be executed. If the preset second threshold is 0.75, no command word with a third confidence greater than or equal to the second threshold exists in the command word set, so it is determined that the voice data of the verification time window hits no command word in the command word set; a new current voice frame is then determined, and the above steps are repeated, thereby realizing the detection of command words.
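The threshold decision described above can be sketched as follows; the dictionary representation and the "<garbage>" key are assumptions for illustration.

```python
def pick_result_command_word(third_confidences: dict, second_threshold: float):
    """Assumed decision rule: ignore the garbage class, take the largest third
    confidence, and accept it only if it reaches the second threshold."""
    candidates = {w: c for w, c in third_confidences.items() if w != "<garbage>"}
    best_word = max(candidates, key=candidates.get)
    if candidates[best_word] >= second_threshold:
        return best_word      # result command word hit by the verification time window
    return None               # classified into the garbage class: no command word hit

# With the example above: {"command word 1": 0.3, "command word 2": 0.73,
# "command word 3": 0.42, "command word 4": 0.58, "<garbage>": 0.61} and a threshold
# of 0.6 the result command word is "command word 2"; with a threshold of 0.75 it is None.
```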
In one possible implementation, the result command word is determined by a trained secondary detection network, which may include a first confidence generating network and a confidence calibrating network. The first confidence generating network is configured to perform the above step of determining, according to the audio features respectively corresponding to the voice data of the plurality of voice frames in the verification time window, the first confidence corresponding to the voice data of the verification time window and each command word in the command word set; the first confidence generating network may be a deep neural network, such as a CLDNN model (a neural network model). Optionally, the dimension of the result output by the first confidence generating network is the number of command words in the command word set plus 1, where the added dimension is the first confidence corresponding to the garbage class. The confidence calibrating network is configured to perform the above step of determining, based on the first confidence and the associated feature corresponding to each command word, the result command word hit by the voice data of the verification time window in the command word set, and may be a simple multi-layer neural network, such as a multi-layer DNN network (a neural network model). For how the secondary detection network determines the result command word, reference may be made to the related descriptions of steps S504 to S505, which are not repeated here. In one implementation, when the secondary detection network is called to determine the hit result command word according to the audio features corresponding to the voice data of the voice frames in the verification time window, the voice data of the voice frames in the verification time window may be input sequentially, so as to obtain the first confidence corresponding to the voice data of the verification time window and each command word; the third confidence is then determined according to the first confidence and the associated feature, so as to obtain the hit result command word. Optionally, the dimension of the result output by the secondary detection network is likewise the number of command words in the command word set plus 1, where the added dimension is the confidence corresponding to the garbage class.
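A rough sketch of the two parts of such a secondary detection network is given below; the PyTorch framework, the GRU front end standing in for a CLDNN, and the layer sizes are assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class FirstConfidenceNet(nn.Module):
    """Maps the audio features of the verification time window to one confidence per
    command word plus one for the garbage class (num_words + 1 outputs)."""
    def __init__(self, feat_dim: int, num_words: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 64, batch_first=True)
        self.out = nn.Linear(64, num_words + 1)

    def forward(self, feats):                      # feats: (batch, frames, feat_dim)
        _, h = self.rnn(feats)
        return torch.softmax(self.out(h[-1]), dim=-1)

class ConfidenceCalibrationNet(nn.Module):
    """Multi-layer DNN turning the spliced verification feature (first confidences
    concatenated with the associated features) into third confidences."""
    def __init__(self, num_words: int, assoc_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_words + 1 + assoc_dim, 64), nn.ReLU(),
            nn.Linear(64, num_words + 1),
        )

    def forward(self, first_conf, assoc_feat):
        verification_feature = torch.cat([first_conf, assoc_feat], dim=-1)  # splicing
        return torch.softmax(self.mlp(verification_feature), dim=-1)
```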
In a possible implementation manner, before the result command word is determined by the trained secondary detection network, the secondary detection network needs to be trained, which may specifically include the following steps: firstly, second sample voice data is obtained, and the second sample voice data carries a command word label. The second sample voice data may be positive sample data or negative sample data. The positive sample data may be audio data in verification time windows determined based on the trained primary detection network. The negative sample data may be voice data containing various non-command words, and may also be audio data with interference noise, such as audio data with added music and television noise, or synthesized or real audio data in various far-field environments, so that the accuracy of command word detection in far-field or noisy environments can be improved. It can be understood that, in the training process of the primary detection network, the adopted negative sample data does not include audio data with various interference noises, because training the primary detection network with such audio data would rather degrade its classification effect on the syllable output units; the secondary detection network is therefore trained with audio data carrying interference noise, which improves the accuracy of command word detection in the presence of more interference factors, effectively compensates for the shortcoming of the primary detection network, and gives the secondary detection network good complementarity to the primary detection network. The command word label labels the command word actually corresponding to the second sample voice data: if the second sample voice data actually has a corresponding command word, the command word label labels that command word, and if the second sample voice data has no corresponding command word, the command word label labels that the second sample voice data actually belongs to the garbage class.
And secondly, calling a secondary detection network to determine a prediction command word corresponding to the second sample voice data. The determining of the predicted command word may be performed in an initial secondary detection network, and specifically, the determining may be performed by determining, according to audio features corresponding to the voice data of each voice frame in the second sample voice data, a first confidence degree corresponding to the second sample voice data and each command word, and then determining, based on the first confidence degree corresponding to each command word and the associated features, the predicted command word corresponding to the second sample voice data. It can be understood that, if the trained secondary detection network determines the predicted command word corresponding to the second sample voice data based on the first confidence level corresponding to each command word, the associated feature, and the second confidence level corresponding to each command word, when training the secondary detection network, the secondary detection network needs to be trained through the first confidence level corresponding to each command word determined based on the second sample voice data, the associated feature, and the second confidence level corresponding to each command word. The audio features corresponding to the speech data of each speech frame in the second sample speech data are calculated in the same manner as the audio features corresponding to the speech frames in the target time window, which is not repeated herein.
And training based on the predicted command words and the command word labels to obtain a trained secondary detection network. In the training process, the network parameters of the initial secondary detection network are adjusted to enable the predicted command words corresponding to the second sample voice data to be gradually similar to the actual corresponding command words labeled by the command word labels, so that the trained secondary detection network can accurately predict the command words corresponding to the voice data in each verification time window.
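A minimal training step for such a secondary detection network might look as follows, under the assumptions of the previous sketch; the loss choice and tensor shapes are illustrative only.

```python
import torch
import torch.nn as nn

def train_secondary_step(first_conf_net, calib_net, optimizer, feats, assoc_feat, word_label):
    """One assumed step: predict the command word (or garbage class) for a sample
    verification window and fit the command word label."""
    first_conf = first_conf_net(feats)                       # (batch, num_words + 1)
    third_conf = calib_net(first_conf, assoc_feat)           # calibrated confidences
    loss = nn.functional.nll_loss(torch.log(third_conf + 1e-8), word_label)  # word_label: class ids
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```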
It can be understood that, because the electronic devices that need to detect instructions in voice data generally have low hardware configurations in terms of CPU (central processing unit), memory, flash memory and the like, there are relatively strict requirements on the resource occupation of each function. In the present application, command word detection in voice data is mainly performed by the trained primary detection network and secondary detection network; the network structures are relatively simple, the resource occupation on the electronic device is relatively small, and the command word detection performance can be effectively improved. In contrast, recognizing the content of the received voice data based on a speech recognition technology and then determining whether a command word is included requires large-scale acoustic models and language models to achieve a good recognition effect, that is, it needs to occupy more device resources to achieve the same effect.
How command word detection on voice data is implemented through secondary verification is explained with an example; please refer to fig. 7, which is a frame diagram of a data processing method provided by an embodiment of the present application. As shown in fig. 7, the flow of the entire data processing method can be abstracted into a primary verification and a secondary verification. Voice data is input for the primary verification (shown as 701 in fig. 7), which may specifically include determining the audio features of the voice data in the target time window corresponding to the current voice frame (shown as 702 in fig. 7), determining the second confidence of each command word based on the trained primary detection network (shown as 703 and 704 in fig. 7), and performing a threshold judgment to determine whether the target time window hits the target command word. If the target time window hits the target command word, the secondary verification is entered: the voice data of the verification time window is determined (shown as 705 in fig. 7), and the audio features of the voice data in the verification time window associated with the current voice frame are acquired; it can be understood that the audio features in the verification time window may be acquired from the cached audio features of each voice frame. The audio features corresponding to the verification time window are then input into the trained secondary detection network, in which the first confidence of each command word is obtained based on the voice data of the verification time window (shown as 707 in fig. 7) and the associated feature is determined based on the voice data of the verification time window (shown as 706 in fig. 7); the first confidence of each command word and the associated feature are spliced, the third confidence of each command word is determined (shown as 708 in fig. 7), and the final result command word is determined (shown as 709 in fig. 7). By performing secondary verification on the voice data and adding more feature information during the secondary verification, the accuracy of command word detection is improved.
The embodiment of the application provides a data processing scheme which can realize command word detection based on primary detection (verification) and secondary detection (verification). For example, whether the voice data of the target time window hits a command word in the command word set may be determined according to the audio features corresponding to the voice data of the K voice frames in the target time window. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame is determined, so that a first confidence corresponding to the voice data of the verification time window and each command word in the command word set is determined, and an associated feature of the voice data of the verification time window is determined; a result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence and the associated feature corresponding to each command word. Optionally, after the result command word is determined, the operation indicated by the result command word may be further performed. Therefore, after the command word is preliminarily determined through primary detection, that is, after the voice data of the target time window is preliminarily determined to hit a command word, secondary detection is performed: a new verification time window is determined to verify again whether the voice data contains the command word, and the associated feature is added during the secondary verification, so that whether the verification time window hits a command word can be determined based on more information, and the accuracy of command word detection on the voice data can be improved.
Referring to fig. 8, fig. 8 is a schematic flowchart of another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S801, determining a target time window corresponding to the current voice frame, and acquiring audio features corresponding to voice data of K voice frames in the target time window respectively.
S802, determining whether the voice data of the target time window hits the command word in the command word set or not according to the audio characteristics corresponding to the voice data of the K voice frames.
S803, when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame.
S804, according to the audio characteristics respectively corresponding to the voice data of the voice frames in the verification time window, determining a first confidence degree respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determining the correlation characteristics corresponding to the verification time window based on the voice data in the verification time window.
The relevant description of steps S801 to S804 may refer to steps S201 to S204, which are not described herein.
And S805, splicing processing is carried out on the basis of the second confidence coefficient corresponding to each command word, the first confidence coefficient corresponding to each command word and the associated features, and verification features are obtained.
The second confidence may be the confidence determined based on the voice data of the target time window. As described above, the verification feature refers to a feature obtained by splicing the first confidence degree corresponding to each command word with other information features, where the other information features may be the association feature and the second confidence degree corresponding to each command word.
And S806, determining a third confidence degree of the voice data of the verification time window corresponding to each command word based on the verification characteristics.
The determination of the third confidence based on the verification feature may refer to the above description of determining the third confidence corresponding to the voice data of the verification time window and each command word based on the associated feature and the first confidence corresponding to each command word, which is not repeated here.
S807, if there is a command word in the command word set whose third confidence is greater than or equal to the second threshold, determining the command word whose third confidence is greater than or equal to the second threshold and whose third confidence is the largest as a result command word hit by the voice data of the verification time window in the command word set, and executing the operation indicated by the result command word.
Step S807 can refer to the related description of step S506, which is not described herein.
How command word detection on voice data is implemented through secondary verification is explained with another example; please refer to fig. 9, which is a schematic diagram of a framework of another data processing method provided by an embodiment of the present application. As shown in fig. 9, the flow of the entire data processing method can be abstracted into a primary verification and a secondary verification. Voice data is input for the primary verification (shown as 901 in fig. 9), which may specifically include determining the audio features of the voice data in the target time window corresponding to the current voice frame (shown as 902 in fig. 9), determining the second confidence of each command word based on the trained primary detection network (shown as 903 and 904 in fig. 9), and performing a threshold judgment to determine whether the target time window hits the target command word. If the target time window hits the target command word, the secondary verification is entered: the voice data of the verification time window is determined (shown as 905 in fig. 9), and the audio features of the voice data in the verification time window associated with the current voice frame are acquired; it can be understood that the audio features in the verification time window may be acquired from the cached audio features of each voice frame. The audio features corresponding to the verification time window are then input into the trained secondary detection network, in which the first confidence of each command word is obtained based on the voice data of the verification time window (shown as 907 in fig. 9) and the associated feature is determined based on the voice data of the verification time window (shown as 906 in fig. 9); the first confidence of each command word is spliced with the associated feature and the second confidence of each command word (shown as 908 in fig. 9), the third confidence of each command word is determined (shown as 909 in fig. 9), and the final result command word is determined (shown as 910 in fig. 9). The final command word can thus be determined through secondary verification of the voice data, and more feature information is added during the secondary verification, so that the accuracy of command word detection is improved.
In a possible implementation manner, splicing processing may also be performed based on the second confidence corresponding to each command word and the first confidence corresponding to each command word to obtain the verification feature, and the third confidence corresponding to the voice data of the verification time window and each command word may then be determined based on the verification feature. It can be understood that, when the verification feature is determined, it may be obtained by splicing the first confidence corresponding to each command word with one or both of the second confidence corresponding to each command word and the associated features; the verification feature may also be obtained by splicing the first confidence corresponding to each command word with other associated features. In this way, more feature information of the voice data is introduced when the result command word is determined, which greatly improves the accuracy of command word detection.
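As a purely illustrative example, the splicing processing may look as follows, assuming a command word set of three command words and hypothetical confidence and associated-feature values; the concrete numbers and dimensions are not from the embodiment.

```python
import numpy as np

first_conf = np.array([0.82, 0.10, 0.05])        # first confidence per command word (hypothetical)
second_conf = np.array([0.76, 0.15, 0.08])       # second confidence per command word (hypothetical)
assoc_feat = np.array([0.63, 0.71, 12.4, 98.0])  # e.g. average energy, valid-speech ratio, SNR, frame count

# Splicing the first confidence with both the second confidence and the associated features.
verification_feature = np.concatenate([first_conf, second_conf, assoc_feat])

# Alternative splice: the first confidence with the associated features only.
verification_feature_alt = np.concatenate([first_conf, assoc_feat])
```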
Optionally, after the result command word is determined, the operation indicated by the result command word may be executed.
The embodiment of the application provides a data processing scheme which can realize command word detection based on primary detection (verification) and secondary detection (verification). For example, whether the voice data of the target time window hits a command word in the command word set may be determined according to the audio features corresponding to the voice data of the K voice frames in the target time window. When the voice data of the target time window hits a command word in the command word set, a verification time window associated with the current voice frame may be determined, the first confidence corresponding to the voice data in the verification time window and each command word in the command word set may be determined, and the associated features corresponding to the verification time window may be determined. The result command word hit by the voice data of the verification time window in the command word set is then determined based on the first confidence corresponding to each command word and the associated features. Optionally, after the result command word is determined, the operation indicated by the result command word may be further performed. Therefore, after primary detection preliminarily determines that the voice data of the target time window hits a command word, secondary detection is performed: a new verification time window is determined to verify again whether the voice data contains the command word, and the associated features are added during the secondary verification, so that whether the verification time window hits the command word can be determined based on more information, and the accuracy of command word detection on the voice data can be improved.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. Alternatively, the data processing apparatus may be disposed in the electronic device. As shown in fig. 10, the data processing apparatus described in the present embodiment may include:
an obtaining unit 1001, configured to determine a target time window corresponding to a current speech frame, and obtain audio features corresponding to speech data of K speech frames in the target time window, where K is a positive integer;
a processing unit 1002, configured to determine, according to audio features corresponding to the voice data of the K voice frames, whether the voice data of the target time window hits a command word in a command word set, where the command word set includes at least one command word;
the processing unit 1002 is further configured to determine a verification time window associated with the current voice frame when the voice data of the target time window hits a command word in the command word set;
the processing unit 1002 is further configured to determine, according to audio features respectively corresponding to voice data of a plurality of voice frames in the verification time window, first confidence degrees respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determine, based on the voice data in the verification time window, an associated feature corresponding to the verification time window;
the processing unit 1002 is further configured to determine a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristic.
In an implementation manner, the processing unit 1002 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the target time window and each command word in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
if the command word set has a command word with a second confidence degree larger than or equal to a first threshold value, determining that the voice data of the target time window hits the command word in the command word set;
and if no command word with the second confidence degree larger than or equal to the first threshold value exists in the command word set, determining that the voice data of the target time window does not hit the command word in the command word set.
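A brief sketch of this hit decision follows; the threshold value is assumed for illustration only.

```python
import numpy as np

def hits_command_word(second_conf: np.ndarray, first_threshold: float = 0.5) -> bool:
    # The target time window hits a command word as soon as any second
    # confidence reaches the first threshold.
    return bool((second_conf >= first_threshold).any())
```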
In an implementation manner, the processing unit 1002 is specifically configured to:
performing splicing processing on the basis of the second confidence degree corresponding to each command word, the first confidence degree corresponding to each command word and the associated features to obtain verification features;
determining a third confidence level that the voice data of the verification time window corresponds to each command word based on the verification features;
and if command words with the third confidence degree larger than or equal to the second threshold exist in the command word set, determining the command words with the third confidence degree larger than or equal to the second threshold and the maximum third confidence degree as result command words hit by the voice data of the verification time window in the command word set.
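The selection of the result command word may be sketched as follows; the function name and the threshold value are illustrative assumptions.

```python
import numpy as np

def select_result_command_word(third_conf: np.ndarray, command_words: list,
                               second_threshold: float = 0.7):
    # Keep command words whose third confidence reaches the second threshold,
    # then return the one with the largest third confidence (None if no hit).
    qualified = np.where(third_conf >= second_threshold)[0]
    if qualified.size == 0:
        return None
    return command_words[int(qualified[np.argmax(third_conf[qualified])])]
```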
In an implementation manner, the processing unit 1002 is specifically configured to:
determining a third confidence degree corresponding to the voice data of the verification time window and each command word based on the first confidence degree corresponding to each command word and the associated characteristics;
and if command words with the third confidence degree larger than or equal to the second threshold exist in the command word set, determining the command words with the third confidence degree larger than or equal to the second threshold and the maximum third confidence degree as result command words hit by the voice data of the verification time window in the command word set.
In one implementation, each command word in the set of command words has a plurality of syllables; the processing unit 1002 is specifically configured to:
obtaining a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and the syllables corresponding to different syllable output units are different;
determining the probability that the K voice frames correspond to each syllable output unit in the syllable output unit set respectively according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
determining a syllable output unit corresponding to a syllable of a command word hit by the voice data of the target time window as a verified syllable output unit, and determining a voice frame with the highest probability corresponding to the verified syllable output unit from the K voice frames as a target voice frame;
and determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
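Assuming that per-frame probabilities over the syllable output units are available as a matrix, determining the verification time window may be sketched as follows; which syllable of the hit command word supplies the verified syllable output unit is passed in as an input here, since the embodiment only requires that it correspond to a syllable of the hit command word.

```python
import numpy as np

def verification_window(syllable_probs: np.ndarray, verified_unit: int,
                        window_start_idx: int, current_idx: int):
    """syllable_probs: shape (K, num_syllable_units), probabilities for the K frames
    of the target time window, whose first frame has absolute index window_start_idx."""
    # Target speech frame: the frame with the highest probability for the verified unit.
    peak_offset = int(np.argmax(syllable_probs[:, verified_unit]))
    target_idx = window_start_idx + peak_offset
    # The verification time window is formed from the frames between the target
    # speech frame and the current speech frame.
    return target_idx, current_idx
```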
In one implementation, the associated features include at least one of: a first average energy of the voice data in the verification time window, an effective voice proportion of the voice data in the verification time window, a signal-to-noise ratio of the voice data in the verification time window, and a number of voice frames in the verification time window.
In one implementation, the processing unit 1002 is further configured to:
determining a first average energy of the speech data of the verification time window based on the energy of the speech data of each speech frame in the verification time window;
determining the effective voice proportion of the voice data in the verification time window according to the number of the effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy and the first average energy of the valid voice frames in the verification time window.
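A sketch of computing these associated features from per-frame energies is given below; the decibel form of the signal-to-noise ratio is an assumption for illustration, since the embodiment only states that the ratio is determined from the second average energy and the first average energy.

```python
import numpy as np

def associated_features(frame_energies: np.ndarray) -> np.ndarray:
    first_avg = frame_energies.mean()                # first average energy of the window
    valid_mask = frame_energies >= first_avg         # valid speech frames
    valid_ratio = valid_mask.mean()                  # effective voice proportion
    second_avg = frame_energies[valid_mask].mean()   # second average energy (valid frames only)
    snr = 10.0 * np.log10(second_avg / max(first_avg, 1e-12))  # assumed dB form
    return np.array([first_avg, valid_ratio, snr, float(frame_energies.size)])
```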
Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device described in this embodiment includes: a processor 1101, a memory 1102. Optionally, the electronic device may further include a network interface or a power supply module. Data can be exchanged between the processor 1101 and the memory 1102.
The processor 1101 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The network interface may include an input device such as a control panel, a microphone, or a receiver, and/or an output device such as a display screen or a transmitter, which are not exhaustively listed here.
The memory 1102, which may include both read-only memory and random-access memory, provides program instructions and data to the processor 1101. A portion of the memory 1102 may also include non-volatile random access memory. When calling the program instructions, the processor 1101 is configured to:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining whether the voice data of the target time window hits a command word in a command word set according to audio characteristics corresponding to the voice data of the K voice frames respectively, wherein the command word set comprises at least one command word;
when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame;
determining a first confidence coefficient corresponding to the voice data in the verification time window and each command word in the command word set respectively according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window respectively, and determining an associated characteristic corresponding to the verification time window based on the voice data in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristics.
In one implementation, the processor 1101 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the target time window and each command word in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
if command words with second confidence degrees larger than or equal to a first threshold value exist in the command word set, determining that the voice data of the target time window hits the command words in the command word set;
and if no command word with the second confidence degree larger than or equal to the first threshold value exists in the command word set, determining that the voice data of the target time window does not hit the command word in the command word set.
In one implementation, the processor 1101 is specifically configured to:
performing splicing processing on the basis of the second confidence degree corresponding to each command word, the first confidence degree corresponding to each command word and the associated features to obtain verification features;
determining a third confidence level that the voice data of the verification time window corresponds to each command word based on the verification features;
and if the command word with the third confidence coefficient larger than or equal to the second threshold value exists in the command word set, determining the command word with the third confidence coefficient larger than or equal to the second threshold value and the maximum third confidence coefficient as a result command word hit by the voice data of the verification time window in the command word set.
In one implementation, the processor 1101 is specifically configured to:
determining a third confidence degree corresponding to the voice data of the verification time window and each command word based on the first confidence degree corresponding to each command word and the associated characteristics;
and if the command word with the third confidence coefficient larger than or equal to the second threshold value exists in the command word set, determining the command word with the third confidence coefficient larger than or equal to the second threshold value and the maximum third confidence coefficient as a result command word hit by the voice data of the verification time window in the command word set.
In one implementation, each command word in the set of command words has a plurality of syllables; the processor 1101 is specifically configured to:
obtaining a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and the syllables corresponding to different syllable output units are different;
determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
determining a syllable output unit corresponding to a syllable of a command word hit by the voice data of the target time window as a verified syllable output unit, and determining a voice frame with the highest probability corresponding to the verified syllable output unit in the K voice frames as a target voice frame;
and determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
In one implementation, the associated features include at least one of: a first average energy of the voice data in the verification time window, an effective voice proportion of the voice data in the verification time window, a signal-to-noise ratio of the voice data in the verification time window, and a number of voice frames in the verification time window.
In one implementation, the processor 1101 is further configured to:
determining a first average energy of the speech data for the verification time window based on the energy of the speech data for each speech frame in the verification time window;
determining the effective voice proportion of the voice data in the verification time window according to the number of effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy and the first average energy of the valid voice frames in the verification time window.
Optionally, the program instructions may also implement other steps of the method in the above embodiments when executed by the processor, and details are not described here.
The present application further provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the above method, such as performing the above method performed by an electronic device, which is not described herein in detail.
Optionally, the storage medium, such as a computer-readable storage medium, referred to herein may be non-volatile or volatile.
Alternatively, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with each other by cryptographic methods, where each data block contains information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments, and that the actions or modules involved are not necessarily required for this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware following program instructions, and the program may be stored in a computer-readable storage medium; the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions that, when executed by a processor, implement some or all of the steps of the above-described method. The computer instructions are stored, for example, in a computer-readable storage medium. A processor of a computer device (i.e., the electronic device described above) reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the steps performed in the above method embodiments. For example, the computer device may be a terminal, or may be a server.

The foregoing has described in detail a data processing method, an apparatus, an electronic device, a program product, and a medium provided by embodiments of the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only used to help understand the method and core idea of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation to the present application.

Claims (11)

1. A method of data processing, the method comprising:
determining a target time window corresponding to a current voice frame, and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, wherein K is a positive integer;
determining whether the voice data of the target time window hits a command word in a command word set or not according to audio characteristics corresponding to the voice data of the K voice frames respectively, wherein the command word set comprises at least one command word;
when the voice data of the target time window hits the command word in the command word set, determining a verification time window associated with the current voice frame;
determining a first confidence degree corresponding to the voice data in the verification time window and each command word in the command word set respectively according to the audio characteristics corresponding to the voice data of the voice frames in the verification time window respectively, and determining an associated characteristic corresponding to the verification time window based on the voice data in the verification time window;
and determining a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristics.
2. The method of claim 1, wherein the determining whether the voice data of the target time window hits a command word in a command word set according to the audio features corresponding to the voice data of the K voice frames, respectively, comprises:
determining a second confidence coefficient corresponding to the voice data of the target time window and each command word in the command word set according to the audio characteristics corresponding to the voice data of the K voice frames respectively;
if the command word set has a command word with a second confidence degree larger than or equal to a first threshold value, determining that the voice data of the target time window hits the command word in the command word set;
and if no command word with the second confidence coefficient larger than or equal to the first threshold value exists in the command word set, determining that the voice data of the target time window does not hit the command word in the command word set.
3. The method of claim 2, wherein the determining the command word as a result of the hit of the voice data of the verification time window in the command word set based on the first confidence level corresponding to each command word and the associated feature comprises:
splicing the second confidence coefficient corresponding to each command word, the first confidence coefficient corresponding to each command word and the associated features to obtain verification features;
determining a third confidence level that the voice data of the verification time window corresponds to each command word based on the verification features;
and if command words with the third confidence degree larger than or equal to the second threshold exist in the command word set, determining the command words with the third confidence degree larger than or equal to the second threshold and the maximum third confidence degree as result command words hit by the voice data of the verification time window in the command word set.
4. The method of claim 1, wherein the determining the command word as a result of the hit of the voice data of the verification time window in the command word set based on the first confidence level corresponding to each command word and the associated feature comprises:
determining a third confidence coefficient corresponding to the voice data of the verification time window and each command word based on the first confidence coefficient corresponding to each command word and the associated features;
and if command words with the third confidence degree larger than or equal to the second threshold exist in the command word set, determining the command words with the third confidence degree larger than or equal to the second threshold and the maximum third confidence degree as result command words hit by the voice data of the verification time window in the command word set.
5. The method of claim 1, wherein each command word in the set of command words has a plurality of syllables; said determining a verification time window associated with said current speech frame comprises:
obtaining a syllable output unit set, wherein the syllable output unit set is determined based on a plurality of syllables of each command word, and the syllables corresponding to different syllable output units are different;
determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
determining a syllable output unit corresponding to a syllable of a command word hit by the voice data of the target time window as a verified syllable output unit, and determining a voice frame with the highest probability corresponding to the verified syllable output unit from the K voice frames as a target voice frame;
and determining a verification time window associated with the current voice frame according to the voice frame between the target voice frame and the current voice frame.
6. The method of claim 1, wherein the associated features comprise at least one of: a first average energy of the voice data in the verification time window, an effective voice proportion of the voice data in the verification time window, a signal-to-noise ratio of the voice data in the verification time window, and a number of voice frames in the verification time window.
7. The method of claim 6, further comprising:
determining a first average energy of the speech data for the verification time window based on the energy of the speech data for each speech frame in the verification time window;
determining the effective voice proportion of the voice data in the verification time window according to the number of the effective voice frames in the verification time window, wherein the effective voice frames are voice frames with energy larger than or equal to the first average energy;
and determining the signal-to-noise ratio of the voice data of the verification time window according to the second average energy and the first average energy of the valid voice frames in the verification time window.
8. A data processing apparatus, characterized in that the apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for determining a target time window corresponding to a current voice frame and acquiring audio characteristics corresponding to voice data of K voice frames in the target time window respectively, and K is a positive integer;
the processing unit is used for determining whether the voice data of the target time window hits a command word in a command word set according to the audio characteristics corresponding to the voice data of the K voice frames, wherein the command word set comprises at least one command word;
the processing unit is further configured to determine a verification time window associated with the current speech frame when the speech data of the target time window hits a command word in the command word set;
the processing unit is further configured to determine, according to audio features respectively corresponding to voice data of a plurality of voice frames in the verification time window, first confidence degrees respectively corresponding to the voice data in the verification time window and each command word in the command word set, and determine, based on the voice data of the plurality of voice frames in the verification time window, associated features corresponding to the verification time window;
the processing unit is further configured to determine a result command word hit by the voice data of the verification time window in the command word set based on the first confidence degree corresponding to each command word and the association characteristic.
9. An electronic device comprising a processor, a memory, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer program product comprising computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202210597334.6A 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium Active CN115132197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210597334.6A CN115132197B (en) 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium

Publications (2)

Publication Number Publication Date
CN115132197A true CN115132197A (en) 2022-09-30
CN115132197B CN115132197B (en) 2024-04-09

Family

ID=83378657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210597334.6A Active CN115132197B (en) 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium

Country Status (1)

Country Link
CN (1) CN115132197B (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6697782B1 (en) * 1999-01-18 2004-02-24 Nokia Mobile Phones, Ltd. Method in the recognition of speech and a wireless communication device to be controlled by speech
US20060136207A1 (en) * 2004-12-21 2006-06-22 Electronics And Telecommunications Research Institute Two stage utterance verification device and method thereof in speech recognition system
CN103065631A (en) * 2013-01-24 2013-04-24 华为终端有限公司 Voice identification method and device
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment
CN110570840A (en) * 2019-09-12 2019-12-13 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN110706691A (en) * 2019-10-12 2020-01-17 出门问问信息科技有限公司 Voice verification method and device, electronic equipment and computer readable storage medium
CN110718212A (en) * 2019-10-12 2020-01-21 出门问问信息科技有限公司 Voice wake-up method, device and system, terminal and computer readable storage medium
CN110890093A (en) * 2019-11-22 2020-03-17 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111128182A (en) * 2019-12-16 2020-05-08 中国银行股份有限公司 Intelligent voice recording method and device
CN111369980A (en) * 2020-02-27 2020-07-03 网易有道信息技术(北京)有限公司江苏分公司 Voice detection method and device, electronic equipment and storage medium
CN111739521A (en) * 2020-06-19 2020-10-02 腾讯科技(深圳)有限公司 Electronic equipment awakening method and device, electronic equipment and storage medium
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium
US20210065679A1 (en) * 2019-08-26 2021-03-04 Nice Ltd. System and method for combining phonetic and automatic speech recognition search
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN112599127A (en) * 2020-12-04 2021-04-02 腾讯科技(深圳)有限公司 Voice instruction processing method, device, equipment and storage medium
US11024291B2 (en) * 2018-11-21 2021-06-01 Sri International Real-time class recognition for an audio stream
CN113314099A (en) * 2021-07-28 2021-08-27 北京世纪好未来教育科技有限公司 Method and device for determining confidence coefficient of speech recognition
US20220101830A1 (en) * 2020-09-28 2022-03-31 International Business Machines Corporation Improving speech recognition transcriptions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张秋余; 赵彦敏; 李建海: "Research on speaker-independent command word recognition algorithm based on Chinese speech phonemes", 科学技术与工程 (Science Technology and Engineering), no. 08, 15 April 2008 (2008-04-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132198A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing device, electronic equipment, program product and medium
CN115132198B (en) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium

Also Published As

Publication number Publication date
CN115132197B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US10373609B2 (en) Voice recognition method and apparatus
CN107316643B (en) Voice interaction method and device
US10861480B2 (en) Method and device for generating far-field speech data, computer device and computer readable storage medium
US20060053009A1 (en) Distributed speech recognition system and method
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
CN110428854B (en) Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN111243569A (en) Emotional voice automatic generation method and device based on generation type confrontation network
US11361764B1 (en) Device naming-indicator generation
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN112992191A (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
US11763806B1 (en) Speaker recognition adaptation
CN113077812A (en) Speech signal generation model training method, echo cancellation method, device and equipment
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN116129942A (en) Voice interaction device and voice interaction method
CN115132198B (en) Data processing method, device, electronic equipment, program product and medium
CN112735381B (en) Model updating method and device
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant