CN115132198B - Data processing method, device, electronic equipment, program product and medium - Google Patents

Data processing method, device, electronic equipment, program product and medium

Info

Publication number
CN115132198B
CN115132198B (application CN202210597464.XA)
Authority
CN
China
Prior art keywords
command word
time window
voice
voice data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210597464.XA
Other languages
Chinese (zh)
Other versions
CN115132198A (en)
Inventor
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210597464.XA priority Critical patent/CN115132198B/en
Publication of CN115132198A publication Critical patent/CN115132198A/en
Application granted granted Critical
Publication of CN115132198B publication Critical patent/CN115132198B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 - Syllables being the recognition units
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0638 - Interactive procedures
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application disclose a data processing method, a data processing apparatus, an electronic device, a program product and a medium, which can be applied to the technical field of data processing. The method comprises the following steps: determining, according to the audio features respectively corresponding to the voice data of K speech frames in a target time window, a first command word hit by the voice data of the target time window in a command word set; determining a characteristic time window associated with the current speech frame based on the command word length of the first command word; and determining, based on the audio features respectively corresponding to the voice data of a plurality of speech frames in the characteristic time window, a second command word hit by the voice data of the characteristic time window in the command word set. By adopting the embodiments of the present application, the accuracy of command word detection on voice data is improved. The embodiments of the present application can also be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, assisted driving and smart home appliances.

Description

Data processing method, device, electronic equipment, program product and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, electronic device, program product, and medium.
Background
Voice detection technology is currently in wide use, and many intelligent devices (such as in-vehicle systems, smart speakers and smart home appliances) provide a voice detection function: they can receive instructions issued in voice form, detect those instructions in the received voice data, and execute the corresponding operations. However, the inventors found in practice that when instructions in voice data are detected in this way, the accuracy of command word detection is low.
Disclosure of Invention
The embodiments of the present application provide a data processing method, apparatus, electronic device, program product and medium, which help improve the accuracy of command word detection on voice data.
In one aspect, an embodiment of the present application discloses a data processing method, where the method includes:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining a first command word hit by the voice data of the target time window in a command word set;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window;
and determining a second command word hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the voice frames in the characteristic time window.
In one aspect, an embodiment of the present application discloses a data processing apparatus, the apparatus including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for determining a target time window corresponding to a current voice frame and acquiring audio characteristics respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
the processing unit is used for determining a first command word hit by the voice data of the target time window in a command word set according to the audio characteristics respectively corresponding to the voice data of the K voice frames;
the processing unit is further configured to determine a feature time window associated with the current speech frame based on a command word length of the first command word, and obtain audio features corresponding to speech data of a plurality of speech frames in the feature time window respectively;
the processing unit is further configured to determine, based on the audio features respectively corresponding to the voice data of the plurality of voice frames in the characteristic time window, a second command word hit by the voice data of the characteristic time window in the command word set.
In one aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, where the memory is configured to store a computer program, the computer program including program instructions, the processor being configured to perform the steps of:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining a first command word hit by the voice data of the target time window in a command word set;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window;
and determining a second command word hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the voice frames in the characteristic time window.
In one aspect, embodiments of the present application provide a computer readable storage medium having stored therein computer program instructions which, when executed by a processor, are configured to perform the steps of:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining a first command word hit by the voice data of the target time window in a command word set;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window;
and determining a second command word hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the voice frames in the characteristic time window.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement the method provided in one of the aspects above.
According to the data processing scheme, a first command word hit by the voice data of the target time window in the command word set is determined from the audio features respectively corresponding to the voice data of the K speech frames in the target time window, which amounts to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as its command word length, and a second command word hit by the voice data of the characteristic time window in the command word set is determined from the audio features respectively corresponding to the voice data of the speech frames in the characteristic time window; this amounts to determining a new characteristic time window and verifying a second time whether the continuously input voice data contains a command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. In this way, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to verify whether the voice data contains the command word, so that the accuracy of command word detection on voice data can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the effect of a target time window according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another data processing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a framework of a primary detection network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
According to the data processing scheme, a first command word hit by the voice data of the target time window in the command word set is determined from the audio features respectively corresponding to the voice data of the K speech frames in the target time window, which amounts to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as its command word length, and a second command word hit by the voice data of the characteristic time window in the command word set is determined from the audio features respectively corresponding to the voice data of the speech frames in the characteristic time window; this amounts to determining a new characteristic time window and verifying a second time whether the continuously input voice data contains a command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. In this way, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to verify whether the voice data contains the command word, so that the accuracy of command word detection on voice data can be improved.
In one possible implementation, the embodiment of the present application may be applied to a data processing system, and referring to fig. 1, fig. 1 is a schematic structural diagram of a data processing system provided by the embodiment of the present application. As shown in fig. 1, the data processing system may include a voice-initiated object and a data processing device. Wherein the voice-initiated object may be used to send voice data to the data processing device, the voice-initiated object may be a user or device, etc., that needs to request the data processing device to respond, without limitation. The data processing device may execute the above-mentioned data processing scheme, and may perform corresponding operations based on the received voice data, for example, the data processing device may be an in-vehicle system, a smart speaker, a smart home appliance, or the like. That is, after the voice initiating object outputs the voice data, the data processing device may receive the voice data, and further the data processing device may detect a command word in the voice data based on the above data processing scheme, and then perform an operation corresponding to the detected command word. It will be appreciated that before the data processing apparatus detects the voice data, a command word set may be preset, where the command word set includes a command word or a plurality of command words, each command word may be associated with a corresponding operation, for example, an operation of turning on an air conditioner is associated with a command word "turning on an air conditioner", and when the data processing apparatus detects the voice data including the command word "turning on an air conditioner", the data processing apparatus may perform the operation of turning on an air conditioner. According to the data processing scheme, after the command word hit of the voice data is primarily determined based on the target time window, a new characteristic time window is determined to verify whether the voice data contains the command word or not, so that the accuracy of command word detection of the voice data by the data processing equipment in the data processing system can be improved, and a user can conveniently and accurately instruct the data processing equipment to execute corresponding operation through the voice.
It should be noted that, before and during the process of collecting the relevant data of the user, the present application may display a prompt interface, a popup window or output a voice prompt message, where the prompt interface, popup window or voice prompt message is used to prompt the user to collect the relevant data currently, so that the present application only starts to execute the relevant step of obtaining the relevant data of the user after obtaining the confirmation operation of the user to the prompt interface or popup window, otherwise (i.e. when the confirmation operation of the user to the prompt interface or popup window is not obtained), the relevant step of obtaining the relevant data of the user is finished, i.e. the relevant data of the user is not obtained. In other words, all user data collected in the present application is collected with the consent and authorization of the user, and the collection, use and processing of relevant user data requires compliance with relevant laws and regulations and standards of the relevant country and region.
In one possible implementation, the present embodiments may be applied in the field of artificial intelligence technology, where artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and extend human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
In one possible implementation, the embodiments of the present application may also be applied in the field of speech technology, for example to detect the command word hit by voice data as described above. The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of human-computer interaction in the future, and voice has become one of the most promising human-computer interaction modes.
The technical scheme of the present application can be applied to an electronic device, such as the data processing device described above. The electronic device may be a terminal, a server, or another device used for data processing, which is not limited in this application. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms. Terminals include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, smart speakers, and the like.
It can be understood that the above scenario is merely an example, and does not constitute a limitation on the application scenario of the technical solution provided in the embodiments of the present application, and the technical solution of the present application may also be applied to other scenarios. For example, as one of ordinary skill in the art can know, with the evolution of the system architecture and the appearance of new service scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
Based on the above description, the embodiments of the present application provide a data processing method. Referring to fig. 2, fig. 2 is a flow chart of a data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S201, determining a target time window corresponding to the current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window.
The current speech frame may be any speech frame of the acquired speech. It will be appreciated that the acquired speech may be real-time speech; for speech data that is continuously input in real time, the current speech frame may be the latest speech frame in the continuously input speech data. The acquired speech may also be non-real-time speech; for example, for a whole segment of speech data generated in advance, each speech frame may be determined in turn as the current speech frame according to the order of the speech frames in the speech data.
A speech frame may comprise several sampling points, i.e., the voice data of a number of consecutive sampling points constitutes the voice data of one speech frame. It will be appreciated that the time difference between adjacent sampling points is the same. Two adjacent speech frames may contain partially repeated sampling points or completely different sampling points, which is not limited here. For example, in a 10 s section of input voice data, one sampling point is determined every 10 ms and 20 consecutive sampling points are determined as one speech frame: the 1st to 20th sampling points are determined as one speech frame, the 21st to 40th sampling points as the next, and so on, to obtain a plurality of speech frames. As another example, to avoid the audio data of two adjacent speech frames changing too abruptly, adjacent speech frames may share a section of overlapping sampling points: for example, the 1st to 20th sampling points are determined as one speech frame, the 15th to 35th sampling points as another, the 30th to 50th sampling points as another, and so on, to obtain a plurality of speech frames.
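As a non-limiting illustration (not part of the original disclosure), the following Python sketch shows one way such framing could be implemented; the frame length of 20 sampling points and the hop of 15 sampling points are taken from the example above and are assumptions rather than fixed parameters of the method.

```python
import numpy as np

def split_into_frames(samples: np.ndarray, frame_len: int = 20, hop: int = 15) -> np.ndarray:
    """Split a 1-D array of sampling points into (possibly overlapping) speech frames.

    hop == frame_len gives non-overlapping frames; hop < frame_len gives the
    overlapping frames described above.
    """
    if len(samples) < frame_len:
        return np.empty((0, frame_len))
    starts = range(0, len(samples) - frame_len + 1, hop)
    return np.stack([samples[s:s + frame_len] for s in starts])

# 10 s of audio with one sampling point every 10 ms -> 1000 sampling points
samples = np.random.randn(1000)
frames = split_into_frames(samples)                      # overlapping frames (hop 15)
frames_no_overlap = split_into_frames(samples, hop=20)   # non-overlapping frames
```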
The target time window corresponding to the current speech frame may be a time window using the current speech frame as a reference speech frame, i.e. the target time window includes the current speech frame. The target time window may include a plurality of voice frames, for example, K voice frames may be included in the target time window, where K is a positive integer, that is, K may be the number of all voice frames in the target time window. Optionally, the K speech frames may also be selected partial speech frames from all speech frames in the target time window, i.e. K may be less than or equal to the number of all speech frames in the target time window, for example, after determining the target time window, energy of each speech frame in the target time window is calculated, and then speech frames with energy lower than a certain threshold are removed, so as to obtain the K speech frames, so that some speech frames with very small sound can be filtered, and the calculation amount in the subsequent processing process is reduced. The reference speech frame of a target time window indicates that the time window is divided based on the reference speech frame, for example, the reference speech frame may be the first speech frame, the last speech frame, or the speech frame of the center position of a time window, which is not limited herein. The first speech frame and the last speech frame are characterized according to time sequence, wherein the first speech frame represents the speech frame with the earliest input time in the time window, and the last speech frame represents the speech frame with the latest input time in the time window. The target time window corresponding to the current speech frame may be a time window with the current speech frame as the first speech frame, or may be a time window with the current speech frame as the last speech frame, or may be a time window with the current speech frame as the central position of the speech frame, which is not limited herein. K may be preset, or may be determined based on the length of the acquired voice, or may be determined based on the length of the command words in the command word set, such as the maximum length or the average length, and the like, which is not limited herein.
Alternatively, the target time window corresponding to the current speech frame may not include the current speech frame. For example, when the reference speech frame is the first speech frame of a time window, the next speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the first speech frame of the target time window is the next speech frame of the current speech frame; for another example, when the reference speech frame is the last speech frame of a time window, the previous speech frame of the current speech frame may be used as the reference speech frame of the target time window, that is, the last speech frame of the target time window is the previous speech frame of the current speech frame, and so on, which will not be described herein.
In this application, the determination of the target time window and of the characteristic time window will be described mainly by taking the case in which the current speech frame is the last speech frame (i.e., the reference speech frame) of the corresponding target time window as an example. For example, suppose the continuously input voice data includes the 1st, 2nd, 3rd, ..., Nth speech frames, the current speech frame is the 200th speech frame, the reference speech frame is the last speech frame of the time window, and the size of the target time window is 100 speech frames (i.e., the target time window corresponding to the current speech frame includes 100 speech frames, i.e., K is 100). Then the time window that takes the 200th speech frame as its last speech frame and has a size of 100 may be determined as the target time window corresponding to the 200th speech frame, i.e., the 100 speech frames ending at the 200th speech frame (the 101st to 200th speech frames) are determined as the speech frames in the target time window corresponding to the 200th speech frame.
As another example, a target time window is illustrated here with reference to fig. 3, which is a schematic diagram illustrating the effect of a target time window according to an embodiment of the present application. As shown in (1) in fig. 3, each speech frame of the received voice data may be represented as one block. If the gray block shown as 301 in fig. 3 is determined as the current speech frame and the size of the preset target time window is 8 speech frames, then the 8 speech frames up to and including the frame indicated by 301 may be determined as the target time window corresponding to 301 (shown as 302 in fig. 3). As voice data continues to be input, if no command word is hit in the time window indicated by 302, a new current speech frame may be determined based on a sliding step; for example, when the sliding step is 1, the speech frame following the one indicated by 301 may be determined as the new current speech frame (as indicated by 303 in (2) of fig. 3), so that the 8 speech frames up to and including the frame indicated by 303 may be determined as the target time window corresponding to 303 (shown as 304 in fig. 3), and so on, thereby achieving detection of command words in the continuously input voice data.
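The sliding-window behaviour described above can be sketched as follows; this is an editorial illustration, with the window size of 8 frames and the sliding step of 1 borrowed from the fig. 3 example, and the function name chosen only for the sketch.

```python
def target_window_indices(current_idx: int, window_size: int) -> range:
    """Indices of the speech frames in the target time window whose last
    (reference) frame is the current frame. Frames are indexed from 0."""
    start = max(0, current_idx - window_size + 1)
    return range(start, current_idx + 1)

window_size, slide = 8, 1
current_idx = 7                                                  # e.g. block 301 in fig. 3
print(list(target_window_indices(current_idx, window_size)))    # frames 0..7 (window 302)
current_idx += slide                                             # no hit -> advance the window
print(list(target_window_indices(current_idx, window_size)))    # frames 1..8 (window 304)
```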
The audio features corresponding to the voice data of the K voice frames in the target time window are acquired, and the corresponding audio features can be determined for the voice data based on each voice frame. In one possible implementation, the audio feature may be an FBank feature (an audio feature of voice data). Specifically, if the voice data of one voice frame is a time domain signal, the corresponding FBank feature of the voice frame is obtained, and the time domain signal of the voice data of the one voice frame can be converted into a frequency domain signal through fourier transform, so that the corresponding FBank feature is determined based on the frequency domain signal obtained through calculation, which is not described herein. It will be appreciated that the audio feature may also be a feature determined based on other means, such as MFCC features (an audio feature of voice data), without limitation.
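As a rough, non-authoritative sketch of FBank extraction (the patent does not prescribe any particular library), the following uses librosa's Mel spectrogram followed by a logarithm; the 16 kHz sampling rate, 40 Mel bands and 25 ms/10 ms framing are illustrative assumptions.

```python
import numpy as np
import librosa

def fbank_features(wav: np.ndarray, sr: int = 16000, n_mels: int = 40) -> np.ndarray:
    """Log-Mel filterbank (FBank) features: STFT -> Mel filterbank -> log."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms window, 10 ms hop
    return np.log(mel + 1e-6).T          # one n_mels-dimensional feature vector per frame

wav = np.random.randn(16000).astype(np.float32)   # 1 s of dummy audio
feats = fbank_features(wav)                        # shape: (num_frames, 40)
```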
S202, determining a first command word hit by the voice data of the target time window in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames.
The first command word refers to a command word hit by the voice data in the target time window, and is also called a command word hit by the target time window, and the first command word belongs to the command word set. It can be understood that the premise of determining the first command word hit by the voice data of the target time window in the command word set is that the voice data of the target time window has the hit command word in the command word set, and if the voice data of the target time window does not have the hit command word in the command word set, the first command word hit by the voice data of the target time window in the command word set cannot be determined. The voice data of the target time window is short for voice data of K voice frames in the target time window, for example, the command word hit by the voice data of the target time window in the command word set may refer to the command word hit by the voice data of K voice frames of the target time window in the command word set; the command words that the voice data of the target time window hit in the command word set may also be briefly described as command words that the target time window hit in the command word set.
As described above, the command word set includes at least one command word, and any one of the command words in the command word set may have a plurality of syllables. Syllables are the most natural phonetic units perceived by hearing, and are formed by combining one or more phonemes according to a certain rule. In Mandarin, a Chinese character is a syllable except for individual cases, for example, 4 syllables are included in the command word "turn on air conditioner".
In one possible implementation manner, when determining the first command word hit by the voice data of the target time window in the command word set, a first confidence coefficient corresponding to each command word of the voice data of the target time window may be determined according to the audio features respectively corresponding to the voice data of the K voice frames, and then the hit first command word is determined based on the first confidence coefficient corresponding to each command word. Wherein each command word herein refers to each command word in the command word set described above. The first confidence level may characterize a likelihood that the speech data of the target time window is each command word, and each command word may have a corresponding first confidence level.
Specifically, determining the hit first command word based on the first confidence corresponding to each command word may be: if the command words with the first confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining the command words with the first confidence coefficient being greater than or equal to the first threshold value as first command words of the voice data of the target time window hit in the command word set. The first threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable first threshold may be set to determine the first command word. Optionally, in order to obtain better performance, different first thresholds may be set for command words of different lengths, so as to balance the detection rate and the false detection rate of the command words of different command lengths. It may be understood that, if there are a plurality of first confidence degrees greater than or equal to the first threshold, the command word corresponding to each first confidence degree greater than or equal to the first threshold may be determined as the first command word, that is, the number of the first command words may be a plurality.
If the command word set does not have the command word with the first confidence coefficient larger than or equal to the first threshold value, the voice data of the target time window is not provided with the command word.
For example, the command word set includes command word 1, command word 2, command word 3 and command word 4, and a first confidence corresponding to each command word is obtained according to the audio features of the K speech frames in the target time window: the first confidence corresponding to command word 1 is 0.3, that of command word 2 is 0.75, that of command word 3 is 0.45, and that of command word 4 is 0.66. If the first threshold is 0.6, command words whose first confidence is greater than or equal to the first threshold exist in the command word set, namely command word 2 and command word 4, so command word 2 and command word 4 are determined as the first command words hit by the voice data of the target time window in the command word set.
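A minimal sketch of this thresholding step is given below; the confidence values reproduce the example above, and the per-command-word threshold dictionary is an assumed way of expressing the optional per-length thresholds.

```python
def hit_command_words(confidences: dict[str, float],
                      thresholds: dict[str, float],
                      default_threshold: float = 0.6) -> list[str]:
    """Return every command word whose first confidence reaches its threshold.

    thresholds may assign a different first threshold per command word (e.g.
    per command word length) to balance detection and false-detection rates.
    """
    return [w for w, c in confidences.items()
            if c >= thresholds.get(w, default_threshold)]

confidences = {"command word 1": 0.30, "command word 2": 0.75,
               "command word 3": 0.45, "command word 4": 0.66}
print(hit_command_words(confidences, thresholds={}))
# ['command word 2', 'command word 4']
```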
In one possible implementation manner, if the voice data of the target time window does not have the hit command word, the subsequent operation may not be performed, so as to determine the target time window corresponding to the new current voice frame, further detect whether the audio data of the new target time window has the hit command word, and so on, so as to implement detection of whether the audio data of the target time window corresponding to each voice frame hits the command word. In addition, when no hit command word exists in the target time window, the subsequent step of secondary verification is not directly executed, so that the data processing efficiency is improved.
S203, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window.
The characteristic time window may be a time window for performing secondary verification on the command word, and the characteristic time window may include a plurality of voice frames. The characteristic time window and the target time window may include repeated speech frames, but the speech frames may not be identical or identical, which is not limited herein. The audio features corresponding to the voice data of the plurality of voice frames in the feature time window may be corresponding audio features determined based on the voice data of each voice frame, and the audio features may be FBank features, and the detailed description refers to the above description and is not repeated herein.
It can be understood that the precondition for executing step S203 is that the hit command word is detected in the voice data of the target time window, which is equivalent to determining a new time window (i.e. the feature time window) after the hit of the voice data of the target time window in the first command word is detected, so as to implement the secondary verification through the feature time window, and improve the accuracy of detecting the command word.
In one possible implementation, the range of the characteristic time window associated with the current speech frame needs to cover, as far as possible, the speech frames of the first command word in the voice data. Therefore, a first number of speech frames before the current speech frame can be determined based on the command word length (referred to as length for short) of the first command word, and the characteristic time window is determined according to the first number of speech frames before the current speech frame. The command word length refers to the number of syllables in the command word. For a typical Chinese command word, one character corresponds to one syllable; for example, the command word "turn on air conditioner" includes four characters and correspondingly 4 syllables, i.e., its command word length is 4. It will be appreciated that, if there are a plurality of first command words, the number of speech frames contained in the characteristic time window may be determined based on the command word length of the first command word having the largest length.
Specifically, the characteristic time window may be determined according to the command word length of the first command word and the target preset value. The method specifically comprises the following steps:
(1) and determining the first quantity according to the command word length of the first command word and the target preset value. The target preset value may be a preset value, because in general, the pronunciation of a word (a syllable) may involve a plurality of speech frames due to the pronunciation speed, etc., and the number of the speech frames that may be involved by a plurality of syllables of a command word is greater than or equal to the number of syllables of the command word, the first number may be determined by determining the target preset value, so that the size of the feature time window covers the speech frames involved by the first command word as much as possible. In one possible implementation manner, the first number may be obtained by multiplying the command word length of the first command word by the target preset value, so that the number of voice frames contained in the obtained feature time window is the first number. For example, if the length of the first command word is 4 and the target preset value is 25, the first number may be 4×25=100, i.e. 100 speech frames are included in the feature time window.
(2) A characteristic time window associated with the current speech frame is determined based on the first number of speech frames preceding the current speech frame. The first number of speech frames preceding the current speech frame include the current speech frame; that is, the characteristic time window associated with the current speech frame is determined from the first number of speech frames preceding the current speech frame, with the current speech frame taken as the last frame of the characteristic time window. For example, if the continuously input speech data includes the 1st, 2nd, 3rd, ..., Nth speech frames and the first number is 100, the 100 speech frames ending at the current speech frame constitute the characteristic time window associated with the current speech frame.
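The computation of the first number and of the characteristic time window that ends at the current frame can be sketched as follows; the function name and the target preset value of 25 frames per syllable are assumptions for illustration.

```python
def characteristic_window_before(current_idx: int,
                                 command_word_length: int,
                                 frames_per_syllable: int = 25) -> range:
    """Characteristic time window that ends at the current frame.

    first_number = command word length * target preset value, so the window
    covers the speech frames of the first command word as much as possible.
    """
    first_number = command_word_length * frames_per_syllable
    start = max(0, current_idx - first_number + 1)
    return range(start, current_idx + 1)

# A 4-syllable first command word with a target preset value of 25
window = characteristic_window_before(current_idx=199, command_word_length=4)
print(len(window))   # 100 speech frames, ending at the current frame
```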
In one possible implementation, the command word set includes command words with different command word lengths, and there may be command words with identical prefixes or confusable similar words; for example, "open heating" and "open heating mode" are two command words with identical prefixes but different indicated operations. In the actual processing, since voice data is input frame by frame, it is very likely that the current speech frame is the speech frame at which the input of "open heating" has just been completed, so that the target time window corresponding to the current speech frame hits "open heating", while the command word actually intended is "open heating mode". Therefore, a delay waiting policy is introduced when the characteristic time window is determined: a segment of speech frames after the current speech frame is also included in the characteristic time window, i.e., when the characteristic time window associated with the current speech frame is determined, a segment of speech frames after the current speech frame is also determined as speech frames in the characteristic time window. In this way, more accurate command word detection can be performed; although the delay waiting introduces a small recognition delay, the command word is recognized more accurately.
Specifically, determining the characteristic time window associated with the current speech frame based on the command word length of the first command word may include the following steps: (1) determining the first number according to the command word length of the first command word and the target preset value, for which reference may be made to the related description above and which is not repeated here; (2) determining the characteristic time window associated with the current speech frame based on the first number of speech frames preceding the current speech frame and a second number of speech frames following the current speech frame. The first number of speech frames preceding the current speech frame include the current speech frame, and the second number of speech frames following the current speech frame also include the current speech frame, but the plurality of speech frames in the characteristic time window include the current speech frame only once. The second number may be a preset value, which may be an empirical value, or may be determined according to the command word length of the longest command word in the command word set and the command word length of the first command word; specifically, the length difference may be obtained by subtracting the command word length of the first command word from the command word length of the longest command word, and the length difference is then multiplied by the target preset value. For example, if the command word length of the longest command word is 8 and the command word length of the first command word is 5, the length difference is 8-5=3; if the target preset value is 25, 3×25=75 is obtained, and the second number may be 75. As an illustration, if the continuously input voice data includes the 1st, 2nd, 3rd, ..., Nth speech frames, the first number is 125 (first command word length 5 multiplied by the target preset value 25) and the second number is 75, then the characteristic time window consists of the 125 speech frames ending at the current speech frame together with the 75 speech frames following it, with the current speech frame counted only once.
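The variant with the delay waiting policy can be sketched in the same style; again the function name, the preset value of 25 and the example lengths (first command word length 5, longest command word length 8) are illustrative assumptions.

```python
def characteristic_window_with_delay(current_idx: int,
                                     first_word_length: int,
                                     longest_word_length: int,
                                     frames_per_syllable: int = 25) -> range:
    """Characteristic time window spanning frames before and after the current
    frame (delay waiting policy). The current frame is counted only once."""
    first_number = first_word_length * frames_per_syllable
    second_number = (longest_word_length - first_word_length) * frames_per_syllable
    start = max(0, current_idx - first_number + 1)
    return range(start, current_idx + second_number)   # current frame counted once

window = characteristic_window_with_delay(current_idx=199,
                                          first_word_length=5,
                                          longest_word_length=8)
print(len(window))   # 199 frames (125 before/including the current frame + 75 after,
                     # with the current frame counted only once)
```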
In one possible implementation, when the electronic device receives continuously input voice data, the audio features of each speech frame can be extracted and cached in a storage area; after the characteristic time window is determined, the audio features corresponding to the speech frames in the characteristic time window can be read directly from the storage area. This improves the efficiency of data processing, because the audio features of the speech frames do not need to be repeatedly calculated. It can be understood that the number of audio features kept in the cache can be determined according to the number of speech frames in the maximum characteristic time window, so that for a characteristic time window determined based on any first command word, the audio features of its speech frames can be quickly acquired from the storage area. The maximum characteristic time window may be the characteristic time window determined based on the command word length of the command word with the largest length among the command words. It will also be appreciated that, to avoid caching too much data, each time a new speech frame is input, the audio features of the oldest buffered speech frame may be deleted, thereby avoiding waste of storage space.
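One simple way to realize such a bounded feature cache is a fixed-length deque, as in the sketch below; the class name, the 40-dimensional features and the capacity of 200 frames (8 syllables × 25) are assumptions for illustration.

```python
from collections import deque
import numpy as np

class FeatureCache:
    """Bounded buffer of per-frame audio features.

    max_frames would typically equal the number of frames in the maximum
    characteristic time window; when the buffer is full, pushing a new frame
    automatically discards the features of the oldest buffered frame.
    """
    def __init__(self, max_frames: int):
        self.buffer = deque(maxlen=max_frames)

    def push(self, frame_features: np.ndarray) -> None:
        self.buffer.append(frame_features)

    def last(self, n: int) -> list[np.ndarray]:
        """Features of the n most recent frames (for a characteristic window)."""
        return list(self.buffer)[-n:]

cache = FeatureCache(max_frames=200)          # e.g. longest command word: 8 * 25 frames
for _ in range(500):                          # continuously input speech frames
    cache.push(np.random.randn(40))           # 40-dim FBank features per frame
window_feats = cache.last(100)                # features for a 100-frame window
```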
S204, determining a second command word hit by the voice data of the characteristic time window in the command word set based on the audio characteristics corresponding to the voice data of the voice frames in the characteristic time window.
The second command word refers to a command word hit by the voice data in the characteristic time window, and the second command word belongs to the command word set. Alternatively, after the second command word is determined, the operation indicated by the second command word may be performed. The voice data of the feature time window is short for voice data of voice frames in the feature time window, for example, a command word hit by the voice data of the feature time window in the command word set may refer to a command word hit by voice data of a plurality of voice frames of the feature time window in the command word set; the second command word that the voice data of the characteristic time window hit in the command word set may also be briefly described as the second command word that the characteristic time window hit in the command word set.
It can be understood that the premise of determining that the voice data of the feature time window hits the second command word in the command word set is that the voice data of the feature time window has the hit command word in the command word set, and if the voice data of the feature time window does not have the hit command word in the command word set, the second command word that the voice data of the feature time window hits in the command word set cannot be determined. Therefore, whether the voice data of the characteristic time window hit the command word set or not can be detected, which is equivalent to carrying out secondary verification on whether the voice data hit the command word set or not, so that the detection result of the voice data of the characteristic time window is taken as a final detection result, and if a second command word hit in the command word set is detected, the operation indicated by the second command word is executed. For example, if the second command word "open heating" of the voice data hits of the characteristic time window is detected, an operation of opening heating may be performed.
In one possible implementation, if the voice data of the characteristic time window does not hit a second command word in the command word set, no operation is performed. A target time window for a new current speech frame can then be determined and the above steps repeated, until it is determined, based on the audio features of the speech frames in the characteristic time window associated with the new current speech frame, whether the voice data of that characteristic time window hits a command word in the command word set; in this way, the time windows corresponding to the successive speech frames are detected one by one.
In one possible implementation, when the second command word hit in the feature time window is detected, the second command word may be used for other purposes, such as training other models through the extracted command word, storing the extracted command word, and the like, which are not limited herein.
In a possible embodiment, a command word may further include time information, location information, and the like, so that the corresponding operation may be performed at the time indicated by the time information and at the location indicated by the location information of the detected second command word. For example, when the detected target command word is "turn on the air conditioner at 10 o'clock", where 10 o'clock is the time information of the command word, the operation of turning on the air conditioner may be performed at 10 o'clock.
An example is presented here of how command word detection of voice data is implemented. First, voice data is received and the target time window corresponding to the current speech frame in the received voice data is determined; whether the target time window hits a command word in the command word set is then determined, specifically on the basis of the audio features of the voice data of each speech frame in the target time window. If the target time window does not hit a command word in the command word set, no operation is executed and the target time window of a new current speech frame is determined. If the target time window hits a command word in the command word set, secondary verification is performed: the characteristic time window associated with the current speech frame is determined, and whether the characteristic time window hits a command word in the command word set is determined. If the characteristic time window does not hit a command word in the command word set, no operation is performed and the target time window of a new current speech frame is determined; if the characteristic time window hits a command word in the command word set, the operation indicated by the hit command word is performed. In this way, by determining a characteristic time window to realize secondary verification, the accuracy of detecting command words in the voice data can be improved.
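Purely as an editorial summary of this flow, the sketch below strings the two stages together; score_window() is a placeholder for the confidence computation (e.g. the detection networks of the embodiments), the use of the character count as the command word length is an assumption that holds only for typical Chinese command words, and the delay waiting policy is omitted for brevity.

```python
def detect_command(frames, command_words, first_threshold=0.6,
                   second_threshold=0.6, window_size=100, frames_per_syllable=25):
    """Two-stage detection sketch: target-time-window hit, then secondary
    verification on a characteristic time window sized from the hit word."""
    def score_window(window_frames, word):
        # Placeholder confidence model; a real system would run a network
        # over the audio features of the frames in the window.
        return 0.0

    for current in range(len(frames)):
        target = frames[max(0, current - window_size + 1):current + 1]
        first_hits = [w for w in command_words
                      if score_window(target, w) >= first_threshold]
        if not first_hits:
            continue                              # no hit: slide to the next frame
        # Size the characteristic window from the longest hit command word;
        # len(word) approximates the syllable count for Chinese command words.
        first_number = len(max(first_hits, key=len)) * frames_per_syllable
        feature_win = frames[max(0, current - first_number + 1):current + 1]
        second_hits = [w for w in command_words
                       if score_window(feature_win, w) >= second_threshold]
        if second_hits:
            return second_hits[0]                 # command word whose operation is executed
    return None

# Example call with dummy frames and a small command word set
result = detect_command(frames=[None] * 300,
                        command_words=["打开空调", "打开暖气", "打开暖气模式"])
print(result)                                     # None with the placeholder model
```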
In one possible scenario, the present application may be applied to detecting whether received voice data hits a command word in the event that an electronic device has been awakened. Namely, after the electronic equipment is awakened by the voice initiating object through the awakening word, detecting the hit command word based on the received voice data.
In one possible scenario, the present application may also be applied to a scenario in which the electronic device does not need to be awakened, that is, the electronic device directly determines whether to hit the command word according to the received voice data without being awakened by the wake word, which is equivalent to awakening the electronic device and executing the operation indicated by the command word when detecting that the received voice data hits the command word in the command word set. The method and the device have the advantages that the command words in the command word set are preset, the electronic equipment is triggered to execute the corresponding operation only when the voice data contains the command words, and the accuracy of detecting the command words is high, so that the voice initiating object can instruct the electronic equipment to execute the corresponding operation more quickly through the voice instruction, and the equipment does not need to be awakened first and then the instruction is issued. It can be understood that, in order to reduce the false recognition rate of the command words, some words which are not very commonly used may be set in the preset command words, or the false recognition rate of the command words may be reduced by adding the word groups which are not very commonly used to the command words, so that the interaction experience may be greatly improved.
According to the data processing scheme, a first command word hit by the voice data of the target time window in the command word set is determined from the audio features respectively corresponding to the voice data of the K speech frames in the target time window, which amounts to preliminarily determining the command word contained in the continuously input voice data. A characteristic time window is then determined based on feature information of the first command word, such as its command word length, and a second command word hit by the voice data of the characteristic time window in the command word set is determined from the audio features respectively corresponding to the voice data of the speech frames in the characteristic time window; this amounts to determining a new characteristic time window and verifying a second time whether the continuously input voice data contains a command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. In this way, after the command word hit by the voice data is preliminarily determined based on the target time window, a new characteristic time window is determined to verify whether the voice data contains the command word, so that the accuracy of command word detection on voice data can be improved.
Referring to fig. 4, fig. 4 is a flowchart illustrating another data processing method according to an embodiment of the present application. The method may be performed by the electronic device described above. The data processing method may include the following steps.
S401, determining a target time window corresponding to the current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window.
The description of step S401 may refer to the description of step S201, which is not described herein.
S402, determining a first command word hit by the voice data of the target time window in the command word set according to the audio features respectively corresponding to the voice data of the K voice frames.
In one possible implementation, each command word in the command word set has a corresponding syllable identification sequence, that is, a sequence consisting of the syllable identifications of the syllables that the command word has, where a syllable identification is used to characterize a syllable. In a possible implementation manner, the syllable identification sequence of each command word can be determined through a pronunciation dictionary, which is a dictionary obtained by preprocessing and contains the mapping relationship between each word in the command words and the syllable identifications of its syllables; thus the syllable identifications of the syllables of each command word, i.e. the syllables of the command word, can be determined according to the pronunciation dictionary. It will be appreciated that different words may have the same syllable, e.g. the command words "play song" and "cancel heating" both include the syllable "qu".
In a possible implementation, as described above, the command word set includes at least one command word, each command word having a plurality of syllables, and step S402 may include the steps of:
(1) according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different. The syllable output unit set is a set of classification items capable of classifying syllables corresponding to voice data of each voice frame, and includes a plurality of output units. For example, if the syllable output unit set includes the syllable output unit A, B, C, the voice data representing each voice frame can be classified as A, B or C, so that the probability that K voice frames respectively correspond to the syllable output unit A, B, C can be determined. The syllable output unit set determined based on the syllables of each command word may be a syllable output unit set determined based on the syllable identifications of the syllables of each command word, specifically, a union of syllable identifications of the syllables of each command word is determined, and each syllable identification of the union of syllable identifications corresponds to one syllable output unit.
In one embodiment, the syllable output unit set further includes a garbage syllable output unit, so that syllables not belonging to any command word in the command word set can be classified into the garbage syllable output unit in the subsequent classification process. For example, when the command word set includes command words 1, 2 and 3, the syllable identifications of the syllables of command word 1 are s1, s2, s3 and s4, the syllable identifications of the syllables of command word 2 are s1, s4, s5 and s6, and the syllable identifications of the syllables of command word 3 are s7, s2, s3 and s1. The union of the syllable identifications of the syllables of command words 1-3 is therefore s1, s2, s3, s4, s5, s6 and s7, and a syllable output unit is obtained for each of s1-s7; these syllable output units, together with the garbage syllable output unit, are determined as the syllable output unit set.
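As an illustration of how the syllable output unit set in the example above could be assembled, here is a small Python sketch; the pronunciation dictionary contents, the command word names and the garbage-unit marker are all made-up assumptions.

```python
# Hypothetical word -> syllable identification mapping (a toy pronunciation dictionary).
PRONUNCIATION_DICT = {
    "command_word_1": ["s1", "s2", "s3", "s4"],
    "command_word_2": ["s1", "s4", "s5", "s6"],
    "command_word_3": ["s7", "s2", "s3", "s1"],
}

GARBAGE_UNIT = "<garbage>"  # collects syllables that do not belong to any command word

def build_output_units(command_words):
    """Union of the syllable identifications of all command words, plus the garbage unit."""
    units = []
    for word in command_words:
        for syllable_id in PRONUNCIATION_DICT[word]:
            if syllable_id not in units:   # union: each syllable identification appears once
                units.append(syllable_id)
    units.append(GARBAGE_UNIT)
    return units

# build_output_units(["command_word_1", "command_word_2", "command_word_3"])
# -> ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "<garbage>"]
```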
(2) Determining a first confidence of the voice data of the target time window corresponding to each command word according to the probabilities that the K voice frames respectively correspond to each syllable output unit. The first confidence of any command word can be determined from the maximum probability corresponding to each syllable of the command word, namely, the first confidence is determined according to the product of the maximum probabilities respectively corresponding to the syllables of the command word.
(3) If the command words with the first confidence coefficient being greater than or equal to the first threshold value exist in the command word set, determining the command words with the first confidence coefficient being greater than or equal to the first threshold value as first command words of the voice data of the target time window hit in the command word set. This step may be referred to the above description and will not be repeated here.
In one possible implementation manner, any command word in the command word set is represented as a target command word. Determining, according to the probabilities that the K voice frames respectively correspond to each syllable output unit, the first confidence of the voice data of the target time window corresponding to each command word may specifically include the following steps: (1) determining the syllable output unit corresponding to each syllable of the target command word as a target syllable output unit, and obtaining a plurality of target syllable output units corresponding to the target command word. A target syllable output unit is the syllable output unit corresponding to a syllable of the target command word, and can be determined through the syllable identification sequence of the target command word: since each syllable output unit has a corresponding syllable, the target syllable output units can be selected from the plurality of syllable output units through the syllable identifications in that sequence. For example, if the target command word is "turn on heating", the syllable identifications s1, s2, s3 and s4 of its syllables (i.e., the syllable identification sequence of the target command word) can be determined from the pronunciation dictionary, and the syllable output units corresponding to s1, s2, s3 and s4 can be determined from the syllable output unit set as the target syllable output units according to this sequence.
(2) Determining, from the probabilities that the K voice frames respectively correspond to each syllable output unit, the probabilities that the K voice frames respectively correspond to each target syllable output unit, and obtaining K candidate probabilities corresponding to each target syllable output unit. A candidate probability is the probability that a target syllable output unit corresponds to one of the voice frames. For example, if the target syllable output units are the syllable output units corresponding to s1, s2, s3 and s4 (denoted here as syllable output units s1, s2, s3 and s4), the probabilities of s1 for the K speech frames, of s2 for the K speech frames, of s3 for the K speech frames, and of s4 for the K speech frames can be determined, giving a total of K×4 candidate probabilities.
(3) Determining, from the K candidate probabilities corresponding to each target syllable output unit, the maximum candidate probability corresponding to each target syllable output unit, and determining the first confidence of the voice data of the target time window corresponding to the target command word according to the maximum candidate probabilities respectively corresponding to the target syllable output units. Specifically, the first confidence of the voice data of the target time window corresponding to the target command word is determined according to the product of the maximum candidate probabilities respectively corresponding to the target syllable output units; for example, the product of these candidate probabilities can be directly determined as the first confidence, or the first confidence can be obtained through other mathematical calculation, which is not limited herein. For example, the probabilities that s1 corresponds to the K speech frames are {G1_1, G1_2, G1_3, ..., G1_K}, where the maximum is the probability G1_10 corresponding to the 10th voice frame in the target time window; the probabilities that s2 corresponds to the K speech frames are {G2_1, G2_2, G2_3, ..., G2_K}, where the maximum is the probability G2_25 corresponding to the 25th speech frame; the probabilities that s3 corresponds to the K speech frames are {G3_1, G3_2, G3_3, ..., G3_K}, where the maximum is the probability G3_34 corresponding to the 34th speech frame; the probabilities that s4 corresponds to the K speech frames are {G4_1, G4_2, G4_3, ..., G4_K}, where the maximum is the probability G4_39 corresponding to the 39th speech frame. The first confidence of the voice data of the target time window corresponding to the target command word is then determined according to the product of G1_10, G2_25, G3_34 and G4_39. It will be appreciated that performing the above operations on each command word in the command word set determines a first confidence for each command word.
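The worked example above can be expressed compactly as follows. This Python sketch assumes the per-frame probabilities are available as a K-by-U matrix, and the plain product (with no further normalization) follows the "can be directly determined as the first confidence" option mentioned above; the names are hypothetical.

```python
import numpy as np

def first_confidence(probs: np.ndarray, unit_index: dict, syllable_ids: list) -> float:
    """probs[j, u]: probability of syllable output unit u for the j-th of the K frames."""
    target_cols = [unit_index[s] for s in syllable_ids]   # columns of the target syllable output units
    max_per_unit = probs[:, target_cols].max(axis=0)      # maximum candidate probability per target unit
    return float(np.prod(max_per_unit))                   # product over the target syllable output units

# Example: 40 frames, units s1..s4 plus garbage in columns 0..4.
# probs = np.random.dirichlet(np.ones(5), size=40)
# first_confidence(probs, {"s1": 0, "s2": 1, "s3": 2, "s4": 3}, ["s1", "s2", "s3", "s4"])
```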
In one possible implementation manner, the first confidence that the voice data of the target time window corresponds to the target command word is determined according to the maximum candidate probability respectively corresponding to each target syllable output unit, and the calculation can be performed by the following formula (formula 1):
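A plausible LaTeX reconstruction of formula 1, based on the symbol definitions that follow (the surrounding text guarantees only the product of the per-unit maximum candidate probabilities; any further normalization, such as an (n-1)-th root, is not stated and is therefore omitted here):

```latex
C \;=\; \prod_{i=1}^{n-1} \; \max_{1 \le j \le K} \, p_{ij} \qquad \text{(formula 1)}
```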
wherein C represents the first confidence of the audio data of the target time window corresponding to the target command word; n-1 represents the number of target syllable output units corresponding to the target command word, and n represents the total number of target syllable output units and garbage syllable output units; i denotes the i-th target syllable output unit and j the j-th speech frame of the target time window, so that p_ij represents the probability of the i-th target syllable output unit for the j-th speech frame, and max_j p_ij represents the maximum candidate probability of the i-th target syllable output unit over the speech frames. The product term expresses the product of the maximum candidate probabilities respectively corresponding to the target syllable output units, so that the first confidence of the audio data of the target time window corresponding to the target command word can be obtained based on formula 1.
In a possible implementation manner, the first command word is determined by a trained primary detection network; how the primary detection network determines the first command word may refer to the above description and is not repeated herein. In one implementation, the trained primary detection network may be divided into an acoustic model and a confidence generation module. The acoustic model is used to perform the above step of determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probabilities that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set. The acoustic model is typically implemented with a deep neural network, such as a DNN model (a neural network model), a CNN model (a neural network model) or an LSTM model (a neural network model), which is not limited herein. The confidence generation module may be configured to perform the above step of determining, based on the probabilities that the K voice frames respectively correspond to each syllable output unit, the first confidence of the voice data of the target time window corresponding to each command word, which is not described in detail herein. Optionally, the dimension of the result output by the primary detection network is the number of command words in the command word set, and each dimension corresponds to the first confidence of one command word.
For example, referring to fig. 5, fig. 5 is a schematic diagram of a framework of a first-level detection network according to an embodiment of the application. As shown in fig. 5, the voice data of the K voice frames in the target time window may first be acquired (as shown at 501 in fig. 5), the audio feature of each voice frame is then determined based on 501 (as shown at 502 in fig. 5), and the audio feature of each voice frame is input into the acoustic model in the trained primary detection network (as shown at 503 in fig. 5). The result obtained from the acoustic model is then input into the confidence generation module (as shown at 504 in fig. 5), so that the confidence generation module, in combination with the pronunciation dictionary (as shown at 505 in fig. 5), determines the target syllable output units corresponding to the syllables of each command word, and further determines the first confidence corresponding to each command word, such as the command word 1 confidence, the command word 2 confidence, ..., the command word m confidence, whereby the first command word hit by the audio data of the target time window can be obtained. It will be appreciated that if no command word has a first confidence greater than or equal to the first threshold, the audio data of the target time window has no hit first command word.
In one possible implementation manner, before the first command word is determined through the trained primary detection network, the primary detection network needs to be trained, which specifically may include the following steps:
(1) First sample voice data carrying syllable output unit labels is acquired. The first sample voice data is voice data used for training the primary detection network; it may be voice data containing a command word, i.e. positive sample data, or voice data not containing a command word, i.e. negative sample data, so that training on both positive and negative sample data yields a better training effect. The syllable output unit labels are used to label the syllable output unit actually corresponding to each voice frame in the first sample voice data. It can be understood that if a voice frame in the first sample voice data actually corresponds to a syllable of some command word in the command word set, the syllable output unit actually corresponding to that voice frame is the syllable output unit corresponding to that syllable; if the voice frame does not correspond to a syllable of any command word in the command word set, the syllable output unit actually corresponding to that voice frame is the garbage syllable output unit.
(2) An initial primary detection network is called to determine the predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data. The initial primary detection network also includes an acoustic model; determining the predicted syllable output units may be done by first determining, through the acoustic model in the initial primary detection network and according to the audio feature corresponding to the voice data of each voice frame in the first sample voice data, the probability that each voice frame corresponds to each syllable output unit in the syllable output unit set, and then determining the predicted syllable output unit based on these probabilities. The audio feature corresponding to the voice data of each voice frame in the first sample voice data is calculated in the same way as the audio feature corresponding to each voice frame in the target time window, which is not repeated herein.
(3) Training is performed based on the predicted syllable output units respectively corresponding to the voice data of each voice frame in the first sample voice data and the syllable output unit labels, to obtain the trained primary detection network. During training, the network parameters of the initial primary detection network are adjusted so that the predicted syllable output unit corresponding to each voice frame gradually approaches the actual syllable output unit marked by the syllable output unit label, so that the trained primary detection network can accurately predict the probability of each voice frame corresponding to each syllable output unit. It will be appreciated that the predicted syllable output units here are determined by the acoustic model in the primary detection network, that is, training the primary detection network primarily adjusts the model parameters of the acoustic model in the primary detection network.
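As a rough illustration of this training procedure, the sketch below trains a small frame-level classifier over the syllable output units with cross-entropy against the syllable output unit labels; the architecture, feature dimension and optimizer settings are illustrative assumptions rather than details from the embodiments.

```python
import torch
import torch.nn as nn

NUM_UNITS = 8   # e.g. 7 syllable output units + 1 garbage syllable output unit (assumed)
FEAT_DIM = 40   # e.g. 40-dimensional per-frame audio feature (assumed)

acoustic_model = nn.Sequential(           # a small DNN acoustic model
    nn.Linear(FEAT_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, NUM_UNITS),            # one logit per syllable output unit
)
optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(frame_features: torch.Tensor, unit_labels: torch.Tensor) -> float:
    """frame_features: (num_frames, FEAT_DIM); unit_labels: (num_frames,) int64 syllable output unit labels."""
    logits = acoustic_model(frame_features)   # predicted syllable output unit scores per frame
    loss = criterion(logits, unit_labels)     # compare predictions with the syllable output unit labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```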
In one possible implementation, if determining whether a command word is hit is implemented by a Keyword/Filler HMM model (a command word detection model), the primary detection network may be the Keyword/Filler HMM model. The probabilities that the K speech frames respectively correspond to each syllable output unit in the syllable output unit set can be determined according to the audio features respectively corresponding to the speech data of the K speech frames; the optimal decoding path is then determined based on the probabilities corresponding to the syllable output units, and whether a command word is hit is determined by checking whether the optimal decoding path passes through the HMM path (hidden Markov path) of a command word; alternatively, the confidence corresponding to each HMM path is determined based on the probabilities corresponding to the syllable output units so as to determine whether a command word is hit, which is not limited herein. It is understood that an HMM path may be a command word HMM path or a filler HMM path, where each command word HMM path may be composed of the HMM states corresponding to the syllables of a command word connected in series, and the filler HMM path is composed of HMM states corresponding to a set of carefully designed non-command-word pronunciation units. The confidence corresponding to each HMM path can thus be determined based on the probabilities corresponding to the syllable output units, thereby determining whether a command word is hit and which one.
S403, determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window.
The description of step S403 may refer to the description of step S203, which is not described herein.
Alternatively, the first number may be determined in other manners: for example, the first number may be a preset number, or the first number may be determined according to the earliest occurrence time of the first command word hit in the target time window, which is not limited herein. The characteristic time window associated with the current speech frame is then determined based on the first number.
In one possible embodiment, as mentioned above, the first number may be a preset number. The preset number should cover the first command word as far as possible, and may therefore be set based on the longest command word length in the command word set. Specifically, the preset number may be determined based on the longest command word length and the target preset value, the preset number is taken as the first number, and the characteristic time window is then determined according to the first number of voice frames before the current voice frame.
In one possible implementation manner, the first number may also be determined according to the earliest occurrence of the first command word hit in the target time window, and determining the characteristic time window may then specifically include the following steps: (1) acquiring the syllable output unit set, where the syllable output unit set is determined based on the plurality of syllables of each command word, and different syllable output units correspond to different syllables. (2) Determining, according to the audio features respectively corresponding to the voice data of the K voice frames, the probabilities that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set. For (1) and (2), refer to the descriptions above, which are not repeated herein. (3) Determining the syllable output units corresponding to the syllables of the command word hit by the voice data of the target time window as verification syllable output units, and determining, for each verification syllable output unit, the voice frame with the highest probability among the K voice frames as a target voice frame. A target voice frame is the voice frame among the K voice frames in which a syllable of the first command word is detected, so the occurrence time of the first command word can be determined from it. (4) Determining the characteristic time window associated with the current voice frame according to the voice frames between the target voice frame and the current voice frame. Specifically, the characteristic time window may be determined according to the target voice frame separated from the current voice frame by the largest number of voice frames: that target voice frame represents the earliest occurrence of the first command word in the target time window, the first number is the number of voice frames between it and the current voice frame, and the voice frames between the current voice frame and that target voice frame are determined as the voice frames in the characteristic time window. It can be understood that the voice frames between the current voice frame and the target voice frame include the current voice frame and the target voice frame themselves. In this way a more accurate characteristic time window can be determined, so that command word detection on the voice data in the characteristic time window is more accurate. For example, suppose the continuously input voice data includes the 1st, 2nd, 3rd and subsequent voice frames.
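A small sketch of steps (3)-(4) above, under the assumption that the per-frame probabilities for the target time window are available as a matrix; the names and column layout are hypothetical.

```python
import numpy as np

def feature_window_start(probs: np.ndarray, verify_cols: list) -> int:
    """probs[j, u]: syllable output unit probabilities for the j-th of the K frames in the target window.
    verify_cols: columns of the verification syllable output units of the hit command word.
    Returns the index (within the K frames) of the earliest target voice frame."""
    best_frames = [int(np.argmax(probs[:, c])) for c in verify_cols]  # one target voice frame per unit
    return min(best_frames)   # earliest occurrence = farthest from the current (last) frame

# The feature time window then covers the frames from this index up to and
# including the current (last) frame of the target time window.
```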
S404, determining a second confidence coefficient of the voice data of the characteristic time window and each command word according to the audio characteristics respectively corresponding to the voice data of the voice frames in the characteristic time window.
Wherein each command word herein refers to each command word in the command word set described above. The second confidence level may characterize a likelihood that the speech data of the feature time window is each command word, and each command word may have a corresponding second confidence level.
In one possible implementation, when determining the second confidence level of the voice data of the feature time window corresponding to each command word, the second confidence level of the voice data of the feature time window corresponding to the garbage class may also be determined, that is, the likelihood that the voice data of the feature time window is not a command word is represented by the second confidence level of the garbage class.
S405, if the command words with the second confidence coefficient being greater than or equal to a second threshold value exist in the command word set, determining the command words with the second confidence coefficient being greater than or equal to the second threshold value and the second confidence coefficient being the largest as the second command words hit in the command word set by the voice data of the characteristic time window, and executing the operation indicated by the second command words.
The second threshold may be a preset threshold, and in order to improve the detection accuracy of the command word, a reasonable second threshold may be set to determine the second command word. It will be appreciated that if no command word with a second confidence level greater than or equal to the second threshold value exists in the command word set, it is determined that the voice data of the feature time window does not have a hit second command word in the command word set. Alternatively, after the second command word is determined, the operation indicated by the second command word may be performed.
In one possible implementation manner, if, when the second confidences are determined, the second confidence of the voice data of the feature time window corresponding to the garbage class is also determined, then the largest second confidence may be determined among the second confidences other than the one corresponding to the garbage class; if that largest second confidence is greater than or equal to the second threshold, the command word corresponding to it is determined as the hit second command word, and if it is less than the second threshold, the voice data of the feature time window is classified as garbage, that is, the voice data of the feature time window has no hit second command word in the command word set.
For example, the command word set includes command word 1, command word 2, command word 3 and command word 4, and based on the audio features in the feature time window, the second confidence corresponding to each command word is obtained: the second confidence corresponding to command word 1 is 0.3, the second confidence corresponding to command word 2 is 0.73, the second confidence corresponding to command word 3 is 0.42, the second confidence corresponding to command word 4 is 0.58, and the second confidence corresponding to the garbage class is 0.61. If the preset second threshold is 0.6, a command word whose second confidence is greater than or equal to the second threshold exists in the command word set, namely command word 2, and command word 2 is the second command word hit in the command word set by the voice data of the feature time window; that is, the input voice data hits command word 2, so that the operation indicated by command word 2 can be executed. If the preset second threshold is 0.75, no command word whose second confidence is greater than or equal to the second threshold exists in the command word set, and it is determined that the voice data of the feature time window has no hit command word in the command word set, that is, the voice data of the feature time window is classified as garbage; a new current voice frame is then determined, and the above steps are repeated to achieve command word detection.
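The threshold judgment in this example can be sketched as follows; the command word names and the "garbage" key are placeholders, and the numbers simply reproduce the example above.

```python
def pick_second_command_word(confidences: dict, threshold: float):
    """Largest second confidence excluding the garbage class, accepted only if it reaches the threshold."""
    candidates = {w: c for w, c in confidences.items() if w != "garbage"}
    best_word = max(candidates, key=candidates.get)
    if candidates[best_word] >= threshold:
        return best_word        # second command word hit by the voice data of the feature time window
    return None                 # classified as garbage: no command word hit

scores = {"command_word_1": 0.3, "command_word_2": 0.73,
          "command_word_3": 0.42, "command_word_4": 0.58, "garbage": 0.61}
print(pick_second_command_word(scores, 0.6))    # -> command_word_2
print(pick_second_command_word(scores, 0.75))   # -> None
```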
In one possible implementation, the second command word is determined by a trained secondary detection network, which may be a deep neural network, such as a CLDNN model (a neural network model). How the secondary detection network determines the second command word may refer to the related descriptions of steps S404-S405, which are not repeated herein. In one implementation, when the secondary detection network is called to determine the hit second command word according to the audio features respectively corresponding to the voice data of the plurality of voice frames in the feature time window, the voice data of the plurality of voice frames in the feature time window can be input in sequence, so as to obtain the second confidence of the voice data of the feature time window corresponding to each command word. Optionally, the dimension of the result output by the secondary detection network is the number of command words in the command word set plus 1, where the added 1 is the dimension of the second confidence corresponding to the garbage class.
In one possible implementation, before determining the second command word through the trained secondary detection network, the secondary detection network needs to be trained, which may specifically include the following steps:
(1) Second sample voice data is acquired, wherein the second sample voice data carries command word labels. The second sample voice data refers to voice data used for training the secondary detection network, and may be positive sample data or negative sample data. The positive sample data may be audio data in feature time windows determined based on the trained primary detection network. The negative sample data may be voice data containing various non-command words. The negative sample data may also be audio data with interference noise, such as synthesized or real audio data with noise such as music or television added in various far-field environments, so that the accuracy of command word detection in far-field or noisy environments can be improved. It can be understood that, in the training process of the primary detection network, the adopted negative sample data does not include audio data with various kinds of interference noise, because training the primary detection network on such audio data would degrade its classification of the syllable output units; instead, the audio data with interference noise is introduced when training the secondary detection network, so that the accuracy of command word detection in the presence of interference factors is improved, the shortcoming of the primary detection network is effectively compensated, and the secondary detection network has good complementarity to the primary detection network. The command word label marks the command word actually corresponding to the second sample voice data; it can be understood that if the second sample voice data actually has a corresponding command word, the command word label marks that command word, and if the second sample voice data actually has no corresponding command word, the command word label marks that the second sample voice data actually belongs to the garbage class.
(2) A secondary detection network is called to determine the predicted command word corresponding to the second sample voice data. Determining the predicted command word may be performed in an initial secondary detection network: specifically, the second confidence of the second sample voice data corresponding to each command word is determined according to the audio features respectively corresponding to the voice data of each voice frame in the second sample voice data, and the predicted command word corresponding to the second sample voice data is then determined based on the second confidences corresponding to the command words. The audio feature corresponding to the voice data of each voice frame in the second sample voice data is calculated in the same way as the audio feature corresponding to each voice frame in the target time window, which is not repeated herein.
(3) Training based on the predicted command words and the command word labels to obtain a trained secondary detection network. In the training process, the network parameters of the initial secondary detection network are adjusted so that the predicted command words corresponding to the second sample voice data are gradually similar to the actual corresponding command words marked by the command word labels, and therefore the trained secondary detection network can accurately predict the command words corresponding to the voice data in each characteristic time window.
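As a rough illustration of this training procedure, the sketch below trains a window-level classifier over the command words plus a garbage class; a GRU is used here merely as a stand-in for the CLDNN mentioned above, and all sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 5   # e.g. 4 command words + 1 garbage class (assumed)
FEAT_DIM = 40     # per-frame audio feature dimension (assumed)

class SecondaryNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, 64, batch_first=True)
        self.out = nn.Linear(64, NUM_CLASSES)

    def forward(self, frames):          # frames: (batch, num_frames, FEAT_DIM)
        _, h = self.rnn(frames)         # summarize the feature time window
        return self.out(h[-1])          # one logit per command word / garbage class

model = SecondaryNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(window_feats: torch.Tensor, word_labels: torch.Tensor) -> float:
    """window_feats: (batch, num_frames, FEAT_DIM); word_labels: (batch,) int64 command word labels."""
    loss = criterion(model(window_feats), word_labels)   # compare predictions with the command word labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```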
Optionally, the present application may further determine a third confidence of the voice data of the feature time window corresponding to each command word based on the first confidence corresponding to each command word and the second confidence corresponding to each command word, and further determine the second command word hit by the voice data of the feature time window based on the third confidences. If a command word whose third confidence is greater than or equal to a preset threshold exists in the command word set, the command word whose third confidence is greater than or equal to the threshold and is the largest is determined as the second command word hit in the command word set by the voice data of the feature time window. If no command word whose third confidence is greater than or equal to the threshold exists in the command word set, no operation is executed, and a target time window of a new current voice frame is determined. Therefore, the finally hit command word can be determined by combining the first confidence and the second confidence, and the accuracy of command word detection can be improved.
In one possible implementation manner, the determining the third confidence level of the voice data of the feature time window and the corresponding third confidence level of each command word based on the first confidence level corresponding to each command word and the second confidence level corresponding to each command word respectively may specifically be: and performing splicing processing based on the first confidence coefficient corresponding to each command word and the second confidence coefficient corresponding to each command word to obtain verification features, and further determining a third confidence coefficient corresponding to the voice data of the feature time window and each command word based on the verification features. Determining a third confidence level of the speech data of the feature time window corresponding to each command word based on the verification feature may be based on a trained neural network, such as a simple multi-layer DNN network (a neural network model).
In one possible implementation manner, the third confidence coefficient of the voice data corresponding to each command word in the feature time window is determined based on the first confidence coefficient corresponding to each command word and the second confidence coefficient corresponding to each command word, and the third confidence coefficient of the voice data corresponding to each command word in the feature time window may be obtained by performing mathematical calculation on the first confidence coefficient corresponding to each command word and the second confidence coefficient, for example, the third confidence coefficient corresponding to the command word may be determined based on an average value or a weighted average value of the first confidence coefficient and the second confidence coefficient. Alternatively, since the speech frames covered by the feature time window may be more accurate, a higher weight may be determined for the second confidence level when determining a weighted average of the first confidence level and the second confidence level.
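One simple way to realize the weighted-average variant described above is sketched below; the specific weight of 0.7 for the second confidence is an illustrative assumption.

```python
def third_confidence(first: float, second: float, second_weight: float = 0.7) -> float:
    """Weighted average of the first and second confidences, weighting the second confidence higher."""
    return (1.0 - second_weight) * first + second_weight * second

# e.g. third_confidence(0.55, 0.8) -> 0.725
```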
It can be understood that, because the hardware configuration (such as the CPU (central processing unit), memory and flash) of electronic devices that need to detect instructions in voice data is generally low, the resource occupation of each function is strictly constrained. In the present application, command word detection in voice data is mainly performed through the trained primary detection network and secondary detection network, whose network structures are relatively simple, so the resource occupation on the electronic device is small and the command word detection performance can be effectively improved. In contrast, recognizing the content of the received voice data based on speech recognition technology requires a larger-scale acoustic model and language model to achieve a good recognition effect; that is, the present application can accurately detect hit command words while occupying few resources, and can therefore be applied to various scenarios where device resources are limited, expanding the application scenarios of the scheme, such as offline application scenarios with limited resources like smart speakers and smart home appliances.
An example is described herein of how command word detection of voice data is achieved through two-level verification, please refer to fig. 6, fig. 6 is a schematic diagram of a data processing method according to an embodiment of the present application. As shown in fig. 6, the flow of the whole data processing method may be abstracted into a first-level verification (as shown by 601 in fig. 6) and a second-level verification (as shown by 602 in fig. 6), so that voice data may be input into the first-level verification, specifically, the voice data may include audio features of the voice data in a target time window corresponding to a current voice frame, so that a first confidence level of each command word is determined based on a trained first-level detection network, and then threshold judgment is performed to determine a hit first command word. Thus, the characteristic time window is determined based on the first command word, and then the audio characteristics of the voice data in the characteristic time window associated with the current voice frame are acquired. And then inputting the audio frequency characteristics corresponding to the characteristic time window into a trained secondary detection network to obtain a second confidence coefficient of each command word, so that the second command word hit by the characteristic time window is determined through threshold judgment, and the accuracy of command word detection can be improved through secondary verification of voice data.
According to the data processing scheme, the first command word hit in the command word set by the voice data of the target time window is determined according to the audio features respectively corresponding to the voice data of the K voice frames in the target time window, which is equivalent to preliminarily determining the command word contained in the continuously input voice data. Further, the feature time window is determined based on feature information of the first command word such as the command word length, and the second command word hit in the command word set by the voice data of the feature time window is determined according to the audio features respectively corresponding to the voice data of the voice frames in the feature time window, which is equivalent to determining a new feature time window and secondarily verifying whether the continuously input voice data contains a command word. Optionally, after the second command word is detected, the operation indicated by the second command word may be performed. Therefore, after the command word of the voice data is preliminarily determined based on the target time window, a new feature time window is determined to verify whether the voice data contains the command word, so that the accuracy of command word detection on the voice data can be improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. Alternatively, the data processing device may be provided in the above-described electronic apparatus. As shown in fig. 7, the data processing apparatus described in the present embodiment may include:
An obtaining unit 701, configured to determine a target time window corresponding to a current speech frame, and obtain audio features corresponding to speech data of K speech frames in the target time window, where K is a positive integer;
a processing unit 702, configured to determine, according to audio features corresponding to the voice data of the K voice frames, a first command word hit by the voice data of the target time window in a command word set;
the processing unit 702 is further configured to determine a feature time window associated with the current speech frame based on a command word length of the first command word, and obtain audio features corresponding to speech data of a plurality of speech frames in the feature time window, respectively;
the processing unit 702 is further configured to determine, based on audio features corresponding to the voice data of the plurality of voice frames in the feature time window, a second command word that the voice data of the feature time window hits in the command word set.
In one implementation, the command word set includes at least one command word, each command word having a plurality of syllables; the processing unit 702 is specifically configured to:
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
Determining a first confidence coefficient of the voice data of the target time window corresponding to each command word according to the probabilities that the K voice frames respectively correspond to each syllable output unit;
and if the command words with the first confidence coefficient being greater than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence coefficient being greater than or equal to the first threshold value as the first command words of the voice data of the target time window hit in the command word set.
In one implementation, any command word in the command set is represented as a target command word; the processing unit 702 is specifically configured to:
determining syllable output units corresponding to each syllable of the target command word as target syllable output units, and obtaining a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probabilities that the K voice frames respectively correspond to each syllable output unit, and obtaining K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient of the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the command word set includes at least one command word; the processing unit 702 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window;
and if the command words with the second confidence coefficient being greater than or equal to a second threshold value exist in the command word set, determining the command word with the second confidence coefficient being greater than or equal to the second threshold value and the maximum second confidence coefficient as the second command word with the voice data of the characteristic time window hit in the command word set.
In one implementation, the processing unit 702 is specifically configured to:
determining a first quantity according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame and the second number of speech frames after the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processing unit 702 is further configured to:
Acquiring first sample voice data, wherein the first sample voice data carries syllable output unit labels;
invoking an initial first-level detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output units respectively corresponding to the voice data of each voice frame in the first sample voice data and the syllable output unit labels, to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processing unit 702 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries command word labels;
invoking a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command word and the command word label to obtain the trained secondary detection network.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device described in the present embodiment includes: a processor 801, and a memory 802. Optionally, the electronic device may further include a network interface or a power module. Data may be exchanged between the processor 801 and the memory 802.
The processor 801 may be a central processing unit (Central Processing Unit, CPU) which may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The network interface may include input devices, such as a control panel, microphone, receiver, etc., and/or output devices, such as a display screen, transmitter, etc., which are not shown.
The memory 802 may include read only memory and random access memory, and provides program instructions and data to the processor 801. A portion of memory 802 may also include non-volatile random access memory. Wherein the processor 801, when calling the program instructions, is configured to execute:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
According to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining a first command word hit by the voice data of the target time window in a command word set;
determining a characteristic time window associated with the current voice frame based on the command word length of the first command word, and acquiring audio characteristics respectively corresponding to voice data of a plurality of voice frames in the characteristic time window;
and determining a second command word hit by the voice data of the characteristic time window in the command word set based on the audio characteristics respectively corresponding to the voice data of the voice frames in the characteristic time window.
In one implementation, the command word set includes at least one command word, each command word having a plurality of syllables; the processor 801 is specifically configured to:
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient of the voice data of the target time window corresponding to each command word according to the probabilities that the K voice frames respectively correspond to each syllable output unit;
And if the command words with the first confidence coefficient being greater than or equal to a first threshold value exist in the command word set, determining the command words with the first confidence coefficient being greater than or equal to the first threshold value as the first command words of the voice data of the target time window hit in the command word set.
In one implementation, any command word in the command set is represented as a target command word; the processor 801 is specifically configured to:
determining syllable output units corresponding to each syllable of the target command word as target syllable output units, and obtaining a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probabilities that the K voice frames respectively correspond to each syllable output unit, and obtaining K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from K candidate probabilities corresponding to each target syllable output unit, and determining the first confidence coefficient of the voice data of the target time window and the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
In one implementation, the command word set includes at least one command word; the processor 801 is specifically configured to:
determining a second confidence coefficient corresponding to the voice data of the characteristic time window and each command word according to the audio characteristics respectively corresponding to the voice data of a plurality of voice frames in the characteristic time window;
and if the command words with the second confidence coefficient being greater than or equal to a second threshold value exist in the command word set, determining the command word with the second confidence coefficient being greater than or equal to the second threshold value and the maximum second confidence coefficient as the second command word with the voice data of the characteristic time window hit in the command word set.
In one implementation, the processor 801 is specifically configured to:
determining a first quantity according to the command word length of the first command word and a target preset value;
determining a characteristic time window associated with the current speech frame according to the first number of speech frames before the current speech frame and the second number of speech frames after the current speech frame.
In one implementation, the first command word is determined by a trained primary detection network, and the processor 801 is further configured to:
Acquiring first sample voice data, wherein the first sample voice data carries syllable output unit labels;
invoking an initial first-level detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training based on the predicted syllable output units respectively corresponding to the voice data of each voice frame in the first sample voice data and the syllable output unit labels, to obtain the trained primary detection network.
In one implementation, the second command word is determined by a trained secondary detection network, and the processor 801 is further configured to:
acquiring second sample voice data, wherein the second sample voice data carries command word labels;
invoking a secondary detection network to determine a prediction command word corresponding to the second sample voice data;
and training based on the predicted command word and the command word label to obtain the trained secondary detection network.
Optionally, the program instructions may further implement other steps of the method in the above embodiment when executed by the processor, which is not described herein.
The present application also provides a computer readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the above method, such as the method performed by the above electronic device, which is not described herein in detail.
Alternatively, a storage medium, such as a computer readable storage medium, to which the present application relates may be nonvolatile or may be volatile.
Alternatively, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like. The blockchain referred to in the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the described order of action, as some steps may take other order or be performed simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement some or all of the steps of the above-described method. For example, the computer instructions are stored in a computer-readable storage medium. A processor of a computer device (i.e., the electronic device described above) reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps performed in the above method embodiments. For example, the computer device may be a terminal or a server.
The foregoing describes in detail a data processing method, apparatus, electronic device and storage medium provided in the embodiments of the present application. Specific examples are applied herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims (8)

1. A method of data processing, the method comprising:
determining a target time window corresponding to a current voice frame, and acquiring audio features respectively corresponding to voice data of K voice frames in the target time window, wherein K is a positive integer;
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining a first command word hit by the voice data of the target time window in a command word set;
multiplying the command word length of the first command word by a target preset value to obtain a first number, determining a feature time window associated with the current voice frame according to the first number of voice frames before the current voice frame and a second number of voice frames after the current voice frame, and acquiring audio features respectively corresponding to voice data of a plurality of voice frames in the feature time window; wherein the target preset value indicates the number of voice frames corresponding to one syllable; the command word length refers to the number of syllables in the command word; and the second number is a preset value, an empirical value, or is obtained by subtracting the command word length of the first command word from the command word length of the longest command word in the command word set to obtain a length difference and multiplying the length difference by the target preset value;
and determining, based on the audio features respectively corresponding to the voice data of the plurality of voice frames in the feature time window, a second command word hit by the voice data of the feature time window in the command word set.
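As an illustration of the window arithmetic in claim 1, the sketch below (a hypothetical Python helper; the frame index, syllable counts and frames-per-syllable value are assumed example values, not values fixed by the claim) computes the first number, the second number and the resulting boundaries of the feature time window.

def feature_window(current_frame, first_word_len, longest_word_len,
                   frames_per_syllable, second_number=None):
    # First number: command word length of the hit word times the target preset value.
    first_number = first_word_len * frames_per_syllable
    # Second number: preset/empirical, or derived from the length difference to the
    # longest command word multiplied by the same target preset value.
    if second_number is None:
        second_number = (longest_word_len - first_word_len) * frames_per_syllable
    start = max(0, current_frame - first_number)   # frames before the current frame
    end = current_frame + second_number            # frames after the current frame
    return start, end

# Example: a 4-syllable hit, longest command word of 6 syllables, ~25 frames per syllable:
# feature_window(500, 4, 6, 25) -> (400, 550)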
2. The method of claim 1, wherein the set of command words comprises at least one command word, each command word having a plurality of syllables;
the determining, according to the audio features respectively corresponding to the voice data of the K voice frames, a first command word hit by the voice data of the target time window in a command word set includes:
according to the audio characteristics respectively corresponding to the voice data of the K voice frames, determining the probability that the K voice frames respectively correspond to each syllable output unit in the syllable output unit set; the syllable output unit set is determined based on a plurality of syllables of each command word, and syllables corresponding to different syllable output units are different;
determining a first confidence coefficient of the voice data of the target time window corresponding to each command word according to the probabilities that the K voice frames respectively correspond to each syllable output unit;
and if there is a command word in the command word set whose first confidence coefficient is greater than or equal to a first threshold, determining that command word as the first command word hit by the voice data of the target time window in the command word set.
3. The method of claim 2, wherein any command word in the command word set is represented as a target command word;
the determining, according to the probabilities that the K speech frames respectively correspond to the syllable output units, a first confidence level of the speech data of the target time window corresponding to each command word includes:
determining syllable output units corresponding to each syllable of the target command word as target syllable output units, and obtaining a plurality of target syllable output units corresponding to the target command word;
determining the probability that the K voice frames respectively correspond to each target syllable output unit from the probabilities that the K voice frames respectively correspond to each syllable output unit, and obtaining K candidate probabilities respectively corresponding to each target syllable output unit;
and determining the maximum candidate probability corresponding to each target syllable output unit from the K candidate probabilities respectively corresponding to each target syllable output unit, and determining the first confidence coefficient of the voice data of the target time window corresponding to the target command word according to the maximum candidate probability corresponding to each target syllable output unit.
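The first-confidence computation of claims 2 and 3 can be sketched as follows; this is a hypothetical Python illustration in which combining the per-unit maxima by a geometric mean is an assumption, since the claims only require that the confidence be derived from the maximum candidate probability of each target syllable output unit.

import numpy as np

def first_confidence(frame_probs, target_unit_ids):
    # frame_probs: [K, num_units] posteriors from the primary detection network for
    # the K voice frames of the target time window.
    # target_unit_ids: syllable output unit ids of one command word (one per syllable).
    max_per_unit = frame_probs[:, target_unit_ids].max(axis=0)  # best frame per unit
    return float(np.exp(np.log(max_per_unit + 1e-10).mean()))   # geometric mean (assumed)

def first_stage_hit(frame_probs, command_words, first_threshold=0.5):
    # command_words: mapping from command word to its syllable output unit ids (hypothetical).
    scores = {w: first_confidence(frame_probs, ids) for w, ids in command_words.items()}
    hits = {w: s for w, s in scores.items() if s >= first_threshold}
    # Returning the highest-scoring hit is one possible choice; the claim only requires
    # that a word reaching the first threshold be taken as the first command word.
    return max(hits, key=hits.get) if hits else None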
4. The method of claim 1, wherein the set of command words comprises at least one command word; the determining, based on the audio features corresponding to the voice data of the plurality of voice frames in the feature time window, a second command word hit by the voice data of the feature time window in the command word set includes:
determining, according to the audio features respectively corresponding to the voice data of the plurality of voice frames in the feature time window, a second confidence coefficient of the voice data of the feature time window corresponding to each command word;
and if there is a command word in the command word set whose second confidence coefficient is greater than or equal to a second threshold, determining, among such command words, the one with the largest second confidence coefficient as the second command word hit by the voice data of the feature time window in the command word set.
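Claim 4's second-stage decision can likewise be sketched; this hypothetical Python snippet assumes the trained secondary detection network returns one confidence per command word for the feature time window.

import numpy as np

def second_stage_hit(window_feats, secondary_net, command_word_names, second_threshold=0.7):
    # window_feats: audio features of the voice frames in the feature time window.
    # secondary_net: callable returning an array with one second confidence per command word.
    confidences = np.asarray(secondary_net(window_feats))
    best = int(np.argmax(confidences))
    # Keep the word with the largest second confidence, provided it reaches the threshold.
    if confidences[best] >= second_threshold:
        return command_word_names[best], float(confidences[best])
    return None, None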
5. The method of claim 1, wherein the first command word is determined by a trained primary detection network, the method further comprising:
acquiring first sample voice data, wherein the first sample voice data carries syllable output unit labels;
invoking an initial primary detection network, and determining a predicted syllable output unit corresponding to the voice data of each voice frame in the first sample voice data;
and training the network based on the predicted syllable output units respectively corresponding to the voice data of each voice frame in the first sample voice data and the syllable output unit labels, to obtain the trained primary detection network.
6. The method of claim 5, wherein the second command word is determined by a trained secondary detection network, the method further comprising:
acquiring second sample voice data, wherein the second sample voice data carries command word labels;
invoking a secondary detection network to determine a predicted command word corresponding to the second sample voice data;
and training based on the predicted command word and the command word label to obtain the trained secondary detection network.
7. A data processing apparatus, the apparatus comprising:
an acquisition unit, configured to determine a target time window corresponding to a current voice frame and acquire audio features respectively corresponding to voice data of K voice frames in the target time window, where K is a positive integer;
a processing unit, configured to determine, according to the audio features respectively corresponding to the voice data of the K voice frames, a first command word hit by the voice data of the target time window in a command word set;
The processing unit is further configured to multiply the command word length of the first command word by a target preset value to obtain a first number, determine a feature time window associated with the current voice frame according to the first number of voice frames before the current voice frame and a second number of voice frames after the current voice frame, and obtain audio features corresponding to voice data of a plurality of voice frames in the feature time window respectively; the command word length refers to the number of syllables in the command word; the second number is a preset numerical value, or an empirical value, or is obtained by subtracting the command word length of the first command word from the command word length of the longest command word in the command word set to obtain a length difference, and multiplying the length difference by the target preset value;
the processing unit is further configured to determine, based on audio features corresponding to the voice data of the plurality of voice frames in the feature time window, a second command word in which the voice data of the feature time window hits in the command word set.
8. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, the computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of any one of claims 1-6.
CN202210597464.XA 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium Active CN115132198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210597464.XA CN115132198B (en) 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210597464.XA CN115132198B (en) 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium

Publications (2)

Publication Number Publication Date
CN115132198A CN115132198A (en) 2022-09-30
CN115132198B true CN115132198B (en) 2024-03-15

Family

ID=83378679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210597464.XA Active CN115132198B (en) 2022-05-27 2022-05-27 Data processing method, device, electronic equipment, program product and medium

Country Status (1)

Country Link
CN (1) CN115132198B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170069309A1 (en) * 2015-09-03 2017-03-09 Google Inc. Enhanced speech endpointing

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4761815A (en) * 1981-05-01 1988-08-02 Figgie International, Inc. Speech recognition system based on word state duration and/or weight
US5018201A (en) * 1987-12-04 1991-05-21 International Business Machines Corporation Speech recognition dividing words into two portions for preliminary selection
KR101122590B1 (en) * 2011-06-22 2012-03-16 (주)지앤넷 Apparatus and method for speech recognition by dividing speech data
WO2014203328A1 (en) * 2013-06-18 2014-12-24 株式会社日立製作所 Voice data search system, voice data search method, and computer-readable storage medium
CN110534099A (en) * 2019-09-03 2019-12-03 腾讯科技(深圳)有限公司 Voice wakes up processing method, device, storage medium and electronic equipment
WO2021093449A1 (en) * 2019-11-14 2021-05-20 腾讯科技(深圳)有限公司 Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN110890093A (en) * 2019-11-22 2020-03-17 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111105794A (en) * 2019-12-13 2020-05-05 珠海格力电器股份有限公司 Voice recognition method and device of equipment
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN114220418A (en) * 2021-12-17 2022-03-22 四川启睿克科技有限公司 Awakening word recognition method and device for target speaker
CN115132197A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, electronic device, program product, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings; B. Shi; arXiv:2007.00183; 2020-11-24; full text *

Also Published As

Publication number Publication date
CN115132198A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN107316643B (en) Voice interaction method and device
CN110660201B (en) Arrival reminding method, device, terminal and storage medium
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN110428854B (en) Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN110097870A (en) Method of speech processing, device, equipment and storage medium
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN110880328B (en) Arrival reminding method, device, terminal and storage medium
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112259101A (en) Voice keyword recognition method and device, computer equipment and storage medium
CN111009261B (en) Arrival reminding method, device, terminal and storage medium
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN112735381B (en) Model updating method and device
CN114333768A (en) Voice detection method, device, equipment and storage medium
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN115132198B (en) Data processing method, device, electronic equipment, program product and medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN115132170A (en) Language classification method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant