WO2019149108A1 - Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device - Google Patents

Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device

Info

Publication number
WO2019149108A1
Authority
WO
WIPO (PCT)
Prior art keywords
predetermined
probability
segment
speech
voice
Prior art date
Application number
PCT/CN2019/072590
Other languages
English (en)
French (fr)
Inventor
王珺
苏丹
俞栋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to JP2020540799A priority Critical patent/JP7005099B2/ja
Priority to EP19747243.4A priority patent/EP3748629B1/en
Publication of WO2019149108A1 publication Critical patent/WO2019149108A1/zh
Priority to US16/884,350 priority patent/US11222623B2/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L2015/088 Word spotting
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method and device for identifying a voice keyword, a computer readable storage medium, and a computer device.
  • Speech keyword recognition refers to identifying whether a predetermined keyword exists in a continuous speech signal; it has wide applications in electronic device wake-up, dialogue interaction interface initialization, audio indexing and retrieval, and voice password verification.
  • The traditional speech keyword recognition method first extracts acoustic features from the speech signal to be recognized and inputs them into a pre-trained deep neural network model; it then determines whether a predetermined keyword exists in the speech signal based on the probabilities output by the deep neural network model and manually designed decision logic.
  • The traditional method is very sensitive to the manually set decision logic: whenever the application scenario or the predetermined keyword changes, the decision logic needs to be carefully re-tuned by hand to adapt to the new application scenario, so the universality of the method is poor.
  • a method, an apparatus, a computer readable storage medium, and a computer device for identifying a voice keyword are provided.
  • a method for identifying a voice keyword is performed by a user terminal or a server, and includes the steps of:
  • obtaining each first speech segment based on the speech signal to be recognized, and obtaining, by using a preset first classification model, a first probability corresponding to each first speech segment, where the first probability includes each probability that the first speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword;
  • obtaining each second speech segment based on the speech signal to be recognized, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments that correspond to the second speech segment;
  • performing classification based on each first prediction feature by using a preset second classification model to obtain a second probability corresponding to each second speech segment, where the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword; and
  • determining, based on the second probability, whether the predetermined keyword exists in the speech signal to be recognized.
  • A speech keyword recognition apparatus includes:
  • a first speech segment acquiring module, configured to obtain each first speech segment based on the speech signal to be recognized;
  • a first probability acquiring module, configured to obtain, by using a preset first classification model, a first probability corresponding to each first speech segment, the first probability including each probability that the first speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword;
  • a prediction feature generating module, configured to obtain each second speech segment based on the speech signal to be recognized, and generate a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments that correspond to the second speech segment;
  • a second probability acquiring module, configured to perform classification based on each first prediction feature by using a preset second classification model to obtain a second probability corresponding to each second speech segment, the second probability including at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword; and
  • a keyword identifying module, configured to determine, based on the second probability, whether the predetermined keyword exists in the speech signal to be recognized.
  • a computer readable storage medium storing a computer program, when executed by a processor, causes the processor to perform the following steps:
  • obtaining each first speech segment based on the speech signal to be recognized, and obtaining, by using a preset first classification model, a first probability corresponding to each first speech segment, where the first probability includes each probability that the first speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword;
  • obtaining each second speech segment based on the speech signal to be recognized, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments that correspond to the second speech segment;
  • performing classification based on each first prediction feature by using a preset second classification model to obtain a second probability corresponding to each second speech segment, where the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword; and
  • determining, based on the second probability, whether the predetermined keyword exists in the speech signal to be recognized.
  • a computer device comprising a memory and a processor, the memory storing a computer program, the computer program being executed by the processor, causing the processor to perform the following steps:
  • obtaining each first speech segment based on the speech signal to be recognized, and obtaining, by using a preset first classification model, a first probability corresponding to each first speech segment, where the first probability includes each probability that the first speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword;
  • obtaining each second speech segment based on the speech signal to be recognized, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments that correspond to the second speech segment;
  • performing classification based on each first prediction feature by using a preset second classification model to obtain a second probability corresponding to each second speech segment, where the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword; and
  • determining, based on the second probability, whether the predetermined keyword exists in the speech signal to be recognized.
  • FIG. 1 is an application environment diagram of a method for identifying a voice keyword in an embodiment
  • FIG. 2 is a schematic flow chart of a method for identifying a voice keyword in an embodiment
  • FIG. 3 is a schematic diagram of a topology structure of a CNN model in an embodiment
  • FIG. 4 is a schematic structural diagram of a voice keyword recognition system in an embodiment
  • FIG. 5 is a schematic diagram of a frequency spectrum of a voice signal and a corresponding first probability in an embodiment
  • FIG. 6 is a schematic flow chart of preliminary determination based on predetermined decision logic in an embodiment
  • Figure 7 is a flow chart showing the steps added in the embodiment of Figure 6;
  • FIG. 8 is a schematic flow chart of preliminary determination based on predetermined decision logic in an embodiment
  • FIG. 9 is a schematic flow chart of a method for training a first classification model in an embodiment
  • FIG. 10 is a schematic flow chart of a method for training a second classification model in an embodiment
  • FIG. 11 is a schematic flow chart of a method for identifying a voice keyword in another embodiment
  • FIG. 12 is a structural block diagram of an apparatus for identifying a voice keyword in an embodiment
  • Figure 13 is a block diagram showing the structure of a computer device in an embodiment
  • Figure 14 is a block diagram showing the structure of a computer device in an embodiment.
  • the method for identifying a voice keyword provided by each embodiment of the present application can be applied to an application environment as shown in FIG. 1.
  • the application environment may involve user terminal 110 and server 120, and user terminal 110 and server 120 communicate over a network.
  • the user terminal 110 acquires the voice signal to be identified, and then sends the to-be-identified voice signal to the server 120 through the network.
  • The server 120 obtains each first speech segment based on the to-be-recognized speech signal, and obtains, by using a preset first classification model, a first probability corresponding to each first speech segment; the first probability includes each probability that the first speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword. Then, each second speech segment is obtained based on the speech signal to be recognized, and a first prediction feature of each second speech segment is generated based on the first probabilities corresponding to the first speech segments that correspond to the second speech segment. Further, classification is performed based on each first prediction feature by using a preset second classification model to obtain a second probability corresponding to each second speech segment; the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword. Finally, whether the predetermined keyword exists in the to-be-recognized speech signal is determined based on the second probability.
  • In other embodiments, the steps from acquiring the to-be-recognized speech signal to determining whether the predetermined keyword exists in it may also be performed entirely by the user terminal 110, without the server 120 participating.
  • the user terminal 110 may be a mobile terminal or a desktop terminal, and the mobile terminal may include at least one of a mobile phone, a speaker, a robot, a tablet, a notebook computer, a personal digital assistant, and a wearable device.
  • the server 120 can be implemented by a separate physical server or a server cluster composed of a plurality of physical servers.
  • a method of identifying a voice keyword is provided. This method is described by an example of execution by a computer device (such as user terminal 110 or server 120 in FIG. 1 above). The method may include the following steps S202 to S210.
  • the speech signal to be recognized refers to a speech signal that needs to determine whether a predetermined keyword exists therein.
  • In a speech keyword recognition scenario, the user usually utters a sound according to actual needs (for example, speaks a sentence); the computer device collects the sound signal and converts it into an electrical signal to obtain the speech signal to be recognized.
  • The first speech segment refers to a first spliced frame sequence corresponding to a unit frame in the speech signal to be recognized.
  • After acquiring the to-be-recognized speech signal, the computer device performs framing processing on it to obtain the unit frames; that is, the speech signal to be recognized is divided into a number of small segments, each of which is a unit frame. Further, the computer device may obtain, according to a predetermined first splicing rule, the first spliced frame sequence corresponding to each unit frame, that is, each first speech segment.
  • The framing process can be implemented with a sliding window function. For example, with a frame window of 25 ms and a window shift of 10 ms, each obtained unit frame is 25 ms long and adjacent unit frames overlap by 15 ms, as sketched below.
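  • A minimal illustrative sketch of this framing step in Python with NumPy (not part of the patent text; the 16 kHz sample rate is an assumption, since the patent does not specify one):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D speech signal into overlapping unit frames.

    With a 25 ms window and a 10 ms shift, each unit frame is 25 ms
    long and adjacent unit frames overlap by 15 ms, as in the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])

# Example: one second of audio yields 98 unit frames of 400 samples each.
frames = frame_signal(np.random.randn(16000))
print(frames.shape)  # (98, 400)
```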
  • Specifically, for any unit frame, the first preset number of unit frames appearing before it, the unit frame itself, and the second preset number of unit frames appearing after it are spliced, thereby obtaining the first speech segment corresponding to that unit frame.
  • the first preset frame number and the second preset frame number may be set according to a length of a predetermined word segmentation unit of a predetermined keyword corresponding to the preset first classification model.
  • For example, suppose the predetermined keyword is "ear", the predetermined word segmentation units of the predetermined keyword corresponding to the first classification model are "er" and "duo", the first preset frame number is set to 10, and the second preset frame number is set to 5. In this case, for any unit frame, the 10 unit frames before it, the unit frame itself, and the 5 unit frames after it are spliced, and the resulting first speech segment corresponding to that unit frame includes 16 unit frames.
  • Suppose the to-be-recognized speech signal contains N unit frames, denoted the first, second, third, ..., Nth unit frame according to their order of appearance in the signal.
  • For unit frames near the beginning of the signal, for which fewer than the first preset number of unit frames appear before them, the first unit frame may be copied to make up the first preset frame number.
  • For example, suppose the first preset frame number is 10 and the second preset frame number is 5. The first speech segment corresponding to the first unit frame may then include 11 copies of the first unit frame plus the second to sixth unit frames, 16 unit frames in total; the first speech segment corresponding to the third unit frame may include 9 copies of the first unit frame plus the second to eighth unit frames, again 16 unit frames in total.
  • Similarly, for unit frames near the end of the signal, for which fewer than the second preset number of unit frames appear after them, the Nth unit frame may be copied to make up the second preset frame number, as in the sketch below.
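  • A sketch of the splicing rule with edge-copy padding, using NumPy and operating on per-frame feature vectors; the 40-dimensional features are illustrative, not a requirement:

```python
import numpy as np

def splice_frames(frames, n_before=10, n_after=5):
    """Build the spliced frame sequence for every unit frame.

    Frames near the edges are padded by copying the first (or last)
    unit frame, so each spliced first speech segment always contains
    n_before + 1 + n_after unit frames (16 here).
    """
    padded = np.concatenate([
        np.repeat(frames[:1], n_before, axis=0),  # copies of the 1st frame
        frames,
        np.repeat(frames[-1:], n_after, axis=0),  # copies of the Nth frame
    ])
    return np.stack([padded[i : i + n_before + 1 + n_after]
                     for i in range(len(frames))])

segments = splice_frames(np.random.randn(98, 40))  # e.g. 40-dim features
print(segments.shape)  # (98, 16, 40): one 16-frame segment per unit frame
```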
  • the first classification model is a pre-trained neural network model.
  • Specifically, the acoustic features of each first speech segment may be input into the first classification model, which classifies each first speech segment based on its acoustic features to obtain the first probability corresponding to that first speech segment.
  • the first probability corresponding to the first voice segment may include each probability that the first voice segment respectively corresponds to each predetermined word segment unit of the predetermined keyword.
  • the first probability can be a posterior probability.
  • the acoustic feature of the first speech segment may include acoustic features of each unit frame included in the first speech segment.
  • Specifically, the acoustic feature of the first speech segment is a feature vector of dimension t × f, where t is the time frame dimension, i.e., the total number of unit frames included in the first speech segment, and f is the spectral dimension, i.e., the dimension of the acoustic feature of each unit frame.
  • The acoustic feature of a unit frame is obtained by performing acoustic feature extraction on the unit frame. Specifically, the waveform corresponding to the unit frame is converted into a multi-dimensional vector that characterizes the content information contained in the unit frame; this vector can serve as the acoustic feature of the unit frame.
  • The acoustic feature of a unit frame may include any one or a combination of the Mel spectrum, the log-Mel spectrum (obtained by taking the logarithm of the Mel spectrum), and the Mel-frequency cepstral coefficients (MFCC). Taking extraction of the log-Mel spectrum as an example, a 40-dimensional vector corresponding to each unit frame can be obtained.
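  • For illustration, the log-Mel extraction could be done with librosa (an assumed library choice; "speech.wav" and the 16 kHz sample rate are placeholders, not values from the patent):

```python
import librosa
import numpy as np

# Load audio; 16 kHz is an assumed sample rate and the file is a placeholder.
y, sr = librosa.load("speech.wav", sr=16000)

# 25 ms window (400 samples) and 10 ms hop (160 samples), 40 Mel bands,
# matching the 40-dimensional per-frame vector mentioned above.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = np.log(mel + 1e-6)  # log-Mel spectrum; epsilon avoids log(0)
print(log_mel.T.shape)        # (num_unit_frames, 40)
```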
  • Each predetermined word segmentation unit may be obtained by performing word segmentation processing on the predetermined keyword.
  • Taking the predetermined keyword "ear" with pinyin as the word segmentation unit as an example, the predetermined word segmentation units of the predetermined keyword "ear" may be "er" and "duo", respectively.
  • Accordingly, the first probability output by the first classification model for a first speech segment may include the probability that the first speech segment corresponds to "er" and the probability that it corresponds to "duo".
  • Similarly, when the predetermined keyword is "little smurf" and the word segmentation unit is pinyin, the predetermined word segmentation units may be "xiao", "lan", "jing" and "ling", and the first probability output by the first classification model for a first speech segment may include the probabilities that the first speech segment corresponds to "xiao", "lan", "jing" and "ling", respectively.
  • the first probability may include a probability that the first speech segment corresponds to the first padding information, in addition to the respective probabilities that the first segment of speech corresponds to each predetermined segmentation unit.
  • the first padding information refers to other information than each predetermined word segmentation unit. For example, for the case where each predetermined word segmentation unit is "er” and “duo”, all the information except “er” and “duo” are the first padding information. For another example, for each of the predetermined word segmentation units being "xiao”, “lan”, “jing”, and “ling”, all information except “xiao”, “lan”, “jing", and “ling” All are the first padding information.
  • When the first probability includes each probability that the first speech segment corresponds to each predetermined word segmentation unit as well as the probability that it corresponds to the first padding information, the sum of the probabilities included in the first probability of any first speech segment may be 1.
  • In some embodiments, the first classification model may be a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory network), a TDNN (Time-Delay Neural Network), or a gated convolutional neural network.
  • the CNN may include a convolution layer, a max-pooling layer, a fully connected layer, and a softmax layer.
  • When the input information of the first classification model is the acoustic feature of a first speech segment (i.e., a feature vector of dimension t × f), as shown in FIG. 3, the convolution layer convolves the t × f feature vector with a convolution kernel of dimension s × v × w (i.e., the filtering weight matrix) to obtain s feature maps, where v is the size of each convolution kernel in the time frame dimension.
  • The max-pooling layer then performs maximum pooling on the s feature maps (i.e., takes the maximum over the feature points in a neighborhood, a down-sampling process) to reduce the size of the time-frequency dimensions, yielding s dimension-reduced feature maps.
  • The s dimension-reduced feature maps are classified by the fully connected layer, whose output is sent to the softmax layer; the softmax layer normalizes the output of the fully connected layer to obtain the first probability corresponding to the first speech segment.
  • Alternatively, the CNN can use five fully connected layers, in which the first four layers each contain 512 nodes and the last layer contains 128 nodes.
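  • A minimal sketch of this topology in PyTorch (an assumed framework; the kernel count and sizes s=32, v=8, w=8 are illustrative choices, not values from the patent):

```python
import torch
import torch.nn as nn

class FirstClassifierCNN(nn.Module):
    """Sketch of the described CNN: one convolution layer, one
    max-pooling layer, a fully connected layer and a softmax layer."""

    def __init__(self, t=16, f=40, n_classes=3, s=32, v=8, w=8):
        super().__init__()
        self.conv = nn.Conv2d(1, s, kernel_size=(v, w))  # s feature maps
        self.pool = nn.MaxPool2d(kernel_size=2)          # shrink time-freq dims
        flat = s * ((t - v + 1) // 2) * ((f - w + 1) // 2)
        self.fc = nn.Linear(flat, n_classes)

    def forward(self, x):  # x: (batch, 1, t, f) acoustic features
        h = self.pool(torch.relu(self.conv(x)))
        logits = self.fc(h.flatten(1))
        # Softmax normalization yields the first probabilities
        # ("er", "duo", first padding information), summing to 1.
        return torch.softmax(logits, dim=-1)

probs = FirstClassifierCNN()(torch.randn(4, 1, 16, 40))
print(probs.shape, probs.sum(dim=-1))  # (4, 3), each row sums to 1
```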
  • the second speech segment refers to a second concatenated frame sequence corresponding to the unit frame in the speech signal to be recognized. Similar to the first speech segment, the computer device can obtain each second splicing frame sequence, that is, each second speech segment, in one-to-one correspondence with each unit frame based on a predetermined second splicing rule.
  • The third preset frame number and the fourth preset frame number may be set based on the length of the predetermined keyword. Taking the predetermined keyword "ear" as an example, the third preset frame number can be set to 40 and the fourth preset frame number to 20; that is, for any unit frame, the 40 unit frames appearing before it, the unit frame itself, and the 20 unit frames appearing after it are spliced, and the resulting second speech segment corresponding to that unit frame includes 61 unit frames.
  • It can be seen that the second speech segment contains more unit frames than the first speech segment.
  • the second speech segment contains more "context" information than the first speech segment.
  • the first prediction feature of the second speech segment may be generated based on a first probability corresponding to each of the first speech segments corresponding to the second speech segment.
  • the first predicted feature of the second voice segment may include respective first probabilities corresponding to the first voice segments that are in one-to-one correspondence with the unit frames included in the second voice segment.
  • For example, if a second speech segment includes 61 unit frames, each of the 61 unit frames included in the second speech segment has a first speech segment corresponding to it; accordingly, the second speech segment corresponds to 61 first speech segments, and each of those first speech segments has a corresponding first probability. The first prediction feature of the second speech segment therefore includes the 61 first probabilities corresponding to the first speech segments that correspond to the second speech segment.
  • Continuing the example: suppose the second speech segment includes 61 unit frames, the predetermined keyword is "ear", the predetermined word segmentation units are "er" and "duo", and the first probability output by the first classification model includes the probability that a first speech segment corresponds to "er", the probability that it corresponds to "duo", and the probability that it corresponds to the first padding information. In this case, the first prediction feature of the second speech segment may be a vector of dimension 61 × 3.
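  • Illustratively, the 61 × 3 first prediction feature can be assembled from the per-frame first probabilities as follows (NumPy; edge rows are copied, mirroring the frame-copy padding described earlier):

```python
import numpy as np

def first_prediction_feature(first_probs, center, n_before=40, n_after=20):
    """Stack the first probabilities of the 61 first speech segments
    whose unit frames make up one second speech segment.

    first_probs: (N, 3) array, one row per unit frame, holding the
    probabilities for "er", "duo" and the first padding information.
    """
    idx = np.clip(np.arange(center - n_before, center + n_after + 1),
                  0, len(first_probs) - 1)  # copy edge rows when needed
    return first_probs[idx]                 # shape (61, 3)

feature = first_prediction_feature(np.random.rand(500, 3), center=100)
print(feature.shape)  # (61, 3)
```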
  • the second classification model is also a pre-trained neural network model.
  • Specifically, the first prediction feature of each second speech segment may be input into the second classification model, which classifies each second speech segment based on its first prediction feature to obtain the second probability corresponding one-to-one to each second speech segment.
  • the second probability corresponding to the second speech segment may include at least one of a probability that the second speech segment corresponds to a predetermined keyword and a probability that the second speech segment does not correspond to the predetermined keyword. Similar to the first probability, the second probability can also be a posterior probability.
  • the second probability may include only the probability that the second speech segment corresponds to a predetermined keyword. Taking the predetermined keyword as "ear” as an example, the second probability corresponding to the second voice segment may include the probability that the second voice segment corresponds to "er duo". Taking the predetermined keyword as "small smurf” as an example, the second probability corresponding to the second voice segment may include the probability that the second voice segment corresponds to "xiao lan jing ling".
  • In other embodiments, the second probability may include only the probability that the second speech segment does not correspond to the predetermined keyword. Taking the predetermined keyword "ear" as an example, the second probability corresponding to the second speech segment may include only the probability that the second speech segment corresponds to information other than "er duo".
  • the second probability may include both a probability that the second speech segment corresponds to the predetermined keyword and a probability that the second speech segment does not correspond to the predetermined keyword.
  • the sum of the probabilities included in the second probability corresponding to the second speech segment may be 1.
  • In some embodiments, the second classification model may be a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory network), a TDNN (Time-Delay Neural Network), a gated convolutional neural network, or a fully connected deep neural network (FCDNN).
  • For example, when the second classification model is a fully connected network, it may include two fully connected layers, each with 128 nodes, thereby reducing complexity while ensuring system performance.
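  • A sketch of such a network, assuming PyTorch, the flattened 61 × 3 first prediction feature as input, and a two-way output head (keyword / not keyword); none of these choices is mandated by the text:

```python
import torch
import torch.nn as nn

class SecondClassifier(nn.Module):
    """Two fully connected layers with 128 nodes each, as described;
    the two-way softmax output head is an illustrative assumption."""

    def __init__(self, in_dim=61 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2),  # keyword / not-keyword probabilities
        )

    def forward(self, x):  # x: (batch, 61, 3) first prediction features
        return torch.softmax(self.net(x.flatten(1)), dim=-1)

second_probs = SecondClassifier()(torch.randn(4, 61, 3))
print(second_probs.shape)  # (4, 2); columns: keyword / not keyword
```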
  • S210 Determine, according to the second probability, whether a predetermined keyword exists in the to-be-identified voice signal.
  • each of the second probabilities corresponding to each second speech segment may be compared one by one with a predetermined probability threshold.
  • Specifically, the second probabilities corresponding to the second speech segments may be compared with the predetermined probability threshold one by one, in the order in which the unit frames corresponding to the second speech segments appear in the to-be-recognized speech signal.
  • Comparing a second probability with the predetermined probability threshold specifically means determining whether the probability, included in the second probability, that the second speech segment corresponds to the predetermined keyword (or that it does not correspond to the predetermined keyword) is greater than the corresponding predetermined probability threshold.
  • An exemplary procedure for determining, based on the obtained second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal is as follows:
  • If the probability that the first second speech segment (the second speech segment whose corresponding unit frame appears earliest in the speech signal to be recognized) corresponds to the predetermined keyword is greater than the predetermined probability threshold, it is determined that the predetermined keyword exists in the first second speech segment; a recognition result indicating that the predetermined keyword exists in the to-be-recognized speech signal is output, and the recognition process ends. Conversely, if that probability is not greater than the predetermined probability threshold, it is determined that the predetermined keyword does not exist in the first second speech segment, and the comparison continues with the probability that the second second speech segment corresponds to the predetermined keyword, and so on.
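  • A sketch of this sequential threshold check; the 0.5 threshold is an illustrative assumption:

```python
def keyword_present(second_probs, threshold=0.5):
    """Scan the keyword probabilities in the order the corresponding
    unit frames appear; stop as soon as one exceeds the threshold."""
    for p_keyword in second_probs:  # one value per second speech segment
        if p_keyword > threshold:
            return True             # predetermined keyword detected
    return False

print(keyword_present([0.1, 0.2, 0.9, 0.3]))  # True
```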
  • The above speech keyword recognition method does not rely on manually designed decision logic for the final determination of whether the predetermined keyword exists in the speech signal to be recognized. Instead, each second speech segment is obtained based on the speech signal, the prediction feature of each second speech segment is generated based on the first probabilities corresponding to the first speech segments that correspond to it, the prediction feature is input into the second classification model to obtain at least one of the probability that each second speech segment corresponds to the predetermined keyword and the probability that it does not, and the final determination is made based on the probabilities output by the second classification model. This effectively overcomes the problems caused by manually set decision logic in the traditional method, thus improving universality.
  • Moreover, the traditional solution's sensitivity to the predetermined decision logic limits flexible development and rapid launch of products, and its generalization ability is weak; the above method also reduces this limitation and improves generalization ability.
  • The recall rate characterizes the proportion of positive samples that are identified as positive, and the false recognition rate characterizes the proportion of negative samples that are identified as positive. A low false recognition rate means that, when the predetermined keyword does not actually exist in the to-be-recognized speech signal, the probability of erroneously recognizing that it exists is low.
  • In traditional solutions, the predetermined keyword usually needs to contain at least four syllables or at least five phonemes, as with "Okay Google", "Tmall Elf", "Hello Xiaoya", "Little Love Classmate" and "Hello TV".
  • Moreover, the conventional scheme achieves acceptable system performance only when the predetermined keyword is long and the background environment of the speech signal to be recognized is quiet.
  • In the embodiments of the present application, the first classification model and the second classification model perform predetermined keyword recognition in two stages: the first probabilities corresponding to the first speech segments are obtained first, and the second probabilities corresponding one-to-one to the second speech segments are then obtained based on the first probabilities of the first speech segments corresponding to each second speech segment. Since the second speech segment contains more "context" information, recognition accuracy can be effectively improved. The solution in the embodiments of the present application is therefore not only applicable when the keyword is long and the background environment is quiet; it also achieves a better balance between recall rate and false recognition rate when the predetermined keyword is shorter and the background of the speech signal to be recognized is a real far-field environment.
  • In some embodiments, before entering the step of obtaining each second speech segment based on the to-be-recognized speech signal, a preliminary judgment may first be made, based on the first probabilities and predetermined decision logic, as to whether the predetermined keyword exists in the to-be-recognized speech signal. When it is preliminarily determined that the predetermined keyword exists, the step of obtaining each second speech segment based on the speech signal to be recognized is entered. Conversely, when it is preliminarily determined that the predetermined keyword does not exist, a recognition result indicating that the predetermined keyword does not exist in the speech signal may be output directly, and the recognition process ends.
  • the decision logic can be implemented based on a Hidden Markov Model (HMM).
  • In this embodiment, a preliminary judgment step is added, and the classification processing by the second classification model is performed only when it is preliminarily determined, based on the predetermined decision logic, that the predetermined keyword exists.
  • On the one hand, the double judgment can improve recognition accuracy.
  • On the other hand, when it is preliminarily determined that the predetermined keyword does not exist, the recognition process terminates early and the classification processing by the second classification model is not needed, avoiding computation with no practical value and effectively optimizing system performance.
  • Furthermore, the system can achieve a low false negative rate through the predetermined decision logic (so that, when the predetermined keyword actually exists in the speech signal to be recognized, the probability of erroneously recognizing that it does not exist is low).
  • For example, the false negative rate of the system can be kept below 0.05. It should be noted that, in the preliminary judgment based on the predetermined decision logic, the false positive rate may be temporarily ignored; the second classification network then optimizes the false positive rate on top of the predetermined decision logic.
  • For example, suppose the predetermined keyword is "ear", speech signal A corresponds to "er duo", and speech signal B corresponds to "ao duo". When speech signal A is classified by the first classification model, the output first probabilities are shown on the ordinate of the left coordinate axis in FIG. 5; when speech signal B is classified by the first classification model, the output first probabilities are shown on the ordinate of the right coordinate axis in FIG. 5.
  • The white line in the left spectrogram of FIG. 5 indicates the appearance position of the predetermined keyword in speech signal A, recognized based on the first probabilities and the predetermined decision logic; the white line in the right spectrogram of FIG. 5 indicates the appearance position of the predetermined keyword in speech signal B, recognized in the same way. It can be seen that misrecognition may still occur when preliminary recognition is based only on the first classification model and the predetermined decision logic: for speech signal B, in which the predetermined keyword does not actually exist, it is recognized that the predetermined keyword exists. In this embodiment, however, after the predetermined keyword is preliminarily determined by the predetermined decision logic, the second classification model performs further recognition, which effectively reduces misrecognition and thereby improves recognition accuracy.
  • In some embodiments, the predetermined decision logic may detect whether each predetermined word segmentation unit of the predetermined keyword exists in the to-be-recognized speech signal, and whether the order in which the predetermined word segmentation units appear in the speech signal is consistent with their order of appearance in the predetermined keyword.
  • Specifically, preliminarily determining, by using the first probabilities and the predetermined decision logic, whether the predetermined keyword exists in the to-be-recognized speech signal may include the following steps S602 to S608.
  • S602: Determine the current word segmentation unit to be identified: among the predetermined word segmentation units that have not yet served as the word segmentation unit to be identified, the one appearing earliest in the predetermined keyword is determined as the current word segmentation unit to be identified.
  • Take the predetermined keyword "little smurf" with predetermined word segmentation units "xiao", "lan", "jing" and "ling" as an example. At first, none of "xiao", "lan", "jing" and "ling" has served as the word segmentation unit to be identified, so the earliest one, "xiao", is determined as the current word segmentation unit to be identified. Afterwards, "lan", "jing" and "ling" have not yet served as the word segmentation unit to be identified, so the earliest of them, "lan", is determined as the current word segmentation unit to be identified, and so on.
  • S604: Determine the current speech segment to be determined: among the first speech segments that have not yet served as the speech segment to be determined, the one appearing earliest in the to-be-recognized speech signal is determined as the current speech segment to be determined.
  • For example, when the to-be-recognized speech signal contains N unit frames, there are N first speech segments, denoted the first, second, third, ..., Nth first speech segment according to the order in which their corresponding unit frames appear in the speech signal. At first, none of the N first speech segments has served as the speech segment to be determined, so the first first speech segment is determined as the current speech segment to be determined. Afterwards, the second, third, ..., Nth first speech segments have not yet served as the speech segment to be determined, so the earliest of them, the second first speech segment, is determined as the current speech segment to be determined, and so on.
  • S606: When the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is greater than a predetermined threshold, and the current word segmentation unit to be identified is not the last predetermined word segmentation unit in the predetermined keyword, return to the step of determining the current word segmentation unit to be identified.
  • S608: When the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is greater than the predetermined threshold, and the current word segmentation unit to be identified is the last predetermined word segmentation unit in the predetermined keyword, determine that the predetermined keyword exists in the to-be-recognized speech signal.
  • Specifically, when the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is greater than the predetermined threshold, it is judged whether the current word segmentation unit to be identified is the last predetermined word segmentation unit appearing in the predetermined keyword. If not, only the current word segmentation unit to be identified has been detected in the to-be-recognized speech signal, and it is necessary to further detect whether the other predetermined word segmentation units exist in it; the procedure therefore returns to the step of determining the current word segmentation unit to be identified.
  • Conversely, when that probability is less than or equal to the predetermined threshold, the procedure may return to the step of determining the current speech segment to be determined, so that the next first speech segment is determined as the current speech segment to be determined and detection of the current word segmentation unit to be identified continues on that next first speech segment.
  • In addition, if the current speech segment to be determined is the first speech segment corresponding to the last unit frame in the to-be-recognized speech signal and the last predetermined word segmentation unit of the predetermined keyword is still not detected in it, it is preliminarily determined that the predetermined keyword does not exist in the speech signal; a recognition result indicating that the predetermined keyword does not exist in the to-be-recognized speech signal is output directly, and the recognition process ends.
  • the system can achieve a lower false negative rate by predetermined decision logic. Accordingly, in the present embodiment, the system can also achieve a lower false negative rate by adjusting the predetermined threshold.
  • the method for identifying a voice keyword may further include the following steps S702 to S704.
  • S702: When the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is less than or equal to the predetermined threshold, determine the predetermined word segmentation unit appearing first among the predetermined word segmentation units of the predetermined keyword as the current word segmentation unit to be identified.
  • S704: Return to the step of determining the current speech segment to be determined (S604).
  • It may happen that each predetermined word segmentation unit of the predetermined keyword exists in the to-be-recognized speech signal, and the order in which the predetermined word segmentation units appear in the speech signal is consistent with their order of appearance in the predetermined keyword, yet the predetermined word segmentation units do not form the predetermined keyword compactly; instead, they are separated by other padding information, in which case the predetermined keyword does not actually exist in the speech signal. The predetermined keyword "little smurf" with predetermined word segmentation units "xiao", "lan", "jing" and "ling" is again taken as an example below.
  • Therefore, in this embodiment, when the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is less than or equal to the predetermined threshold, the predetermined word segmentation unit appearing first among the predetermined word segmentation units of the predetermined keyword is determined as the current word segmentation unit to be identified, and the procedure then returns to the step of determining the current speech segment to be determined. For example, with the predetermined keyword "little smurf" and predetermined word segmentation units "xiao", "lan", "jing" and "ling", the first "xiao" is determined as the current word segmentation unit to be identified, and the procedure returns to the step of determining the current speech segment to be determined.
  • In other embodiments, a counter may be used. Whenever the probability that the current speech segment to be determined corresponds to the current word segmentation unit to be identified is greater than the predetermined threshold, the current counter value is first set to a predetermined trigger initial value (a positive number set based on business experience, such as 30), and the procedure then returns to the step of determining the current word segmentation unit to be identified.
  • Whenever that probability is less than or equal to the predetermined threshold, the current counter value is decreased by a predetermined adjustment value (for example, by one) to update it, and it is judged whether the current counter value is greater than a predetermined standard value (such as 0). If it is greater, the word segmentation unit most recently judged to exceed the threshold is still in an active state, so the procedure may return directly to the step of determining the current speech segment to be determined.
  • Otherwise, the predetermined word segmentation unit appearing first among the predetermined word segmentation units of the predetermined keyword may be determined as the current word segmentation unit to be identified, and the procedure then returns to the step of determining the current speech segment to be determined.
  • In a specific example, suppose N first speech segments are obtained based on the to-be-recognized speech signal, and let the index of the first speech segment be n: the nth first speech segment is the first speech segment appearing in the nth position in the speech signal, with n less than or equal to N.
  • Suppose the predetermined keyword includes M predetermined word segmentation units, and let the index of the predetermined word segmentation unit be m: the mth predetermined word segmentation unit is the predetermined word segmentation unit appearing in the mth position in the predetermined keyword, with m less than or equal to M.
  • Denote the counter value by k, and assume the trigger initial value is 30.
  • the step of initially determining whether a predetermined keyword exists in the to-be-identified voice signal based on the predetermined decision logic may include the following steps S801 to S811.
  • Step S803: Determine whether n is greater than N. If yes, go to step S804; if no, go to step S805.
  • Step S805: Determine whether the probability that the nth first speech segment corresponds to the mth predetermined word segmentation unit is greater than the predetermined threshold. If yes, go to step S806; if no, go to step S808.
  • Step S806: Determine whether m is equal to M. If no, go to step S807; if yes, go to step S811.
  • Step S807: Let k equal 30, increase m by 1, and return to step S802.
  • Step S809: Determine whether k is greater than 0. If yes, return to step S802; if no, go to step S810.
  • Step S810: Let m equal 1, and return to step S802.
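  • As a minimal illustrative sketch (not part of the patent text), the counter-based decision logic of steps S801 to S811 can be written as follows; the handling of the steps whose text is not reproduced above (S801, S802, S804, S808 and S811) is inferred from the surrounding description, and the threshold value is an assumption:

```python
def preliminary_decision(probs, threshold=0.5, trigger_init=30):
    """Sketch of the preliminary decision logic (steps S801 to S811).

    probs[n][m] is the probability that the (n+1)-th first speech
    segment corresponds to the (m+1)-th predetermined word segmentation
    unit. Returns True when all M units are matched in order.
    """
    N, M = len(probs), len(probs[0])
    m, k = 0, trigger_init              # assumed initialization (S801)
    for n in range(N):                  # advance segment by segment (S802/S803)
        if probs[n][m] > threshold:     # S805
            if m == M - 1:              # S806: last unit matched
                return True             # S811: keyword preliminarily present
            m, k = m + 1, trigger_init  # S807: next unit, reset counter
        else:
            k -= 1                      # assumed S808: decrement counter
            if k <= 0:                  # S809/S810: counter exhausted,
                m = 0                   # restart from the first unit
    return False                        # S804: keyword not found

# Toy example with M = 2 units ("er", "duo") over N = 5 segments.
toy = [[0.9, 0.0], [0.1, 0.1], [0.0, 0.8], [0.0, 0.0], [0.0, 0.0]]
print(preliminary_decision(toy))  # True
```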
  • the manner of determining the first classification model may include the following steps S902 to S908.
  • S902 Acquire a sample voice signal based on a predetermined corpus, and the predetermined corpus includes a general corpus.
  • S906: Acquire the first acoustic feature of each third speech segment and the third probability corresponding to each third speech segment; the third probability includes each probability that the third speech segment corresponds to each predetermined word segmentation unit of the predetermined keyword.
  • the predetermined neural network model needs to be trained based on the sample data to obtain the first classification model.
  • In the traditional approach, sample speech signals can typically only be acquired from a dedicated corpus.
  • A dedicated corpus is a corpus established specially for a predetermined keyword; it includes speech signals corresponding to the predetermined keyword collected under various acoustic conditions. It can be understood that a different dedicated corpus needs to be established for each different predetermined keyword, and establishing a dedicated corpus is very time-consuming and labor-intensive work, which limits flexible development and rapid launch of products.
  • In this embodiment, the sample speech signal can be acquired from a general corpus, which effectively reduces this limitation.
  • Moreover, a general corpus covers a wider range of acoustic conditions, has a larger data size, and offers more reliable speech signal quality, so that recognition of predetermined keywords can be achieved efficiently and robustly.
  • Similar to the processing of the to-be-recognized speech signal, each third speech segment is obtained from the sample speech signal through framing and splicing processing, and the first acoustic feature of the third speech segment is then obtained based on the acoustic features of the sample unit frames it includes.
  • In addition, frame alignment processing is performed. The frame alignment processing determines which sample unit frames, from which frame to which frame, correspond to which predetermined word segmentation unit, so that each sample unit frame can be labeled with its corresponding predetermined word segmentation unit.
  • the first acoustic feature is similar to the acoustic feature of the first speech segment in the foregoing, and is not described herein.
  • each probability of each predetermined segmentation unit of each of the third speech segments corresponding to the predetermined keyword may be obtained based on the annotations in the general corpus.
  • each probability of each predetermined segmentation unit corresponding to the predetermined keyword of each third speech segment and the probability of corresponding second padding information may also be obtained based on the annotations in the general corpus.
  • the second padding information is similar to the first padding information in the foregoing, and is not described here.
  • the predetermined first neural network model is trained, that is, each model parameter involved in the first neural network model is determined, thereby obtaining the first classification model.
  • the manner of training the second classification model may include the following steps S1002 to S1008.
  • S1004: Generate a second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments that correspond to the fourth speech segment.
  • S1006: Acquire the fourth probability corresponding to each fourth speech segment; the fourth probability includes at least one of the probability that the fourth speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
  • S1008: Train a predetermined second neural network model based on the second prediction features and the fourth probabilities of the fourth speech segments to determine the second classification model.
  • the predetermined second neural network model needs to be trained based on the sample data to obtain a second classification model.
  • The fourth speech segment is obtained based on the sample speech signal in a way similar to how the second speech segment is obtained based on the to-be-recognized speech signal, and details are not repeated here.
  • The fourth probability differs from the second probability described above only in its object (the second probability is for the second speech segment, while the fourth probability is for the fourth speech segment); its other properties are similar and are not repeated here.
  • In some embodiments, when training the first classification model and the second classification model, cross-entropy may be used as the optimization objective, and distributed asynchronous gradient descent may be used for training, thereby determining the model parameters involved in the first neural network model and the second neural network model.
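  • The following sketch is an assumption (PyTorch with synthetic stand-in tensors, nothing specified by the patent) showing single-worker cross-entropy training of the two 128-node fully connected layers described above; the distributed asynchronous gradient descent mentioned in the text would run many such workers against shared parameters:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: flattened 61x3 second prediction features and
# 0/1 labels (keyword / not keyword) for the fourth speech segments.
features = torch.randn(256, 61 * 3)
labels = torch.randint(0, 2, (256,))

model = nn.Sequential(            # the two 128-node layers from above
    nn.Linear(61 * 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 2),            # assumed two-way output head
)
loss_fn = nn.CrossEntropyLoss()   # cross-entropy optimization objective
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):           # single-worker training loop
    opt.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    opt.step()
print(float(loss))
```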
  • It should be noted that, in the splicing processing for obtaining the second speech segment, the third speech segment, and the fourth speech segment, if the total number of unit frames before or after a unit frame is less than the corresponding preset frame number, the copy processing mentioned above in the description of the first speech segment may be used to make up the corresponding preset frame number; details are not repeated here.
  • In some embodiments, the method may further include the step of acquiring the second acoustic feature of each second speech segment. Accordingly, the first prediction feature of the second speech segment is generated based on the second acoustic feature of the second speech segment together with the first probabilities corresponding to the first speech segments that correspond to the second speech segment.
  • In other words, in addition to the first probabilities corresponding to the first speech segments that correspond to the second speech segment, the first prediction feature of the second speech segment may also include the second acoustic feature of the second speech segment.
  • the first prediction feature includes more effective feature information, which can improve the accuracy of the recognition.
  • the second acoustic feature is similar to the acoustic feature of the first speech segment in the foregoing, and is not described herein.
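  • Illustratively, the augmented first prediction feature can be formed by placing the 61 first probabilities and a 61 × 40 second acoustic feature side by side (the shapes are the running examples from the text, not requirements):

```python
import numpy as np

# Hypothetical arrays for one second speech segment: 61 rows of first
# probabilities plus 61 rows of 40-dimensional log-Mel features.
probs = np.random.rand(61, 3)
acoustic = np.random.randn(61, 40)

# Augmented first prediction feature: probabilities and the second
# acoustic feature concatenated per frame, shape (61, 43).
feature = np.concatenate([probs, acoustic], axis=1)
print(feature.shape)
```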
  • the method for acquiring each predetermined word segmentation unit of the predetermined keyword may include the following steps: performing word segmentation processing on the predetermined keyword based on the predetermined word segmentation unit, and obtaining each predetermined word segmentation unit of the predetermined keyword, wherein the predetermined The word segmentation unit includes at least one of the following three items: pinyin, phoneme, and word.
  • pinyin is used as a predetermined word segmentation unit as an example for description.
  • the word segmentation unit can be set based on actual needs (eg, recognition accuracy, system performance, etc.).
  • For example, a phoneme may be used as the predetermined word segmentation unit, or a word may be used.
  • In some embodiments, the first classification model includes sub-classification models cascaded with each other, and the number of levels of sub-classification models is greater than or equal to two.
  • Accordingly, the step of inputting the acoustic features of the first speech segments into the pre-trained first classification model to obtain the first probabilities may include: inputting, level by level, the input information corresponding to each sub-classification model into that sub-classification model, and obtaining the fifth probability output by each level of sub-classification model.
  • The input information of the first-level sub-classification model includes the acoustic features of the first speech segments corresponding to the first-level sub-classification model; the input information of each sub-classification model other than the first-level one is generated based on the fifth probability output by the previous level of sub-classification model.
  • The fifth probability output by a level of sub-classification model includes the probabilities that the first speech segments corresponding to that sub-classification model correspond to the predetermined word segmentation units corresponding to that sub-classification model.
  • The fifth probability output by the last level of sub-classification model in the first classification model is the first probability.
  • Each level of sub-classification model corresponds to its own first speech segments and its own predetermined word segmentation units, and these differ from level to level.
  • the number of stages of the sub-category model included in the first classification model can be set based on actual needs (such as system complexity and system performance requirements).
  • Taking the predetermined keyword "xiao lan jing ling" with pinyin as the segmentation unit as an example, word segmentation yields three groups of predetermined word segmentation units. The units in the first group are "xiao", "lan", "jing", and "ling"; the units in the second group are "xiao lan", "lan jing", and "jing ling"; and the units in the third group are "xiao lan jing" and "lan jing ling".
  • In this case, the first classification model may contain three cascade levels: the predetermined word segmentation units corresponding to the first-level sub-classification model are those in the first group, the units corresponding to the second-level sub-classification model are those in the second group, and the units corresponding to the third-level sub-classification model are those in the third group.
  • For ease of description, the first speech segments corresponding to the first-, second-, and third-level sub-classification models are referred to below as the first-level, second-level, and third-level first speech segments, respectively.
  • On this basis, the acoustic features of the first-level first speech segments are first input into the first-level sub-classification model, which classifies based on these acoustic features and outputs the probabilities that each first-level first speech segment corresponds to "xiao", "lan", "jing", and "ling", respectively.
  • Then, the third prediction features of the second-level first speech segments are generated based on the probabilities output by the first-level sub-classification model. Each third prediction feature is input into the second-level sub-classification model, which classifies based on the third prediction features and outputs the probabilities that each second-level first speech segment corresponds to "xiao lan", "lan jing", and "jing ling", respectively.
  • Further, the fourth prediction features of the third-level first speech segments are generated based on the probabilities output by the second-level sub-classification model. Each fourth prediction feature is input into the third-level sub-classification model, which classifies based on the fourth prediction features and outputs the probabilities that each third-level first speech segment corresponds to "xiao lan jing" and "lan jing ling", respectively; the probabilities output by the third-level sub-classification model are the first probabilities output by the first classification model. Then, the first prediction feature of each second speech segment is generated based on the first probabilities of the first speech segments corresponding to that second speech segment, each first prediction feature is input into the second classification model, and the corresponding subsequent steps are performed. A sketch of this cascaded inference follows.
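  • Below is a minimal sketch of the cascade at inference time, assuming each level is a trained classifier exposing a `predict_proba`-style interface; the interface, the context sizes, and the way the next level's features are built from the previous level's outputs are illustrative assumptions, not the patent's exact design:

```python
import numpy as np

def splice(probs, left, right):
    """Stack a clipped window of the previous level's outputs around each
    segment, forming the next level's third/fourth prediction features."""
    T = probs.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.arange(-left, right + 1)[None, :],
                  0, T - 1)
    return probs[idx].reshape(T, -1)

def run_cascade(acoustic_feats, levels, contexts):
    """acoustic_feats: (T, d) features of the first-level first speech segments.
    levels: cascaded sub-classification models, each mapping (T, *) -> (T, k).
    contexts: per-level (left, right) window sizes used to feed levels 2..n."""
    x, probs = acoustic_feats, None
    for i, model in enumerate(levels):
        probs = model.predict_proba(x)        # fifth probability of this level
        if i + 1 < len(levels):
            x = splice(probs, *contexts[i])   # input for the next level
    return probs                              # last level's output = first probability
```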
  • In an embodiment, as shown in FIG. 11, the speech keyword recognition method may include the following steps S1101 to S1111.
  • S1101: Obtain each first speech segment based on the to-be-recognized speech signal, and obtain, through a preset first classification model, the first probabilities respectively corresponding to the first speech segments; the first probability includes the probabilities that the first speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
  • S1102: Determine the current to-be-recognized word segmentation unit, which is the earliest-appearing predetermined word segmentation unit, determined based on the order of appearance of the predetermined word segmentation units in the predetermined keyword, that has not yet served as a to-be-recognized word segmentation unit.
  • S1103: Determine the current to-be-determined speech segment, which is the earliest-appearing first speech segment, determined based on the order of appearance of the first speech segments in the to-be-recognized speech signal, that has not yet served as a to-be-determined speech segment.
  • S1104: Determine whether the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold; if yes, go to S1105; if no, go to S1107.
  • S1105: Determine whether the current to-be-recognized word segmentation unit is the last predetermined word segmentation unit in the predetermined keyword; if no, return to S1102; if yes, go to S1106.
  • S1106: Preliminarily determine that the predetermined keyword exists in the to-be-recognized speech signal, and jump to S1109.
  • S1107: Determine whether the to-be-recognized word segmentation unit for which the probability last exceeded the predetermined threshold is in a valid state; if yes, return to S1103; if no, go to S1108.
  • S1108: Determine the earliest-appearing predetermined word segmentation unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and return to S1103.
  • S1109: Generate the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment.
  • S1110: Input each first prediction feature into a preset second classification model, classify based on the first prediction features through the preset second classification model, and obtain the second probabilities respectively corresponding to the second speech segments; the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
  • S1111: Determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
  • The speech keyword recognition method provided by the embodiments of this application can be applied to scenarios such as electronic device wake-up, dialogue interaction interface initialization, audio indexing and retrieval, and speech password verification. The recognition method can also serve as an important front-end processing module in an automatic speech recognition system, greatly reducing the resource occupation and consumption of that system and improving user experience. More specifically, it can be applied to smart speakers, the speech recognition work of AI Lab (Artificial Intelligence Lab), intelligent voice assistants, and the like.
  • In an embodiment, as shown in FIG. 12, a speech keyword recognition apparatus 1200 is provided, which may include the following modules 1202 to 1210.
  • The first speech segment acquisition module 1202 is configured to obtain each first speech segment based on the to-be-recognized speech signal.
  • The first probability acquisition module 1204 is configured to obtain, through a preset first classification model, the first probabilities respectively corresponding to the first speech segments; the first probability of a first speech segment includes the probabilities that the first speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
  • The prediction feature generation module 1206 is configured to obtain each second speech segment based on the to-be-recognized speech signal, and to generate the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment.
  • The second probability acquisition module 1208 is configured to classify, through the second classification model, based on the first prediction features, to obtain the second probabilities respectively corresponding to the second speech segments; the second probability corresponding to a second speech segment includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
  • The keyword recognition module 1210 is configured to determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
  • With the above speech keyword recognition apparatus, after the first probabilities corresponding to the first speech segments of the to-be-recognized speech signal are obtained based on the first classification model, there is no need to make the final determination of whether the predetermined keyword exists based on manually designed decision logic. Instead, second speech segments are obtained based on the to-be-recognized speech signal; the prediction feature of each second speech segment is generated based on the first probabilities corresponding to its associated first speech segments; the prediction features are input into the second classification model to obtain at least one of the probability that each second speech segment corresponds to the predetermined keyword and the probability that it does not; and whether the predetermined keyword exists in the to-be-recognized speech signal is finally determined based on the probabilities output by the second classification model. This effectively overcomes the traditional method's sensitivity to manually designed decision logic, thus improving universality.
  • In an embodiment, the apparatus 1200 may further include a preliminary recognition module, configured to invoke the prediction feature generation module when it is determined, based on the first probabilities and predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal.
  • In an embodiment, the preliminary recognition module may further include a current segmentation unit determining unit, a current segment identification unit, a first invoking unit, and a preliminary determining unit.
  • The current segmentation unit determining unit is configured to determine the current to-be-recognized word segmentation unit, which is the earliest-appearing predetermined word segmentation unit, determined based on the order of appearance of the units in the predetermined keyword, that has not yet served as a to-be-recognized word segmentation unit.
  • The current segment identification unit is configured to determine the current to-be-determined speech segment, which is the earliest-appearing first speech segment, determined based on the order of appearance of the first speech segments in the to-be-recognized speech signal, that has not yet served as a to-be-determined speech segment.
  • The first invoking unit is configured to invoke the current segmentation unit determining unit when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold and the current to-be-recognized word segmentation unit is not the last predetermined word segmentation unit in the predetermined keyword.
  • The preliminary determining unit is configured to determine that the predetermined keyword exists in the to-be-recognized speech signal when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold and the current to-be-recognized word segmentation unit is the last predetermined word segmentation unit in the predetermined keyword.
  • In an embodiment, the preliminary recognition module may further include a second invoking unit and a segmentation unit resetting unit.
  • The second invoking unit is configured to invoke the current segment identification unit when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized word segmentation unit for which the probability last exceeded the predetermined threshold is in a valid state.
  • The segmentation unit resetting unit is configured to, when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized word segmentation unit for which the probability last exceeded the predetermined threshold is in an invalid state, determine the earliest-appearing predetermined word segmentation unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and invoke the current segment identification unit.
  • In an embodiment, the apparatus 1200 may further include a sample data acquisition module, a first segment acquisition module, a first sample feature acquisition module, and a first model training module.
  • The sample data acquisition module is configured to acquire sample speech signals based on a predetermined corpus, the predetermined corpus including a general corpus.
  • The first segment acquisition module is configured to obtain third speech segments based on each sample speech signal.
  • The first sample feature acquisition module is configured to acquire the first acoustic feature of each third speech segment and the third probabilities respectively corresponding to the third speech segments; the third probability of a third speech segment includes the probabilities that the third speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
  • The first model training module is configured to train a predetermined first neural network model based on the first acoustic features of the third speech segments and the third probabilities, to determine the first classification model.
  • In an embodiment, the apparatus 1200 may further include a second segment acquisition module, a second sample feature acquisition module, a sample probability acquisition module, and a second model training module.
  • The second segment acquisition module is configured to obtain fourth speech segments based on each sample speech signal.
  • The second sample feature acquisition module is configured to generate the second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments associated with that fourth speech segment.
  • The sample probability acquisition module is configured to acquire the fourth probabilities respectively corresponding to the fourth speech segments, the fourth probability including at least one of the probability that the fourth speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
  • The second model training module is configured to train a predetermined second neural network model based on the second prediction features of the fourth speech segments and the fourth probabilities, to determine the second classification model.
  • In an embodiment, the apparatus 1200 further includes an acoustic feature acquisition module that acquires the second acoustic feature of each second speech segment. Accordingly, the prediction feature generation module is configured to generate the first prediction feature of each second speech segment based on the second acoustic feature of that second speech segment and the first probabilities corresponding to the first speech segments associated with it.
  • In an embodiment, the apparatus 1200 may further include a word segmentation processing module, configured to perform word segmentation on the predetermined keyword based on a predetermined segmentation unit type, to obtain the predetermined word segmentation units of the predetermined keyword; the predetermined segmentation unit type includes at least one of the following three items: pinyin, phoneme, and character.
  • In an embodiment, the first classification model includes sub-classification models cascaded with each other, the number of cascade levels being greater than or equal to two.
  • In an embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the speech keyword recognition method provided by any embodiment of this application.
  • The computer device may be the user terminal 110 of FIG. 1, and its internal structure may be as shown in FIG. 13. The computer device includes a processor, a memory, a network interface, a display screen, an input device, and a sound collection device connected by a system bus. The processor is used to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium of the computer device stores an operating system and a computer program which, when executed by the processor, enables the processor to implement the speech keyword recognition method provided by the embodiments of this application; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface is used to communicate with an external terminal through a network connection. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
  • In another embodiment, the computer device may be the server 120 shown in FIG. 1, and its internal structure diagram may be as shown in FIG. 14. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor is used to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation; when executed by the processor, the computer program implements the speech keyword recognition method provided by any embodiment of this application. The network interface is used to communicate with an external terminal through a network connection.
  • A person skilled in the art can understand that FIG. 13 and FIG. 14 are merely block diagrams of partial structures related to the solution of this application and do not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer components than those shown in FIG. 13, or combine some components, or have a different component arrangement.
  • The speech keyword recognition apparatus provided by this application can be implemented in the form of a computer program that can run on a computer device as shown in FIG. 13 or FIG. 14. The program modules constituting the apparatus may be stored in the memory of the computer device, such as the first speech segment acquisition module 1202, the first probability acquisition module 1204, the prediction feature generation module 1206, the second probability acquisition module 1208, and the keyword recognition module 1210 shown in FIG. 12. The computer program constituted by the program modules causes the processor to perform the steps in the speech keyword recognition method provided by any embodiment of this application.
  • For example, the computer device shown in FIG. 13 or FIG. 14 may perform step S202 through the first speech segment acquisition module 1202 in the speech keyword recognition apparatus 1200 shown in FIG. 12, perform step S204 through the first probability acquisition module 1204, and so on.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • Accordingly, in an embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method of any embodiment of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech keyword recognition method, including: obtaining first speech segments based on a to-be-recognized speech signal; obtaining, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword; obtaining second speech segments based on the to-be-recognized speech signal, generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment, and classifying, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and determining, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal. The recognition method can improve universality.

Description

Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device
This application claims priority to Chinese Patent Application No. 201810096472.X, entitled "Speech keyword recognition method and apparatus" and filed with the China National Intellectual Property Administration on January 31, 2018, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular, to a speech keyword recognition method and apparatus, a computer-readable storage medium, and a computer device.
Background
With the development of speech technologies and improvements in interaction experience, users are increasingly willing to interact with electronic devices through speech, for example, controlling an electronic device by voice to complete a specified task. Speech keyword recognition refers to recognizing whether a predetermined keyword exists in a continuous speech signal. It is widely used in electronic device wake-up, dialogue interaction interface initialization, audio indexing and retrieval, speech password verification, and the like.
In the traditional speech keyword recognition method, acoustic features are first extracted from the to-be-recognized speech signal and input into a pre-trained deep neural network model, and whether a predetermined keyword exists in the speech signal is then recognized based on the probabilities output by the model together with manually designed decision logic. However, the traditional method is very sensitive to the manually designed decision logic: whenever the application scenario or the predetermined keyword changes, the decision logic usually needs to be carefully re-tuned by hand to adapt to the new scenario, so universality is low.
Summary
According to various embodiments provided in this application, a speech keyword recognition method and apparatus, a computer-readable storage medium, and a computer device are provided.
A speech keyword recognition method, performed by a user terminal or a server, including the steps of:
obtaining first speech segments based on a to-be-recognized speech signal;
obtaining, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
obtaining second speech segments based on the to-be-recognized speech signal, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
classifying, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
determining, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
A speech keyword recognition apparatus, including:
a first speech segment acquisition module, configured to obtain first speech segments based on a to-be-recognized speech signal;
a first probability acquisition module, configured to obtain, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
a prediction feature generation module, configured to obtain second speech segments based on the to-be-recognized speech signal, and generate a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
a second probability acquisition module, configured to classify, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
a keyword recognition module, configured to determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
A computer-readable storage medium, storing a computer program that, when executed by a processor, causes the processor to perform the steps of:
obtaining first speech segments based on a to-be-recognized speech signal;
obtaining, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
obtaining second speech segments based on the to-be-recognized speech signal, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
classifying, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
determining, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
A computer device, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
obtaining first speech segments based on a to-be-recognized speech signal;
obtaining, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
obtaining second speech segments based on the to-be-recognized speech signal, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
classifying, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
determining, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features, objectives, and advantages of this application become apparent from the specification, the accompanying drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative efforts.
FIG. 1 is a diagram of an application environment of a speech keyword recognition method according to an embodiment;
FIG. 2 is a schematic flowchart of a speech keyword recognition method according to an embodiment;
FIG. 3 is a schematic diagram of the topology of a CNN model according to an embodiment;
FIG. 4 is a schematic architectural diagram of a speech keyword recognition system according to an embodiment;
FIG. 5 is a schematic diagram of the spectrograms of speech signals and the corresponding first probabilities according to an embodiment;
FIG. 6 is a schematic flowchart of making a preliminary determination based on predetermined decision logic according to an embodiment;
FIG. 7 is a schematic flowchart of steps added on the basis of FIG. 6 according to an embodiment;
FIG. 8 is a schematic flowchart of making a preliminary determination based on predetermined decision logic according to an embodiment;
FIG. 9 is a schematic flowchart of a method for training a first classification model according to an embodiment;
FIG. 10 is a schematic flowchart of a method for training a second classification model according to an embodiment;
FIG. 11 is a schematic flowchart of a speech keyword recognition method according to another embodiment;
FIG. 12 is a structural block diagram of a speech keyword recognition apparatus according to an embodiment;
FIG. 13 is a structural block diagram of a computer device according to an embodiment;
FIG. 14 is a structural block diagram of a computer device according to an embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain this application, not to limit it.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by a person skilled in the technical field to which this application belongs. The terms used in the specification are merely for describing specific embodiments and are not intended to limit this application.
Terms such as "first" and "second" used in this application are used to distinguish similar objects by name, but the objects themselves are not limited by these terms. These terms are interchangeable where appropriate without departing from the scope of this application.
The speech keyword recognition method provided in the embodiments of this application may be applied to the application environment shown in FIG. 1. The application environment may involve a user terminal 110 and a server 120, which communicate through a network.
Specifically, the user terminal 110 acquires a to-be-recognized speech signal and sends it to the server 120 through the network. The server 120 obtains first speech segments based on the to-be-recognized speech signal, and obtains, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability including probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword; then obtains second speech segments based on the to-be-recognized speech signal, and generates the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment; further classifies, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability including at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and then determines, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
In other embodiments, the steps from acquiring the to-be-recognized speech signal to determining, based on the second probabilities, whether the predetermined keyword exists may also be performed by the user terminal 110 without involving the server 120.
The user terminal 110 may be a mobile terminal or a desktop terminal, and the mobile terminal may include at least one of a mobile phone, a speaker, a robot, a tablet computer, a notebook computer, a personal digital assistant, a wearable device, and the like. The server 120 may be implemented as an independent physical server or a server cluster composed of multiple physical servers.
In an embodiment, as shown in FIG. 2, a speech keyword recognition method is provided. The method is described using an example in which it is performed by a computer device (such as the user terminal 110 or the server 120 in FIG. 1 above). The method may include the following steps S202 to S210.
S202: Obtain first speech segments based on a to-be-recognized speech signal.
The to-be-recognized speech signal is a speech signal in which whether a predetermined keyword exists needs to be determined. In practical applications, a user typically produces a sound signal according to actual needs (for example, the user says a sentence); the computer device collects the sound signal and converts it into an electrical signal to obtain the to-be-recognized speech signal.
A first speech segment is a first spliced frame sequence corresponding to a unit frame in the to-be-recognized speech signal. Specifically, after obtaining the to-be-recognized speech signal, the computer device first performs framing on the signal to obtain unit frames, that is, divides the signal into several small sections, each section being one unit frame; the computer device may then obtain, based on a predetermined first splicing rule, first spliced frame sequences in one-to-one correspondence with the unit frames, that is, the first speech segments.
In an embodiment, framing may be implemented through a moving window function, for example with a window length of 25 ms and a window shift of 10 ms; each resulting unit frame is then 25 ms long, and two adjacent unit frames overlap by 15 ms. A sketch of this framing follows.
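As an illustration only (16 kHz sampling is an assumption; the text specifies only the 25 ms window and 10 ms shift), the framing may be written as:

```python
import numpy as np

def frame_signal(signal, sr=16000, win_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping unit frames (25 ms window, 10 ms shift)."""
    win, hop = sr * win_ms // 1000, sr * hop_ms // 1000     # 400 and 160 samples
    n = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop: i * hop + win] for i in range(n)])

frames = frame_signal(np.random.randn(16000))   # 1 s of audio
print(frames.shape)                             # (98, 400); adjacent frames overlap by 15 ms
```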
In an embodiment, for any unit frame, based on the order in which the unit frames appear in the to-be-recognized speech signal, a first preset number of unit frames appearing before the unit frame, the unit frame itself, and a second preset number of unit frames appearing after it may be spliced to obtain the first speech segment corresponding to that unit frame.
The first and second preset numbers may be set based on the length of the predetermined word segmentation units of the predetermined keyword corresponding to the preset first classification model. For example, the predetermined keyword is "er duo" ("ear"), and its predetermined word segmentation units under the first classification model are "er" and "duo". In this case, the first preset number may be set to 10 and the second to 5: for any unit frame, the 10 frames before it, the frame itself, and the 5 frames after it are spliced, and the resulting first speech segment contains these 16 unit frames.
It should be noted that if the to-be-recognized speech signal includes N unit frames, then in order of appearance they are the 1st, 2nd, 3rd, ..., Nth unit frames. For a given unit frame, if the total number of unit frames before it is less than the first preset number, the 1st unit frame may be copied repeatedly to make up the first preset number. For example, with a first preset number of 10 and a second of 5, the first speech segment corresponding to the 1st unit frame may contain 11 copies of the 1st unit frame plus the 2nd to 6th unit frames, 16 frames in total; the first speech segment corresponding to the 3rd unit frame may contain 9 copies of the 1st unit frame plus the 2nd to 8th unit frames, 16 frames in total.
Similarly, for a given unit frame, if the total number of unit frames after it is less than the second preset number, the Nth unit frame may be copied repeatedly to make up the second preset number.
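The splicing with edge replication described above may be sketched as follows; index clipping reproduces the described copying of the 1st/Nth unit frame, and the same routine with other context sizes also yields the second, third, and fourth speech segments:

```python
import numpy as np

def splice_frames(frame_feats, n_before=10, n_after=5):
    """For each unit frame i, gather frames [i - n_before, i + n_after];
    out-of-range indices are clipped, replicating the first/last frame."""
    N = len(frame_feats)
    idx = np.clip(np.arange(N)[:, None] + np.arange(-n_before, n_after + 1)[None, :],
                  0, N - 1)
    return frame_feats[idx]                 # (N, n_before + 1 + n_after, ...)

segments = splice_frames(np.random.randn(98, 40))   # 40-dim feature per frame
print(segments.shape)   # (98, 16, 40): one first speech segment per unit frame
```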
S204: Obtain, through a preset first classification model, first probabilities respectively corresponding to the first speech segments.
The first classification model is a pre-trained neural network model. The acoustic features of the first speech segments may be input into the first classification model, which classifies the first speech segments based on their acoustic features to obtain first probabilities in one-to-one correspondence with them. The first probability corresponding to a first speech segment may include probabilities that the first speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword. The first probability may be a posterior probability.
The acoustic feature of a first speech segment may include the acoustic features of the unit frames it contains. In an embodiment, the acoustic feature of a first speech segment is a feature vector of dimension t × f, where t is the time-frame dimension, that is, the total number of unit frames in the first speech segment, and f is the spectral dimension, that is, the dimension of each unit frame's acoustic feature.
The acoustic feature of a unit frame is obtained by performing acoustic feature extraction on the unit frame. Specifically, the waveform corresponding to the unit frame is converted into a multidimensional vector that characterizes the content information contained in the unit frame; this vector may serve as the unit frame's acoustic feature. The acoustic feature of a unit frame may include any one or any combination of the Mel spectrum, the log Mel spectrum (obtained by taking the logarithm of the Mel spectrum), Mel-frequency cepstral coefficients (MFCC), and the like. Taking extraction of the log Mel spectrum as an example, a 40-dimensional vector corresponding to the unit frame may be obtained.
Taking the case where each first speech segment contains 16 unit frames and the extracted acoustic feature of each unit frame is a 40-dimensional log Mel spectrum feature as an example, t = 16 and f = 40, that is, the acoustic feature of each first speech segment includes a vector of dimension 16 × 40.
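A sketch of extracting the 40-dimensional log-Mel features, here via librosa (the library choice, FFT size, and sample rate are assumptions, not prescribed by the text):

```python
import numpy as np
import librosa

def log_mel_features(wav_path, sr=16000, n_mels=40):
    """One 40-dim log-Mel vector per 25 ms frame with a 10 ms shift."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T             # (num_frames, 40)

# feats[i-10:i+6] then gives the 16 x 40 acoustic feature of frame i's
# first speech segment (with edge replication handled as described above).
```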
A predetermined word segmentation unit may be obtained by segmenting the predetermined keyword based on a predetermined segmentation unit type. Taking the predetermined keyword "er duo" with pinyin as the segmentation unit as an example, the predetermined word segmentation units of the keyword "er duo" are "er" and "duo". Accordingly, for any first speech segment, the first probability output by the first classification model may include the probability that the first speech segment corresponds to "er" and the probability that it corresponds to "duo". As another example, for the predetermined keyword "xiao lan jing ling" with pinyin as the segmentation unit, the predetermined word segmentation units are "xiao", "lan", "jing", and "ling", and the first probability output by the first classification model may include the probabilities that the first speech segment corresponds to "xiao", "lan", "jing", and "ling", respectively.
In an embodiment, in addition to the probabilities that the first speech segment corresponds to the predetermined word segmentation units, the first probability may also include the probability that the first speech segment corresponds to first filler information. The first filler information is information other than the predetermined word segmentation units. For example, when the units are "er" and "duo", all information other than "er" and "duo" is first filler information; when the units are "xiao", "lan", "jing", and "ling", all information other than "xiao", "lan", "jing", and "ling" is first filler information.
When the first probability includes both the probabilities for the predetermined word segmentation units and the probability for the first filler information, the probabilities contained in the first probability of any first speech segment may sum to 1.
In an embodiment, the first classification model may be a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, a TDNN (Time-Delay Neural Network), a gated convolutional neural network, or the like.
Taking a CNN as the first classification model as an example, the CNN may include a convolutional layer, a max-pooling layer, fully connected layers, and a softmax layer. As described above, the input to the first classification model is the acoustic feature of a first speech segment (a feature vector of dimension t × f). As shown in FIG. 3, the convolutional layer convolves the t × f feature vector corresponding to the first speech segment with s convolution kernels (filter weight matrices) of dimension v × w to obtain s feature maps, where v is each kernel's size in the time-frame dimension, v ≤ t; w is each kernel's size in the spectral dimension, w ≤ f; and s is the number of kernels, that is, the CNN has s kernels of dimension v × w in total. The max-pooling layer then performs max pooling on the s feature maps (taking the maximum over feature points in a neighborhood, that is, downsampling) to reduce the time-frequency dimensions, obtaining s reduced feature maps. The fully connected layers then classify the s reduced feature maps, and their output is fed into the softmax layer, which normalizes it to obtain the first probability corresponding to the first speech segment.
In an embodiment, to balance network complexity and system performance, the CNN may use one convolutional layer and set the kernel size in the time-frame dimension equal to the time-frame dimension of the input feature, that is, v = t = 16. In addition, the CNN may use 5 fully connected layers, of which the first four contain 512 nodes each and the last contains 128 nodes.
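A hedged PyTorch reconstruction of this topology follows; the kernel width in the spectral dimension, the pooling size, the number of kernels, and the five output classes (four pinyin units of "xiao lan jing ling" plus filler) are assumptions where the text leaves them open:

```python
import torch
import torch.nn as nn

class FirstClassifierCNN(nn.Module):
    """One conv layer whose kernel spans all 16 time frames (v = t = 16),
    max-pooling over frequency, then five fully connected layers
    (four with 512 nodes, one with 128) and a softmax output."""
    def __init__(self, s=64, w=8, n_classes=5):
        super().__init__()
        self.conv = nn.Conv2d(1, s, kernel_size=(16, w))
        self.pool = nn.MaxPool2d(kernel_size=(1, 3))
        flat = s * ((40 - w + 1) // 3)      # feature-map size after pooling
        self.fc = nn.Sequential(
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_classes),      # scores for each unit + filler
        )

    def forward(self, x):                   # x: (batch, 1, 16, 40)
        h = self.pool(torch.relu(self.conv(x)))
        return torch.softmax(self.fc(h.flatten(1)), dim=-1)   # first probabilities

probs = FirstClassifierCNN()(torch.randn(2, 1, 16, 40))
print(probs.shape)                          # (2, 5); each row sums to 1
```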
S206: Obtain second speech segments based on the to-be-recognized speech signal, and generate the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment.
A second speech segment is a second spliced frame sequence corresponding to a unit frame in the to-be-recognized speech signal. Similarly to the first speech segments, the computer device may obtain, based on a predetermined second splicing rule, second spliced frame sequences in one-to-one correspondence with the unit frames, that is, the second speech segments.
In an embodiment, for any unit frame, based on the order in which the unit frames appear in the to-be-recognized speech signal, a third preset number of unit frames appearing before it, the frame itself, and a fourth preset number of unit frames appearing after it may be spliced to obtain the second speech segment corresponding to that unit frame.
The third and fourth preset numbers may be set based on the length of the predetermined keyword. Taking the predetermined keyword "er duo" as an example, the third preset number may be set to 40 and the fourth to 20: for any unit frame, the 40 frames before it, the frame itself, and the 20 frames after it are spliced, and the resulting second speech segment contains these 61 unit frames.
It should be noted that a second speech segment contains more unit frames in total than a first speech segment, and therefore contains more "context" information.
The first prediction feature of a second speech segment may be generated based on the first probabilities corresponding to the first speech segments associated with that second speech segment. In an embodiment, the first prediction feature may include the first probabilities corresponding to the first speech segments in one-to-one correspondence with the unit frames contained in the second speech segment. For example, a second speech segment contains 61 unit frames, each of which has a corresponding first speech segment; the second speech segment thus corresponds to 61 first speech segments, each with its own first probability, so the first prediction feature of the second speech segment includes the first probabilities of these 61 first speech segments.
Taking the case where a second speech segment contains 61 unit frames, the predetermined keyword is "er duo" with units "er" and "duo", and the first probability output by the first classification model includes the probabilities that the first speech segment corresponds to "er", to "duo", and to the first filler information as an example, the first prediction feature of the second speech segment may include a vector of dimension 61 × 3.
S208: Classify, through the second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments.
The second classification model is also a pre-trained neural network model. The first prediction features of the second speech segments may be input into the second classification model, which classifies the second speech segments based on their first prediction features to obtain second probabilities in one-to-one correspondence with them. The second probability corresponding to a second speech segment may include at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not. Like the first probability, the second probability may also be a posterior probability.
In an embodiment, the second probability may include only the probability that the second speech segment corresponds to the predetermined keyword. Taking the predetermined keyword "er duo" as an example, the second probability corresponding to the second speech segment may include the probability that the segment corresponds to "er duo"; taking "xiao lan jing ling" as an example, it may include the probability that the segment corresponds to "xiao lan jing ling".
In another embodiment, the second probability may include only the probability that the second speech segment does not correspond to the predetermined keyword. Taking "er duo" as an example, the second probability corresponding to the second speech segment may include only the probability that the segment corresponds to information other than "er duo".
In yet another embodiment, the second probability may include both the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not. In this case, the probabilities contained in the second probability of a second speech segment may sum to 1.
In an embodiment, the second classification model may be a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, a TDNN (Time-Delay Neural Network), a gated convolutional neural network, a fully connected deep neural network (FCDNN), or the like.
In addition, when the second classification model uses a fully connected neural network, the network may include two fully connected layers, each containing 128 nodes, thereby reducing complexity while maintaining system performance.
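A sketch of such a fully connected second classification model (the input dimension follows the 61 × 3 first prediction feature described above; everything else is an assumption):

```python
import torch
import torch.nn as nn

class SecondClassifierFC(nn.Module):
    """Two fully connected layers of 128 nodes each, then a 2-way softmax:
    P(segment corresponds to the keyword) vs. P(it does not)."""
    def __init__(self, in_dim=61 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, x):                    # x: (batch, 183)
        return torch.softmax(self.net(x), dim=-1)

second_probs = SecondClassifierFC()(torch.randn(4, 183))
print(second_probs.shape)                    # (4, 2)
```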
S210: Determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
After the second probabilities are obtained, the second probabilities in one-to-one correspondence with the second speech segments may be compared with a predetermined probability threshold one by one. In an embodiment, the comparison may proceed from front to back based on the order in which the second speech segments' corresponding unit frames appear in the to-be-recognized speech signal.
In an embodiment, comparing a second probability with the predetermined probability threshold may specifically be determining whether the probability, contained in the second probability, that the second speech segment corresponds to the predetermined keyword, or the probability that it does not, is greater than the corresponding predetermined probability threshold.
Taking determining whether the probability that the second speech segment corresponds to the predetermined keyword is greater than the predetermined probability threshold as an example, the process of determining, based on the obtained second probabilities, whether the predetermined keyword exists is as follows:
If the probability that the first second speech segment (the one whose corresponding unit frame appears earliest in the to-be-recognized speech signal) corresponds to the predetermined keyword is greater than the threshold, it is determined that the predetermined keyword exists in that second speech segment; a recognition result indicating that the predetermined keyword exists in the to-be-recognized speech signal is output, and the recognition process ends. Conversely, if that probability is less than the threshold, it is determined that the predetermined keyword does not exist in that second speech segment, and the probability of the second second speech segment is compared with the threshold, and so on, until the probability of some second speech segment exceeds the threshold, in which case it is determined that the keyword exists in that segment, a result indicating that the keyword exists in the signal is output, and the process ends. If the probability of the last second speech segment is still below the threshold, it is determined that the keyword does not exist in the signal, a result indicating so is output, and the process ends.
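This front-to-back comparison reduces to a simple scan, sketched below; `second_probs` is assumed to hold, in order of appearance, each second speech segment's probability of corresponding to the keyword:

```python
def keyword_detected(second_probs, threshold=0.5):
    """Scan the second probabilities front to back and stop at the first hit."""
    for p in second_probs:       # p: P(this second speech segment is the keyword)
        if p > threshold:
            return True          # keyword found: output result, end recognition
    return False                 # no segment exceeded the threshold

print(keyword_detected([0.10, 0.32, 0.91]))   # True: the third segment exceeds 0.5
```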
With the above speech keyword recognition method, after the first probabilities corresponding to the first speech segments of the to-be-recognized speech signal are obtained based on the first classification model, there is no need to make the final determination of whether the predetermined keyword exists based on manually designed decision logic. Instead, second speech segments are obtained from the to-be-recognized speech signal; the prediction feature of each second speech segment is generated from the first probabilities corresponding to its associated first speech segments; the prediction features are input into the second classification model to obtain at least one of the probability that each second speech segment corresponds to the predetermined keyword and the probability that it does not; and the final determination of whether the predetermined keyword exists is made based on the probabilities output by the second classification model. This effectively overcomes the traditional method's sensitivity to manually designed decision logic, thereby improving universality.
In addition, the traditional scheme's sensitivity to predetermined decision logic also restricts flexible product development and rapid launch, and the system generalizes poorly. Accordingly, the above speech keyword recognition method also reduces these restrictions and improves system generalization.
It should be noted that for speech keyword recognition, the recall rate and the false-alarm rate are two important metrics for evaluating system performance. The recall rate characterizes the proportion of positive cases correctly recognized as positive; the false-alarm rate characterizes the proportion of negative cases recognized as positive. In a device wake-up scenario, a low false-alarm rate means a low probability of erroneously recognizing the predetermined keyword as present when it is actually absent from the to-be-recognized speech signal.
Generally, to achieve a good balance between the system's recall rate and false-alarm rate, keywords usually need to be set carefully. One important condition is that the predetermined keyword be sufficiently long and that the syllables or phonemes it contains be sufficiently rich, for example at least four syllables or at least five phonemes, similar to "Okay Google", "tian mao jing ling", "ni hao xiao ya", "ding dong ding dong", "xiao ai tong xue", and "ni hao dian shi". Traditional schemes can achieve barely satisfactory system performance only when the predetermined keyword is long and the background of the to-be-recognized speech signal is quiet.
In the embodiments of this application, however, the first and second classification models perform keyword recognition stage by stage: first probabilities in one-to-one correspondence with the first speech segments are obtained first, and second probabilities in one-to-one correspondence with the second speech segments are then obtained based on the first probabilities of the first speech segments associated with each second speech segment. Because a second speech segment contains more "context" information, recognition accuracy is effectively improved. Moreover, the solution in the embodiments of this application not only works well when the keyword is long and the background quiet, but also achieves a good balance between recall and false alarms when the predetermined keyword is short and the background of the to-be-recognized speech signal is a real far-field environment.
In an embodiment, the step of obtaining second speech segments based on the to-be-recognized speech signal is entered when it is determined, based on the first probabilities and predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal.
In this embodiment, as shown in FIG. 4, after the first probabilities output by the first classification model are obtained and before the second speech segments are obtained, a preliminary determination of whether the predetermined keyword exists may first be made based on the first probabilities and predetermined decision logic. Only when the keyword is preliminarily determined to exist does the process proceed to the step of obtaining the second speech segments. Conversely, when the keyword is preliminarily determined not to exist, a recognition result indicating that it does not exist may be output directly, and the recognition process ends. In an embodiment, the decision logic may be implemented based on a hidden Markov model (HMM).
In this embodiment, with the added preliminary determination step, the second classification model performs classification only when the predetermined decision logic preliminarily determines that the keyword exists. On the one hand, the double determination improves recognition accuracy. On the other hand, for to-be-recognized speech signals without the predetermined keyword, the recognition process ends early without classification by the second classification model, avoiding operations of no practical significance and effectively optimizing system performance.
Moreover, the predetermined decision logic may be used to give the system a low false-negative rate (a low probability of erroneously recognizing the keyword as absent when it actually exists in the to-be-recognized speech signal); for example, in practice, the system's false-negative rate may be kept below 0.05. It should be noted that during the preliminary determination based on the predetermined decision logic, the false-positive rate may be temporarily set aside and instead optimized by the second classification network against the structure of the predetermined decision logic.
In practice, when the predetermined keyword is "er duo", consider speech signal A corresponding to "er duo" and speech signal B corresponding to "ao duo". After the first classification model classifies signal A, the output first probabilities are shown on the vertical axis of the left plot in FIG. 5; after it classifies signal B, the output first probabilities are shown on the vertical axis of the right plot. The white line in the left spectrogram of FIG. 5 marks the position where the keyword is detected in signal A based on the first probabilities and the predetermined decision logic, and the white line in the right spectrogram marks the corresponding detection in signal B. It follows that preliminary recognition based only on the first classification model and the predetermined decision logic may still produce false alarms (for signal B, which actually contains no keyword, the keyword is detected). In this embodiment, however, after the predetermined decision logic preliminarily determines that the keyword exists, further recognition is performed based on the second classification model, which effectively reduces false alarms and improves recognition accuracy.
In an embodiment, whether each predetermined word segmentation unit of the predetermined keyword exists in the to-be-recognized speech signal is detected, and whether the order in which the units appear in the signal is consistent with their order of appearance in the predetermined keyword is also detected.
As shown in FIG. 6, in an embodiment, the manner of determining, based on the first probabilities and the predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal may include the following steps S602 to S608.
S602: Determine the current to-be-recognized word segmentation unit.
The current to-be-recognized word segmentation unit is the earliest-appearing predetermined word segmentation unit, determined based on the order of appearance of the predetermined units in the predetermined keyword, that has not yet served as a to-be-recognized word segmentation unit.
Take the predetermined keyword "xiao lan jing ling" with units "xiao", "lan", "jing", and "ling" as an example. In one recognition pass, the first time the current to-be-recognized unit is determined, none of "xiao", "lan", "jing", and "ling" has served as a to-be-recognized unit, so the earliest-appearing "xiao" is determined as the current to-be-recognized unit. The second time, "lan", "jing", and "ling" have not served as to-be-recognized units, so the earliest-appearing "lan" is determined as the current unit, and so on.
S603 [S604]: Determine the current to-be-determined speech segment, which is the earliest-appearing first speech segment, determined based on the order of appearance of the first speech segments in the to-be-recognized speech signal, that has not yet served as a to-be-determined speech segment.
If the to-be-recognized speech signal includes N unit frames, there are N corresponding first speech segments. Based on the order in which the segments' corresponding unit frames appear, from front to back they are the 1st, 2nd, ..., Nth first speech segments. In one recognition pass, the first time the current to-be-determined segment is determined, none of the N first speech segments has served as a to-be-determined segment, so the 1st is determined as the current one. The second time, the 2nd through Nth have not served as to-be-determined segments, so the earliest-appearing 2nd segment is determined as the current one, and so on.
S606: When the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold and the current unit is not the last predetermined unit appearing in the keyword, return to the step of determining the current to-be-recognized word segmentation unit.
S608: When the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold and the current unit is the last predetermined unit appearing in the keyword, determine that the predetermined keyword exists in the to-be-recognized speech signal.
In this embodiment, after the current to-be-recognized unit and the current to-be-determined segment are determined, whether the probability that the segment corresponds to the unit exceeds the predetermined threshold is checked.
If it does, the current to-be-recognized unit exists in the current to-be-determined segment. It is then further checked whether the current unit is the last predetermined unit appearing in the keyword. If not, only the current unit has so far been detected in the signal, and whether the other predetermined units exist in the signal still needs to be detected, so the process returns to the step of determining the current to-be-recognized unit. If so, all predetermined units of the keyword have been detected in the signal, so it may be preliminarily determined that the predetermined keyword exists.
If it does not, the current to-be-recognized unit does not exist in the current to-be-determined segment. In an embodiment, when the probability is determined to be less than or equal to the threshold, the process may return to the step of determining the current to-be-determined segment, so that the next first speech segment is determined as the current one and detection of whether the current unit exists in that next segment continues.
In this embodiment, if the current to-be-determined segment is the first speech segment corresponding to the last unit frame of the to-be-recognized speech signal and the last predetermined unit of the keyword is not detected in it, it may be preliminarily determined that the keyword does not exist in the signal; a recognition result indicating so is output directly, and the recognition process ends.
It should be noted that, as described above, the predetermined decision logic may be used to give the system a low false-negative rate; correspondingly, in this embodiment, the predetermined threshold may likewise be adjusted so that the system achieves a low false-negative rate.
In an embodiment, on the basis of the embodiment shown in FIG. 6, as shown in FIG. 7, the speech keyword recognition method may further include the following steps S702 to S704.
S702: When the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold, and the to-be-recognized unit for which the probability last exceeded the threshold is in a valid state, return to the step of determining the current to-be-determined speech segment (S604).
S704: When the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold, and the to-be-recognized unit for which the probability last exceeded the threshold is in an invalid state, determine the earliest-appearing predetermined unit of the keyword as the current to-be-recognized unit, and return to the step of determining the current to-be-determined speech segment (S604).
It should be noted that the following can occur for a to-be-recognized speech signal: all predetermined units of the keyword exist in the signal, and their order of appearance in the signal is consistent with their order in the keyword, yet in the signal the units do not form the keyword contiguously but are separated by other filler information. For example, for the keyword "xiao lan jing ling" with units "xiao", "lan", "jing", and "ling", the signal contains not "xiao lan jing ling" but "xiao peng you ai lan jing ling", separated by "peng you ai". In this case the keyword is actually absent from the signal, yet it might still be recognized as present, that is, a false alarm occurs.
Therefore, in this embodiment, when the probability that the current to-be-determined segment corresponds to the current to-be-recognized unit is determined to be at or below the threshold, whether the unit for which the probability last exceeded the threshold is in a valid state is further checked. If so, the process returns directly to the step of determining the current to-be-determined segment. If not, the earliest-appearing predetermined unit of the keyword is determined as the current to-be-recognized unit before returning to that step; for example, for the keyword "xiao lan jing ling" with units "xiao", "lan", "jing", and "ling", the earliest-appearing "xiao" is determined as the current to-be-recognized unit, and the process then returns to the step of determining the current to-be-determined segment.
In an embodiment, whether the to-be-recognized unit for which the probability last exceeded the threshold is in a valid state may be determined through a count value. Whenever the probability that the current to-be-determined segment corresponds to the current to-be-recognized unit is determined to exceed the threshold but the current unit is not the last predetermined unit in the keyword, the current count is first set to a predetermined trigger initial value (a positive number set from business experience, such as 30) before the process returns to the step of determining the current to-be-recognized unit.
And whenever the probability is determined to be at or below the threshold, the current count is decremented by a predetermined adjustment value (for example, by 1) to update it, and whether the current count exceeds a predetermined standard value (such as 0) is checked. If it does, the unit for which the probability last exceeded the threshold is in a valid state, and the process may return directly to the step of determining the current to-be-determined segment. If not, that unit is already in an invalid state, so the earliest-appearing predetermined unit of the keyword may be determined as the current to-be-recognized unit before the process returns to the step of determining the current to-be-determined segment.
In an embodiment, N first speech segments are obtained from the to-be-recognized speech signal, indexed by n, so that the nth first speech segment is the one ranked nth from front to back in order of appearance in the signal, n ≤ N. The predetermined keyword includes M predetermined word segmentation units, indexed by m, so that the mth unit is the one ranked mth from front to back in order of appearance in the keyword, m ≤ M. The count value is k, with an assumed trigger initial value of 30. As shown in FIG. 8, in this embodiment, the step of preliminarily determining, based on the predetermined decision logic, whether the predetermined keyword exists in the to-be-recognized speech signal may include the following steps S801 to S811.
S801: Set n = 0, m = 1, and k = 0.
S802: Increase n by 1.
S803: Determine whether n > N; if yes, jump to S804; if no, jump to S805.
S804: Preliminarily determine that the predetermined keyword does not exist in the to-be-recognized speech signal, and end the process.
S805: Determine whether the probability that the nth first speech segment corresponds to the mth predetermined word segmentation unit is greater than the predetermined threshold; if yes, jump to S806; if no, jump to S808.
S806: Determine whether m equals M; if no, jump to S807; if yes, jump to S811.
S807: Set k = 30, increase m by 1, and return to S802.
S808: Decrease k by 1.
S809: Determine whether k > 0; if yes, return to S802; if no, jump to S810.
S810: Set m = 1, and return to S802.
S811: Preliminarily determine that the predetermined keyword exists in the to-be-recognized speech signal, and end the process.
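Steps S801 to S811 translate almost directly into code. A sketch follows, where `first_probs[n][m]` is assumed to hold the probability that the (n+1)-th first speech segment corresponds to the (m+1)-th predetermined word segmentation unit:

```python
def preliminary_detect(first_probs, threshold, k_init=30):
    """Decision logic of FIG. 8 over an N x M matrix of first probabilities;
    returns True if the keyword is preliminarily determined to exist."""
    N, M = len(first_probs), len(first_probs[0])
    m, k = 0, 0                            # S801 (0-based unit index; counter)
    for n in range(N):                     # S802/S803: walk the segments
        if first_probs[n][m] > threshold:  # S805
            if m == M - 1:                 # S806 -> S811: last unit matched
                return True
            m, k = m + 1, k_init           # S807: next unit, re-arm the counter
        else:
            k -= 1                         # S808
            if k <= 0:                     # S809 -> S810: match went stale
                m = 0                      # start over from the first unit
    return False                           # S804: segments exhausted
```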
In an embodiment, as shown in FIG. 9, the manner of determining the first classification model may include the following steps S902 to S908.
S902: Acquire sample speech signals based on a predetermined corpus, the predetermined corpus including a general corpus.
S904: Obtain third speech segments based on each sample speech signal.
S906: Acquire the first acoustic feature of each third speech segment and the third probability corresponding to each third speech segment, the third probability including probabilities that the third speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
S908: Train a predetermined first neural network model based on the first acoustic features of the third speech segments and the third probabilities, to determine the first classification model.
It can be understood that before classification is performed through the first classification model, a predetermined neural network model needs to be trained on sample data to obtain the first classification model.
In the traditional scheme, sample speech signals can usually be acquired only from a dedicated corpus, that is, a corpus specially built for the predetermined keyword, containing speech signals of the keyword collected under various acoustic conditions. It can be understood that different predetermined keywords require different dedicated corpora, and building a dedicated corpus is very time- and labor-intensive, which restricts flexible product development and rapid launch.
Therefore, in this embodiment, sample speech signals may be acquired from a general corpus, which effectively reduces those restrictions. Moreover, a general corpus has the advantages of covering broader acoustic conditions, having a larger data scale, and offering better-guaranteed speech quality, so the predetermined keyword can be recognized efficiently and robustly.
It can be understood that in a general corpus, each speech signal has a corresponding annotation characterizing its content information. In this embodiment, after the sample speech signals are obtained, third speech segments are obtained through framing and splicing, similarly to the processing of the to-be-recognized speech signal, and the first acoustic feature of each third speech segment is obtained based on the acoustic features of the sample unit frames it contains. Unlike the processing of the to-be-recognized signal, however, frame alignment is also required when processing the sample signals: through frame alignment, it is determined which span of sample unit frames of the sample signal carries an annotation corresponding to which predetermined word segmentation unit. The first acoustic feature is similar to the acoustic feature of the first speech segment described above and is not repeated here.
In an embodiment, the probabilities that each third speech segment corresponds to the predetermined word segmentation units of the keyword may be obtained based on the annotations in the general corpus. In another embodiment, the probabilities for the predetermined units together with the probability for second filler information may also be obtained based on the annotations. The second filler information is similar to the first filler information described above and is not repeated here.
Further, the predetermined first neural network model is trained based on the first acoustic features and third probabilities of the third speech segments, that is, the model parameters of the first neural network model are determined, thereby obtaining the first classification model.
In an embodiment, on the basis of the embodiment shown in FIG. 9, as shown in FIG. 10, the manner of training the second classification model may include the following steps S1002 to S1008.
S1002: Obtain fourth speech segments based on each sample speech signal.
S1004: Generate the second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments associated with that fourth speech segment.
S1006: Acquire the fourth probabilities respectively corresponding to the fourth speech segments, the fourth probability including at least one of the probability that the fourth speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
S1008: Train a predetermined second neural network model based on the second prediction features of the fourth speech segments and the fourth probabilities, to determine the second classification model.
Similarly to the first classification model, before classification is performed through the second classification model, the predetermined second neural network model needs to be trained on sample data to obtain the second classification model.
In this embodiment, obtaining fourth speech segments based on the sample speech signals is similar to obtaining second speech segments based on the to-be-recognized speech signal and is not repeated here. Likewise, the fourth probability is similar in nature to the second probability described above, except for the object it concerns (the second probability concerns second speech segments; the fourth probability concerns fourth speech segments), and is not repeated here either.
It should be noted that training may target the optimization of cross entropy and use distributed asynchronous gradient descent to determine the model parameters involved in the first and second neural network models.
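A minimal single-process sketch of such cross-entropy training follows (the original uses distributed asynchronous gradient descent; the optimizer, learning rate, and data pipeline below are placeholder assumptions, and the model is assumed to output unnormalized scores since CrossEntropyLoss applies log-softmax internally):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3):
    """Optimize cross entropy over (feature, label) batches, as done for both
    the first and the second neural network models."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # the cross-entropy training target
    model.train()
    for _ in range(epochs):
        for x, y in loader:                  # y: segmentation-unit / keyword labels
            opt.zero_grad()
            loss = loss_fn(model(x), y)      # model(x): raw class scores (logits)
            loss.backward()
            opt.step()
```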
In addition, when splicing is performed to obtain the second, third, and fourth speech segments, if the total number of unit frames before or after a given unit frame is less than the corresponding preset number, the copy processing mentioned above in the description of obtaining the first speech segments may be used to make up the corresponding preset number; details are not repeated here.
In an embodiment, before the first prediction feature of each second speech segment is generated, the method may further include the step of acquiring the second acoustic feature of each second speech segment. Accordingly, the first prediction feature of a second speech segment is generated based on the second acoustic feature of that second speech segment and the first probabilities corresponding to the first speech segments associated with it.
In this embodiment, the first prediction feature of a second speech segment may include, in addition to the first probabilities corresponding to its associated first speech segments, the second acoustic feature of the second speech segment. The first prediction feature then carries more effective feature information, which can improve recognition accuracy. The second acoustic feature is similar to the acoustic feature of the first speech segment described above and is not repeated here.
In an embodiment, the method for obtaining the predetermined word segmentation units of the predetermined keyword may include the following step: performing word segmentation on the predetermined keyword based on a predetermined segmentation unit type, to obtain the predetermined word segmentation units of the predetermined keyword, where the predetermined segmentation unit type includes at least one of the following three items: pinyin, phoneme, and character.
It should be noted that the foregoing embodiments use pinyin as the predetermined segmentation unit for illustration. In this application, however, the segmentation unit can be set based on actual needs (such as recognition accuracy and system performance); for example, phonemes or characters may also be used as the predetermined segmentation unit.
In an embodiment, the first classification model includes sub-classification models cascaded with each other, the number of cascade levels being greater than or equal to 2.
Accordingly, the step of inputting the acoustic features of the first speech segments into the pre-trained first classification model to obtain the first probabilities that the first speech segments respectively correspond to the predetermined word segmentation units of the predetermined keyword may include: inputting, level by level, the input information corresponding to each sub-classification model into that sub-classification model, and obtaining the fifth probability output by each level.
The input information of the first-level sub-classification model includes the acoustic features of the first speech segments corresponding to the first level; the input information of every sub-classification model other than the first level is generated based on the fifth probability output by the previous level.
Moreover, for any level, the fifth probability output by that level's sub-classification model includes the probabilities that the first speech segments corresponding to that level respectively correspond to the predetermined word segmentation units of the predetermined keyword associated with that level. In addition, the fifth probability output by the last-level sub-classification model in the first classification model is the first probability.
It should be noted that each sub-classification model has its own corresponding first speech segments and predetermined word segmentation units, which differ from level to level. In addition, the number of cascade levels contained in the first classification model can be set based on actual needs (such as system complexity and system performance requirements).
Taking the predetermined keyword "xiao lan jing ling" with pinyin as the segmentation unit as an example, word segmentation yields the following three groups of predetermined word segmentation units: the first group contains "xiao", "lan", "jing", and "ling"; the second group contains "xiao lan", "lan jing", and "jing ling"; and the third group contains "xiao lan jing" and "lan jing ling".
In this case, the number of cascade levels in the first classification model may be 3. Accordingly, the predetermined word segmentation units corresponding to the first-level sub-classification model are those in the first group, the units corresponding to the second-level sub-classification model are those in the second group, and the units corresponding to the third-level sub-classification model are those in the third group.
For ease of description, the first speech segments corresponding to the first-, second-, and third-level sub-classification models are referred to below as the first-level, second-level, and third-level first speech segments, respectively.
On this basis, in this embodiment, the acoustic features of the first-level first speech segments are first input into the first-level sub-classification model, which classifies based on these acoustic features and outputs the probabilities that each first-level first speech segment corresponds to "xiao", "lan", "jing", and "ling", respectively.
Then, the third prediction features of the second-level first speech segments are generated based on the probabilities output by the first-level sub-classification model. Each third prediction feature is input into the second-level sub-classification model, which classifies based on the third prediction features and outputs the probabilities that each second-level first speech segment corresponds to "xiao lan", "lan jing", and "jing ling", respectively.
Further, the fourth prediction features of the third-level first speech segments are generated based on the probabilities output by the second-level sub-classification model. Each fourth prediction feature is input into the third-level sub-classification model, which classifies based on the fourth prediction features and outputs the probabilities that each third-level first speech segment corresponds to "xiao lan jing" and "lan jing ling", respectively; the probabilities output by the third-level sub-classification model are the first probabilities output by the first classification model. Then, the first prediction feature of each second speech segment is generated based on the first probabilities of the first speech segments corresponding to that second speech segment, each first prediction feature is input into the second classification model, and the corresponding subsequent steps are performed.
In an embodiment, as shown in FIG. 11, the speech keyword recognition method may include the following steps S1101 to S1111.
S1101: Obtain each first speech segment based on the to-be-recognized speech signal, and obtain, through a preset first classification model, the first probabilities respectively corresponding to the first speech segments; the first probability includes probabilities that the first speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
S1102: Determine the current to-be-recognized word segmentation unit, which is the earliest-appearing predetermined word segmentation unit, determined based on the order of appearance of the predetermined units in the predetermined keyword, that has not yet served as a to-be-recognized word segmentation unit.
S1103: Determine the current to-be-determined speech segment, which is the earliest-appearing first speech segment, determined based on the order of appearance of the first speech segments in the to-be-recognized speech signal, that has not yet served as a to-be-determined speech segment.
S1104: Determine whether the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold; if yes, jump to S1105; if no, jump to S1107.
S1105: Determine whether the current to-be-recognized word segmentation unit is the last predetermined word segmentation unit appearing in the predetermined keyword; if no, return to S1102; if yes, jump to S1106.
S1106: Preliminarily determine that the predetermined keyword exists in the to-be-recognized speech signal, and jump to S1109.
S1107: Determine whether the to-be-recognized word segmentation unit corresponding to the last determination that the probability exceeded the predetermined threshold is in a valid state; if yes, return to S1103; if no, jump to S1108.
S1108: Determine the earliest-appearing predetermined word segmentation unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and return to S1103.
S1109: Generate the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment.
S1110: Input each first prediction feature into a preset second classification model, classify based on the first prediction features through the preset second classification model, and obtain the second probabilities respectively corresponding to the second speech segments; the second probability includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
S1111: Determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
It should be noted that the technical features of the steps in this embodiment may be the same as those of the corresponding steps in the foregoing embodiments and are not repeated here.
It should be understood, where reasonable, that although the steps in the flowcharts of the foregoing embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in each flowchart may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments; their execution order is not necessarily sequential either, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It should be noted that the speech keyword recognition method provided by the embodiments of this application can be applied to scenarios such as electronic device wake-up, dialogue interaction interface initialization, audio indexing and retrieval, and speech password verification. In addition, the recognition method can serve as an important front-end processing module in an automatic speech recognition system, greatly reducing the resource occupation and consumption of the automatic speech recognition system and improving user experience. More specifically, it can be applied to smart speakers, the speech recognition of AI Lab (Artificial Intelligence Lab), intelligent voice assistants, and the like.
In an embodiment, as shown in FIG. 12, a speech keyword recognition apparatus 1200 is provided, which may include the following modules 1202 to 1210.
The first speech segment acquisition module 1202 is configured to obtain first speech segments based on the to-be-recognized speech signal.
The first probability acquisition module 1204 is configured to obtain, through a preset first classification model, first probabilities respectively corresponding to the first speech segments; the first probability of a first speech segment includes probabilities that the first speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
The prediction feature generation module 1206 is configured to obtain second speech segments based on the to-be-recognized speech signal, and generate the first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment.
The second probability acquisition module 1208 is configured to classify, through the second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments; the second probability corresponding to a second speech segment includes at least one of the probability that the second speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
The keyword recognition module 1210 is configured to determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
With the above speech keyword recognition apparatus, after the first probabilities corresponding to the first speech segments of the to-be-recognized speech signal are obtained based on the first classification model, there is no need to make the final determination of whether the predetermined keyword exists based on manually designed decision logic. Instead, second speech segments are obtained based on the to-be-recognized speech signal; the prediction feature of each second speech segment is generated based on the first probabilities corresponding to the first speech segments in one-to-one correspondence with its unit frames; the prediction features are input into the second classification model to obtain at least one of the probability that each second speech segment corresponds to the predetermined keyword and the probability that it does not; and whether the predetermined keyword exists in the to-be-recognized speech signal is finally determined based on the probabilities output by the second classification model. This effectively overcomes the traditional method's sensitivity to manually designed decision logic, thus improving universality.
In an embodiment, the apparatus 1200 may further include a preliminary recognition module, configured to invoke the prediction feature generation module when it is determined, based on the first probabilities and predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal.
In an embodiment, the preliminary recognition module may further include a current segmentation unit determining unit, a current segment identification unit, a first invoking unit, and a preliminary determining unit.
The current segmentation unit determining unit is configured to determine the current to-be-recognized word segmentation unit, which is the earliest-appearing predetermined word segmentation unit, determined based on the order of appearance of the predetermined units in the predetermined keyword, that has not yet served as a to-be-recognized word segmentation unit.
The current segment identification unit is configured to determine the current to-be-determined speech segment, which is the earliest-appearing first speech segment, determined based on the order of appearance of the first speech segments in the to-be-recognized speech signal, that has not yet served as a to-be-determined speech segment.
The first invoking unit is configured to invoke the current segmentation unit determining unit when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold and the current unit is not the last predetermined unit appearing in the predetermined keyword.
The preliminary determining unit is configured to determine that the predetermined keyword exists in the to-be-recognized speech signal when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold and the current unit is the last predetermined unit appearing in the predetermined keyword.
In an embodiment, the preliminary recognition module may further include a second invoking unit and a segmentation unit resetting unit.
The second invoking unit is configured to invoke the current segment identification unit when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized unit for which the probability last exceeded the threshold is in a valid state.
The segmentation unit resetting unit is configured to, when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized unit for which the probability last exceeded the threshold is in an invalid state, determine the earliest-appearing predetermined unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and invoke the current segment identification unit.
In an embodiment, the apparatus 1200 may further include a sample data acquisition module, a first segment acquisition module, a first sample feature acquisition module, and a first model training module.
The sample data acquisition module is configured to acquire sample speech signals based on a predetermined corpus, the predetermined corpus including a general corpus.
The first segment acquisition module is configured to obtain third speech segments based on each sample speech signal.
The first sample feature acquisition module is configured to acquire the first acoustic feature of each third speech segment and the third probabilities respectively corresponding to the third speech segments; the third probability of a third speech segment includes probabilities that the third speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword.
The first model training module is configured to train a predetermined first neural network model based on the first acoustic features of the third speech segments and the third probabilities, to determine the first classification model.
In an embodiment, the apparatus 1200 may further include a second segment acquisition module, a second sample feature acquisition module, a sample probability acquisition module, and a second model training module.
The second segment acquisition module is configured to obtain fourth speech segments based on each sample speech signal.
The second sample feature acquisition module is configured to generate the second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments associated with that fourth speech segment.
The sample probability acquisition module is configured to acquire the fourth probabilities respectively corresponding to the fourth speech segments, the fourth probability including at least one of the probability that the fourth speech segment corresponds to the predetermined keyword and the probability that it does not correspond to the predetermined keyword.
The second model training module is configured to train a predetermined second neural network model based on the second prediction features of the fourth speech segments and the fourth probabilities, to determine the second classification model.
In an embodiment, the apparatus 1200 further includes an acoustic feature acquisition module that acquires the second acoustic feature of each second speech segment. Accordingly, the prediction feature generation module is configured to generate the first prediction feature of each second speech segment based on the second acoustic feature of that second speech segment and the first probabilities corresponding to the first speech segments associated with it.
In an embodiment, the apparatus 1200 may further include a word segmentation processing module, configured to perform word segmentation on the predetermined keyword based on a predetermined segmentation unit type, to obtain the predetermined word segmentation units of the predetermined keyword; the predetermined segmentation unit type includes at least one of the following three items: pinyin, phoneme, and character.
In an embodiment, the first classification model includes sub-classification models cascaded with each other, the number of cascade levels being greater than or equal to 2.
In an embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the speech keyword recognition method provided by any embodiment of this application.
In an embodiment, the computer device may be the user terminal 110 in FIG. 1, and its internal structure may be as shown in FIG. 13. The computer device includes a processor, a memory, a network interface, a display screen, an input device, and a sound collection device connected by a system bus. The processor is used to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium of the computer device stores an operating system and a computer program which, when executed by the processor, enables the processor to implement the speech keyword recognition method provided by the embodiments of this application; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface is used to communicate with an external terminal through a network connection. The display screen may be a liquid crystal display or an electronic ink display. The input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
In another embodiment, the computer device may be the server 120 shown in FIG. 1, and its internal structure diagram may be as shown in FIG. 14. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor is used to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, the internal memory provides an environment for their operation, and the computer program, when executed by the processor, implements the speech keyword recognition method provided by any embodiment of this application. The network interface is used to communicate with an external terminal through a network connection.
A person skilled in the art can understand that the structures shown in FIG. 13 and FIG. 14 are merely block diagrams of partial structures related to the solution of this application and do not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in FIG. 13, or combine some components, or have a different component arrangement.
In an embodiment, the speech keyword recognition apparatus provided by this application can be implemented in the form of a computer program that can run on a computer device as shown in FIG. 13 or FIG. 14. The program modules constituting the apparatus may be stored in the memory of the computer device, such as the first speech segment acquisition module 1202, the first probability acquisition module 1204, the prediction feature generation module 1206, the second probability acquisition module 1208, and the keyword recognition module 1210 shown in FIG. 12. The computer program constituted by the program modules causes the processor to perform the steps in the speech keyword recognition method provided by any embodiment of this application.
For example, the computer device shown in FIG. 13 or FIG. 14 may perform step S202 through the first speech segment acquisition module 1202 in the speech keyword recognition apparatus 1200 shown in FIG. 12, perform step S204 through the first probability acquisition module 1204, and so on.
A person of ordinary skill in the art can understand that all or some of the processes in the methods of the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the foregoing methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Accordingly, in an embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method of any embodiment of this application.
The technical features of the foregoing embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the foregoing embodiments are described; however, as long as a combination of these technical features involves no contradiction, it shall be considered within the scope of this specification.
The foregoing embodiments merely express several implementations of this application, and their descriptions are relatively specific and detailed, but they shall not therefore be construed as limiting the patent scope of this application. It should be noted that a person of ordinary skill in the art may further make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. A speech keyword recognition method, performed by a user terminal or a server, comprising the steps of:
    obtaining first speech segments based on a to-be-recognized speech signal;
    obtaining, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability comprising probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
    obtaining second speech segments based on the to-be-recognized speech signal, and generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
    classifying, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability comprising at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
    determining, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
  2. The method according to claim 1, wherein the step of obtaining second speech segments based on the to-be-recognized speech signal is entered when it is determined, based on the first probabilities and predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal.
  3. The method according to claim 2, wherein the manner of determining, based on the first probabilities and the predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal comprises:
    determining a current to-be-recognized word segmentation unit, the current to-be-recognized word segmentation unit being the earliest-appearing predetermined word segmentation unit, determined based on an order of appearance of the predetermined word segmentation units in the predetermined keyword, that has not served as a to-be-recognized word segmentation unit;
    determining a current to-be-determined speech segment, the current to-be-determined speech segment being the earliest-appearing first speech segment, determined based on an order of appearance of the first speech segments in the to-be-recognized speech signal, that has not served as a to-be-determined speech segment;
    when a probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold, and the current to-be-recognized word segmentation unit is not the last predetermined word segmentation unit appearing in the predetermined keyword, returning to the step of determining a current to-be-recognized word segmentation unit; and
    when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold, and the current to-be-recognized word segmentation unit is the last predetermined word segmentation unit appearing in the predetermined keyword, determining that the predetermined keyword exists in the to-be-recognized speech signal.
  4. The method according to claim 3, further comprising:
    when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold, and the to-be-recognized word segmentation unit corresponding to the last determination that the probability was greater than the predetermined threshold is in a valid state, returning to the step of determining a current to-be-determined speech segment; and
    when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold, and the to-be-recognized word segmentation unit corresponding to the last determination that the probability was greater than the predetermined threshold is in an invalid state, determining the earliest-appearing predetermined word segmentation unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and returning to the step of determining a current to-be-determined speech segment.
  5. The method according to claim 1, wherein the manner of determining the first classification model comprises:
    acquiring sample speech signals based on a predetermined corpus, the predetermined corpus comprising a general corpus;
    obtaining third speech segments based on each of the sample speech signals;
    acquiring a first acoustic feature of each third speech segment and third probabilities respectively corresponding to the third speech segments, the third probability comprising probabilities that the third speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword; and
    training a predetermined first neural network model based on the first acoustic features of the third speech segments and the third probabilities, to determine the first classification model.
  6. The method according to claim 5, wherein the manner of determining the second classification model comprises:
    obtaining fourth speech segments based on each of the sample speech signals;
    generating a second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments associated with that fourth speech segment;
    acquiring fourth probabilities respectively corresponding to the fourth speech segments, the fourth probability comprising at least one of a probability that the fourth speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
    training a predetermined second neural network model based on the second prediction features of the fourth speech segments and the fourth probabilities, to determine the second classification model.
  7. The method according to claim 1, wherein before the generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment, the method further comprises:
    acquiring a second acoustic feature of each second speech segment; and
    the generating a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment comprises:
    generating the first prediction feature of each second speech segment based on the second acoustic feature of that second speech segment and the first probabilities corresponding to the first speech segments associated with it.
  8. The method according to claim 1, wherein the manner of acquiring the predetermined word segmentation units of the predetermined keyword comprises:
    performing word segmentation on the predetermined keyword based on a predetermined segmentation unit type, to obtain the predetermined word segmentation units of the predetermined keyword, the predetermined segmentation unit type comprising at least one of pinyin, phoneme, and character.
  9. The method according to any one of claims 1 to 8, wherein the first classification model comprises sub-classification models cascaded with each other, the number of cascade levels of the sub-classification models being greater than or equal to 2.
  10. A speech keyword recognition apparatus, comprising:
    a first speech segment acquisition module, configured to obtain first speech segments based on a to-be-recognized speech signal;
    a first probability acquisition module, configured to obtain, through a preset first classification model, first probabilities respectively corresponding to the first speech segments, the first probability comprising probabilities that the first speech segment respectively corresponds to predetermined word segmentation units of a predetermined keyword;
    a prediction feature generation module, configured to obtain second speech segments based on the to-be-recognized speech signal, and generate a first prediction feature of each second speech segment based on the first probabilities corresponding to the first speech segments associated with that second speech segment;
    a second probability acquisition module, configured to classify, through a preset second classification model, based on the first prediction features, to obtain second probabilities respectively corresponding to the second speech segments, the second probability comprising at least one of a probability that the second speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
    a keyword recognition module, configured to determine, based on the second probabilities, whether the predetermined keyword exists in the to-be-recognized speech signal.
  11. The apparatus according to claim 10, further comprising:
    a preliminary recognition module, configured to invoke the prediction feature generation module when it is determined, based on the first probabilities and predetermined decision logic, that the predetermined keyword exists in the to-be-recognized speech signal.
  12. The apparatus according to claim 11, wherein the preliminary recognition module comprises:
    a current segmentation unit determining unit, configured to determine a current to-be-recognized word segmentation unit, the current to-be-recognized word segmentation unit being the earliest-appearing predetermined word segmentation unit, determined based on an order of appearance of the predetermined word segmentation units in the predetermined keyword, that has not served as a to-be-recognized word segmentation unit;
    a current segment identification unit, configured to determine a current to-be-determined speech segment, the current to-be-determined speech segment being the earliest-appearing first speech segment, determined based on an order of appearance of the first speech segments in the to-be-recognized speech signal, that has not served as a to-be-determined speech segment;
    a first invoking unit, configured to invoke the current segmentation unit determining unit when a probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than a predetermined threshold and the current to-be-recognized word segmentation unit is not the last predetermined word segmentation unit appearing in the predetermined keyword; and
    a preliminary determining unit, configured to determine that the predetermined keyword exists in the to-be-recognized speech signal when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is greater than the predetermined threshold and the current to-be-recognized word segmentation unit is the last predetermined word segmentation unit appearing in the predetermined keyword.
  13. The apparatus according to claim 12, wherein the preliminary recognition module comprises:
    a second invoking unit, configured to invoke the current segment identification unit when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized word segmentation unit corresponding to the last determination that the probability was greater than the predetermined threshold is in a valid state; and
    a segmentation unit resetting unit, configured to, when the probability that the current to-be-determined speech segment corresponds to the current to-be-recognized word segmentation unit is less than or equal to the predetermined threshold and the to-be-recognized word segmentation unit corresponding to the last determination that the probability was greater than the predetermined threshold is in an invalid state, determine the earliest-appearing predetermined word segmentation unit of the predetermined keyword as the current to-be-recognized word segmentation unit, and invoke the current segment identification unit.
  14. The apparatus according to claim 10, further comprising:
    a sample data acquisition module, configured to acquire sample speech signals based on a predetermined corpus, the predetermined corpus comprising a general corpus;
    a first segment acquisition module, configured to obtain third speech segments based on each sample speech signal;
    a first sample feature acquisition module, configured to acquire a first acoustic feature of each third speech segment and third probabilities respectively corresponding to the third speech segments, the third probability comprising probabilities that the third speech segment respectively corresponds to the predetermined word segmentation units of the predetermined keyword; and
    a first model training module, configured to train a predetermined first neural network model based on the first acoustic features of the third speech segments and the third probabilities, to determine the first classification model.
  15. The apparatus according to claim 14, further comprising:
    a second segment acquisition module, configured to obtain fourth speech segments based on each sample speech signal;
    a second sample feature acquisition module, configured to generate a second prediction feature of each fourth speech segment based on the third probabilities corresponding to the third speech segments associated with that fourth speech segment;
    a sample probability acquisition module, configured to acquire fourth probabilities respectively corresponding to the fourth speech segments, the fourth probability comprising at least one of a probability that the fourth speech segment corresponds to the predetermined keyword and a probability that it does not correspond to the predetermined keyword; and
    a second model training module, configured to train a predetermined second neural network model based on the second prediction features of the fourth speech segments and the fourth probabilities, to determine the second classification model.
  16. The apparatus according to claim 10, further comprising:
    an acoustic feature acquisition module, configured to acquire a second acoustic feature of each second speech segment;
    wherein the second sample feature acquisition module is configured to generate the first prediction feature of each second speech segment based on the second acoustic feature of that second speech segment and the first probabilities corresponding to the first speech segments associated with it.
  17. The apparatus according to claim 10, further comprising:
    a word segmentation processing module, configured to perform word segmentation on the predetermined keyword based on a predetermined segmentation unit type, to obtain the predetermined word segmentation units of the predetermined keyword, the predetermined segmentation unit type comprising at least one of pinyin, phoneme, and character.
  18. The apparatus according to any one of claims 10 to 17, wherein the first classification model comprises sub-classification models cascaded with each other, the number of cascade levels of the sub-classification models being greater than or equal to 2.
  19. A computer-readable storage medium, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 9.
  20. A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 9.
PCT/CN2019/072590 2018-01-31 2019-01-22 语音关键词的识别方法、装置、计算机可读存储介质及计算机设备 WO2019149108A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020540799A JP7005099B2 (ja) 2018-01-31 2019-01-22 音声キーワードの認識方法、装置、コンピュータ読み取り可能な記憶媒体、及びコンピュータデバイス
EP19747243.4A EP3748629B1 (en) 2018-01-31 2019-01-22 Identification method for voice keywords, computer-readable storage medium, and computer device
US16/884,350 US11222623B2 (en) 2018-01-31 2020-05-27 Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810096472.XA CN108305617B (zh) 2018-01-31 2018-01-31 语音关键词的识别方法和装置
CN201810096472.X 2018-01-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/884,350 Continuation US11222623B2 (en) 2018-01-31 2020-05-27 Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2019149108A1 true WO2019149108A1 (zh) 2019-08-08

Family

ID=62850811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/072590 WO2019149108A1 (zh) 2018-01-31 2019-01-22 语音关键词的识别方法、装置、计算机可读存储介质及计算机设备

Country Status (5)

Country Link
US (1) US11222623B2 (zh)
EP (1) EP3748629B1 (zh)
JP (1) JP7005099B2 (zh)
CN (3) CN110444193B (zh)
WO (1) WO2019149108A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081241A (zh) * 2019-11-20 2020-04-28 Oppo广东移动通信有限公司 设备误唤醒的数据检测方法、装置、移动终端和存储介质
CN111508493A (zh) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 语音唤醒方法、装置、电子设备及存储介质
CN111951807A (zh) * 2020-08-21 2020-11-17 上海依图网络科技有限公司 语音内容检测方法及其装置、介质和系统
CN112435691A (zh) * 2020-10-12 2021-03-02 珠海亿智电子科技有限公司 在线语音端点检测后处理方法、装置、设备及存储介质
CN113724698A (zh) * 2021-09-01 2021-11-30 马上消费金融股份有限公司 语音识别模型的训练方法、装置、设备及存储介质
CN114937450A (zh) * 2021-02-05 2022-08-23 清华大学 一种语音关键词识别方法及系统

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444193B (zh) * 2018-01-31 2021-12-14 腾讯科技(深圳)有限公司 语音关键词的识别方法和装置
JP6911785B2 (ja) * 2018-02-02 2021-07-28 日本電信電話株式会社 判定装置、判定方法及び判定プログラム
EP3811360A4 (en) 2018-06-21 2021-11-24 Magic Leap, Inc. PORTABLE SYSTEM VOICE PROCESSING
CN110752973B (zh) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 一种终端设备的控制方法、装置和终端设备
CN109065046A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 语音唤醒的方法、装置、电子设备及计算机可读存储介质
EP3931827A4 (en) 2019-03-01 2022-11-02 Magic Leap, Inc. INPUT DETERMINATION FOR A VOICE PROCESSING ENGINE
GB201904185D0 (en) * 2019-03-26 2019-05-08 Sita Information Networking Computing Uk Ltd Item classification system, device and method therefor
US11043218B1 (en) * 2019-06-26 2021-06-22 Amazon Technologies, Inc. Wakeword and acoustic event detection
CN110335592B (zh) * 2019-06-28 2022-06-03 腾讯科技(深圳)有限公司 语音音素识别方法和装置、存储介质及电子装置
CN110334244B (zh) * 2019-07-11 2020-06-09 出门问问信息科技有限公司 一种数据处理的方法、装置及电子设备
US11328740B2 (en) 2019-08-07 2022-05-10 Magic Leap, Inc. Voice onset detection
CN110364143B (zh) * 2019-08-14 2022-01-28 腾讯科技(深圳)有限公司 语音唤醒方法、装置及其智能电子设备
CN110570861B (zh) * 2019-09-24 2022-02-25 Oppo广东移动通信有限公司 用于语音唤醒的方法、装置、终端设备及可读存储介质
CN110992929A (zh) * 2019-11-26 2020-04-10 苏宁云计算有限公司 一种基于神经网络的语音关键词检测方法、装置及系统
CN111477223A (zh) * 2020-03-04 2020-07-31 深圳市佳士科技股份有限公司 焊机控制方法、装置、终端设备及计算机可读存储介质
CN111445899B (zh) * 2020-03-09 2023-08-01 咪咕文化科技有限公司 语音情绪识别方法、装置及存储介质
US11917384B2 (en) 2020-03-27 2024-02-27 Magic Leap, Inc. Method of waking a device using spoken voice commands
CN111768764B (zh) * 2020-06-23 2024-01-19 北京猎户星空科技有限公司 语音数据处理方法、装置、电子设备及介质
CN111833856B (zh) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 基于深度学习的语音关键信息标定方法
CN111798840B (zh) * 2020-07-16 2023-08-08 中移在线服务有限公司 语音关键词识别方法和装置
CN112634870B (zh) * 2020-12-11 2023-05-30 平安科技(深圳)有限公司 关键词检测方法、装置、设备和存储介质
CN112883375A (zh) * 2021-02-03 2021-06-01 深信服科技股份有限公司 恶意文件识别方法、装置、设备及存储介质
CN115148197A (zh) * 2021-03-31 2022-10-04 华为技术有限公司 语音唤醒方法、装置、存储介质及系统
CN113192501B (zh) * 2021-04-12 2022-04-22 青岛信芯微电子科技股份有限公司 一种指令词识别方法及装置
CN113838467B (zh) * 2021-08-02 2023-11-14 北京百度网讯科技有限公司 语音处理方法、装置及电子设备
EP4156179A1 (de) * 2021-09-23 2023-03-29 Siemens Healthcare GmbH Sprachsteuerung einer medizinischen vorrichtung
CN114141239A (zh) * 2021-11-29 2022-03-04 江南大学 基于轻量级深度学习的语音短指令识别方法及系统
CN116030792B (zh) * 2023-03-30 2023-07-25 深圳市玮欧科技有限公司 用于转换语音音色的方法、装置、电子设备和可读介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831891A (zh) * 2011-06-13 2012-12-19 富士通株式会社 一种语音数据处理方法及系统
CN105679310A (zh) * 2015-11-17 2016-06-15 乐视致新电子科技(天津)有限公司 一种用于语音识别方法及系统
CN105679316A (zh) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 一种基于深度神经网络的语音关键词识别方法及装置
CN107123417A (zh) * 2017-05-16 2017-09-01 上海交通大学 基于鉴别性训练的定制语音唤醒优化方法及系统
US9754584B2 (en) * 2014-12-22 2017-09-05 Google Inc. User specified keyword spotting using neural network feature extractor
US20170301341A1 (en) * 2016-04-14 2017-10-19 Xerox Corporation Methods and systems for identifying keywords in speech signal
CN108305617A (zh) * 2018-01-31 2018-07-20 腾讯科技(深圳)有限公司 语音关键词的识别方法和装置

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639507B2 (en) * 2007-12-25 2014-01-28 Nec Corporation Voice recognition system, voice recognition method, and program for voice recognition
KR20120072145A (ko) * 2010-12-23 2012-07-03 한국전자통신연구원 음성 인식 방법 및 장치
CN102915729B (zh) * 2011-08-01 2014-11-26 佳能株式会社 语音关键词检出系统、创建用于其的词典的系统和方法
CN102982024B (zh) * 2011-09-02 2016-03-23 北京百度网讯科技有限公司 一种搜索需求识别方法及装置
CN103177721B (zh) * 2011-12-26 2015-08-19 中国电信股份有限公司 语音识别方法和系统
US10304465B2 (en) * 2012-10-30 2019-05-28 Google Technology Holdings LLC Voice control user interface for low power mode
US9390708B1 (en) * 2013-05-28 2016-07-12 Amazon Technologies, Inc. Low latency and memory efficient keywork spotting
CN104143329B (zh) * 2013-08-19 2015-10-21 Tencent Technology (Shenzhen) Co., Ltd. Method and apparatus for speech keyword retrieval
CN103943107B (zh) * 2014-04-03 2017-04-05 Peking University Shenzhen Graduate School Audio-visual keyword recognition method based on decision-level fusion
US9484022B2 (en) * 2014-05-23 2016-11-01 Google Inc. Training multiple neural networks with different accuracy
JP6604013B2 (ja) * 2015-03-23 2019-11-13 Casio Computer Co., Ltd. Speech recognition apparatus, speech recognition method and program
KR102386854B1 (ко) * 2015-08-20 2022-04-13 Samsung Electronics Co., Ltd. Apparatus and method for speech recognition based on a unified model
US20170061959A1 (en) * 2015-09-01 2017-03-02 Disney Enterprises, Inc. Systems and Methods For Detecting Keywords in Multi-Speaker Environments
CN106856092B (zh) * 2015-12-09 2019-11-15 Institute of Acoustics, Chinese Academy of Sciences Chinese speech keyword retrieval method based on a feedforward neural network language model
CN106940998B (zh) * 2015-12-31 2021-04-16 Alibaba Group Holding Ltd. Method and apparatus for executing a preset operation
US10373612B2 (en) * 2016-03-21 2019-08-06 Amazon Technologies, Inc. Anchored speech detection and speech recognition
CN106328147B (zh) * 2016-08-31 2022-02-01 University of Science and Technology of China Speech recognition method and apparatus
US10311863B2 (en) * 2016-09-02 2019-06-04 Disney Enterprises, Inc. Classifying segments of speech based on acoustic features and context
CN106448663B (zh) * 2016-10-17 2020-10-23 Hisense Group Co., Ltd. Voice wake-up method and voice interaction apparatus
CN106547742B (zh) * 2016-11-30 2019-05-03 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial intelligence-based method and apparatus for processing semantic parsing results
JP6968012B2 (ja) 2017-03-30 2021-11-17 Valqua, Ltd. Laminate, method for producing the same, and gate seal
CN107221326B (zh) * 2017-05-16 2021-05-28 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial intelligence-based voice wake-up method, apparatus and computer device
CN107230475B (zh) * 2017-05-27 2022-04-05 Tencent Technology (Shenzhen) Co., Ltd. Speech keyword recognition method, apparatus, terminal and server
CN107274888B (zh) * 2017-06-14 2020-09-15 Dalian Maritime University Emotional speech recognition method based on octave-band signal strength and differentiated feature subsets
CN107622770B (zh) * 2017-09-30 2021-03-16 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831891A (zh) * 2011-06-13 2012-12-19 Fujitsu Ltd. Speech data processing method and system
US9754584B2 (en) * 2014-12-22 2017-09-05 Google Inc. User specified keyword spotting using neural network feature extractor
CN105679310A (zh) * 2015-11-17 2016-06-15 Leshi Zhixin Electronic Technology (Tianjin) Co., Ltd. Speech recognition method and system
CN105679316A (zh) * 2015-12-29 2016-06-15 Shenzhen Weifu Robot Technology Co., Ltd. Deep neural network-based speech keyword recognition method and apparatus
US20170301341A1 (en) * 2016-04-14 2017-10-19 Xerox Corporation Methods and systems for identifying keywords in speech signal
CN107123417A (zh) * 2017-05-16 2017-09-01 Shanghai Jiao Tong University Discriminative training-based optimization method and system for customized voice wake-up
CN108305617A (zh) * 2018-01-31 2018-07-20 Tencent Technology (Shenzhen) Co., Ltd. Speech keyword recognition method and apparatus

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081241A (zh) * 2019-11-20 2020-04-28 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Data detection method and apparatus for false device wake-up, mobile terminal and storage medium
CN111081241B (zh) * 2019-11-20 2023-04-07 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Data detection method and apparatus for false device wake-up, mobile terminal and storage medium
CN111508493A (zh) * 2020-04-20 2020-08-07 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Voice wake-up method, apparatus, electronic device and storage medium
CN111508493B (zh) * 2020-04-20 2022-11-15 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Voice wake-up method, apparatus, electronic device and storage medium
CN111951807A (zh) * 2020-08-21 2020-11-17 Shanghai Yitu Network Technology Co., Ltd. Voice content detection method and apparatus, medium and system
CN112435691A (zh) * 2020-10-12 2021-03-02 Zhuhai Yizhi Electronic Technology Co., Ltd. Post-processing method, apparatus, device and storage medium for online voice endpoint detection
CN112435691B (zh) * 2020-10-12 2024-03-12 Zhuhai Yizhi Electronic Technology Co., Ltd. Post-processing method, apparatus, device and storage medium for online voice endpoint detection
CN114937450A (zh) * 2021-02-05 2022-08-23 Tsinghua University Speech keyword recognition method and system
CN113724698A (zh) * 2021-09-01 2021-11-30 Mashang Consumer Finance Co., Ltd. Training method, apparatus, device and storage medium for a speech recognition model
CN113724698B (zh) * 2021-09-01 2024-01-30 Mashang Consumer Finance Co., Ltd. Training method, apparatus, device and storage medium for a speech recognition model

Also Published As

Publication number Publication date
JP2021512362A (ja) 2021-05-13
CN108305617A (zh) 2018-07-20
EP3748629B1 (en) 2023-09-06
US11222623B2 (en) 2022-01-11
EP3748629C0 (en) 2023-09-06
CN110444193A (zh) 2019-11-12
CN110444195B (zh) 2021-12-14
CN110444193B (zh) 2021-12-14
JP7005099B2 (ja) 2022-01-21
CN110444195A (zh) 2019-11-12
US20200286465A1 (en) 2020-09-10
EP3748629A1 (en) 2020-12-09
CN108305617B (zh) 2020-09-08
EP3748629A4 (en) 2021-10-27

Similar Documents

Publication Publication Date Title
WO2019149108A1 (zh) Speech keyword recognition method and apparatus, computer-readable storage medium, and computer device
EP3770905B1 (en) Speech recognition method, apparatus and device, and storage medium
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
CN109155132B (zh) Speaker verification method and system
CN110364143B (zh) Voice wake-up method and apparatus, and intelligent electronic device thereof
WO2018227780A1 (zh) Speech recognition method and apparatus, computer device and storage medium
WO2019154107A1 (zh) Voiceprint recognition method and apparatus based on memory bottleneck features
CN110097870B (zh) Speech processing method, apparatus, device and storage medium
CN106875936B (zh) Speech recognition method and apparatus
KR20090123396A (ko) System for voice activity detection and continuous speech recognition in noisy environments using real-time call-command recognition
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
CN110992942B (zh) Speech recognition method and apparatus, and apparatus for speech recognition
CN112331207B (zh) Service content monitoring method and apparatus, electronic device and storage medium
CN114155839A (zh) Voice endpoint detection method, apparatus, device and storage medium
CN112767921A (zh) Speech recognition adaptation method and system based on a cache language model
CN115827854A (zh) Speech summary generation model training method, and speech summary generation method and apparatus
CN116775873A (зh) Multimodal dialogue emotion recognition method
US12100388B2 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
Tabibian A survey on structured discriminative spoken keyword spotting
KR20210052563A (ко) Method and apparatus for providing a context-based speech recognition service
KR20240089301A (ко) Language-agnostic multilingual end-to-end streaming on-device ASR system
CN114756662A (зh) Task-specific text generation based on multimodal input
TW202129628A (зh) Speech recognition system with fine-grained decoding
CN111951807A (зh) Voice content detection method and apparatus, medium and system
WO2024008215A2 (зh) Speech emotion recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 19747243; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase
Ref document number: 2020540799; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase
Ref country code: DE
ENP Entry into the national phase
Ref document number: 2019747243; Country of ref document: EP; Effective date: 20200831