CA3162745A1 - Method of detecting speech keyword based on neural network, device and system - Google Patents

Method of detecting speech keyword based on neural network, device and system

Info

Publication number
CA3162745A1
Authority
CA
Canada
Prior art keywords
speech
keyword
speech feature
basic
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3162745A
Other languages
French (fr)
Inventor
Sukui XU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3162745A1 publication Critical patent/CA3162745A1/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/144 - Training of HMMs
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A neural network-based voice keyword detection method and device, and a system. Said method comprises the following steps: receiving a voice to be detected, and extracting voice features of the voice (S21); inputting the voice features into a pre-trained neural network model of a target language by frames, and outputting a basic phoneme corresponding to each frame of voice feature (S22); mapping each preset candidate keyword to the corresponding basic phoneme (S23); according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, calculating the score of the voice being each candidate keyword (S24); and determining whether a keyword is activated according to the score (S25). The voice keyword detection method saves system resources, and reduces the time and costs required for retraining a model.

Description

METHOD OF DETECTING SPEECH KEYWORD BASED ON NEURAL NETWORK, DEVICE AND SYSTEM
BACKGROUND OF THE INVENTION
Technical Field [0001] The present invention pertains to the field of computer speech recognition technology, and more particularly, relates to a method of detecting a speech keyword based on a neural network, and corresponding device and system.
Description of Related Art
[0002] In a traditional approach to the task of detecting speech keywords, a complete speech recognition decoder is introduced to decode the input speech in which keywords are to be detected, generating plural candidate results that are stored in a certain form, such as a Lattice structure; an inverted index is then generated, from which it can be quickly retrieved whether the speech to be detected contains the designated keyword. Such a Lattice-based keyword policy usually has a very high recall rate, because the plural candidates can all be expressed in the Lattice. Its deficiency is undue complexity: the entire recognition system must be introduced, the complicated Lattice must be processed, and generation of the inverted index usually further requires operations related to finite-state transducers (FSTs); all of this is very difficult to manipulate, and deployment is likewise complicated.
[0003] Under the newest framework of keyword detection based on a neural network, it is usual to establish a neural network for each keyword, and each neural network judges whether the keyword is activated through the summation of scores output from each frame.

Date Recue/Date Received 2022-05-24

However, establishing a neural network for each keyword to judge whether it is activated requires, on the one hand, a great deal of speech containing that keyword to train the model, and collecting such data is extremely troublesome; on the other hand, when keywords are added or modified, data must be collected again and the model trained again, making the whole process very complicated. Moreover, the false-alarm rate of such a model is also very high, and the system is inadvertently activated many times when this is not desired.
SUMMARY OF THE INVENTION
[0004] In view of the deficiency of undue complexity in the prior art, the present invention proposes a method of detecting a speech keyword based on a neural network, and a corresponding device and system. The present invention reduces the network model resources required by the keyword retrieval system; besides, there is no need to train the model again when a keyword is modified, which saves the time and cost required for retraining the model.
[0005] According to one aspect, the present invention discloses a method of detecting a speech keyword based on a neural network, and the method comprises the following steps:
[0006] receiving a speech to be detected, and extracting a speech feature of the speech;
[0007] inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0008] mapping each preset candidate keyword to corresponding basic phonemes;
[0009] calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and
[0010] judging whether any keyword is activated according to the scores.

[0011] Preferably, the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0012] outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0013] Preferably, the neural network model is obtained through the following steps:
[0014] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
[0015] extracting a sample speech feature of the sample speech; and
[0016] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
[0017] Preferably, the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0018] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, and obtaining at least one basic phoneme to which each frame of the speech feature corresponds.
[0019] Preferably, the step of calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
[0020] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0021] Preferably, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0022] Preferably, the step of judging whether any keyword is activated according to the scores includes:
[0023] sequentially judging, according to a descending order of the scores, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and stopping judging after the candidate keyword is activated.
[0024] According to another aspect, the present invention discloses a device for detecting a speech keyword based on a neural network, and the device comprises:
[0025] a speech feature extracting unit, for receiving a speech to be detected, and extracting a speech feature of the speech;
[0026] a basic phoneme predicting unit, for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0027] a candidate word mapping unit, for mapping each candidate keyword to corresponding basic phonemes;
[0028] a score calculating unit, for calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword; and
[0029] a judging unit, for judging whether any keyword is activated according to the scores.
[0030] Preferably, the basic phoneme predicting unit is employed for outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N
equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.

[0031] According to still another aspect, the present invention discloses a computer system that comprises:
[0032] one or more processor(s); and
[0033] a memory, associated with the one or more processor(s) and storing a program instruction that executes a terminal when it is read and executed by the one or more processor(s), wherein the terminal includes a memory and a processor, of which the processor reads a computer program instruction stored in the memory, so that the processor is enabled to execute the method as recited above.
[0034] According to the specific embodiments of the present application, the present application discloses the following technical effects.
[0035] With respect to different keywords, there is no need to train different neural network models, as it suffices to complete the detection of all the keywords by a single model alone. Under the traditional policy, one keyword requires one specific neural network model, whereby lots of resources are occupied.
[0036] When the keyword is modified, it is also not required to train the model again, as it suffices to modify the corresponding phoneme sequence. Under the traditional policy, when a keyword is modified, the model is necessarily retrained with specific speeches.
However, it is merely required in the present invention to train the network once with speech covering all phonemes of the target language, whereby the cost of retraining the model is greatly reduced, operation is made simple, and deployment is rendered convenient.
[0037] It suffices for the aforementioned product of the present invention to achieve one of the aforementioned effects.
[0038] Through detailed description with reference to the following accompanying drawings and specific modes of execution of the present invention, the characteristics and advantages of the present invention will be made clear.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] Fig. 1 is a flowchart illustrating the method of detecting a speech keyword according to the present invention;
[0040] Fig. 2 is a flowchart illustrating the method of Embodiment 1 in the present invention;
[0041] Fig. 3 is a view illustrating the structure of the device in Embodiment 2 of the present invention; and
[0042] Fig. 4 is a view illustrating the structure of the computer system according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0043] In order to make the technical solutions of the present invention more lucid and clear, the present invention is described in greater detail below in conjunction with accompanying drawings. As should be understood, the specific embodiments described here are merely meant to explain the present invention, rather than to restrict the present invention.
[0044] The present invention addresses the task of detecting speech keywords based on a neural network. Specifically, the modeling units of the neural network according to the present invention are not complete keywords or single characters in the keywords, but the basic phoneme units of the language to which the keywords pertain. Taking the Chinese language for example, the output nodes of the neural network according to the present invention model all the initial consonants and vowels of Chinese Pinyin (the Chinese phonetic alphabet), and the desired keywords are assembled according to sequences of the initial consonants and vowels.
[0045] In addition, since the neural network according to the present invention is relatively small, scores of the same speech obtained through plural neural networks can be further merged to further enhance the performance, so that the scores better reflect the confidence of the keywords, to enhance the recall rate of the keyword detection system and to lower the false-alarm rate.
[0046] Fig. 1 is a flowchart illustrating the method of detecting a speech keyword according to the present invention. As shown in Fig. 1, the method of detecting a speech keyword according to the present invention can be divided into two portions, one is to train a neural network model, and the other is to utilize the well-trained neural network model to detect speech keywords.
[0047] Training the neural network model includes the following steps.
[0048] Step 1 - obtaining a sample training set, which includes sample speech for training and a sample basic-phoneme marking result of the speech. For speech of a target language, a certain quantity of marked speech is collected, preferably forming a speech training set of at least 500 hours.
[0049] Step 2 - extracting a sample speech feature.
[0050] Step 3 - training a neural network model. A GMM-HMM model required for speech recognition is trained by means of sample speeches having basic phoneme marking results, and the speeches are forcefully aligned with this model to obtain information as to which basic phoneme or phonemes of the target language each frame of the feature-extracted speech belongs (if a frame belongs to plural basic phonemes, the probabilities of those phonemes sum to 1). In actual operation, the phoneme information to which one sentence corresponds can be obtained through mapping of resources of an existent dictionary, but the phoneme information of a particular frame cannot be determined that way, so it is required to train a GMM-HMM model by which the phoneme information of each frame can further be obtained.
[0051] Output nodes of the neural network indicate the basic phonemes of the target language, so the number of output nodes of the neural network can simply equal the number of basic phonemes of the target language. Taking the Chinese language for example, it can be the sum of the numbers of all initial consonants and vowels; taking the English language for example, it can be the number of International Phonetic Alphabet symbols. As an extension, for a tonal language such as Chinese, the vowels can carry tones, and there are altogether 5 tones (the four basic tones plus the neutral tone), so the total number of nodes is the number of initial consonants plus the number of vowels multiplied by 5. Moreover, it is further possible to add some additional nodes to account for those parts of the speech that do not pertain to any phoneme, such as noises, abnormal sounds, coughing sounds, etc.
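As a concrete illustration of the node-count arithmetic above, the following sketch computes the output-layer size for a toned Mandarin phone set. The inventory sizes are assumptions for illustration; the patent does not fix exact counts.

```python
# Illustrative Mandarin inventory sizes; the exact counts depend on the
# phone set chosen and are NOT specified by the patent.
NUM_INITIALS = 23   # initial consonants (shengmu)
NUM_VOWELS = 39     # vowels/finals (yunmu), before tone marking
NUM_TONES = 5       # four basic tones plus the neutral tone
NUM_EXTRA = 3       # optional nodes for noise, abnormal sounds, coughs

# Per the text: initials, plus vowels multiplied by 5, plus extra nodes.
output_nodes = NUM_INITIALS + NUM_VOWELS * NUM_TONES + NUM_EXTRA
print(output_nodes)  # 221
```

Even with tones, the output layer stays in the low hundreds, which is what keeps the model small enough for embedded deployment.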
[0052] The neural network model according to the present invention is not directed to complete keywords or single characters in the keywords, but to the basic phoneme units of the language to which the keywords pertain. Taking the Chinese language for example, the output nodes of the neural network according to the present invention model all the initial consonants and vowels of Chinese Pinyin, and the desired keywords are assembled according to sequences of the initial consonants and vowels.
[0053] By way of example, if the keyword is "xiaohuoxiaohuo" (the Chinese phonetic transliteration of 'young man young man'), then the combination of its corresponding initial-consonant and vowel sequences is "xiao3 huo3 xiao3 huo3". The basic phoneme units of a common language do not exceed 100 in number; even for a tonal language such as Chinese, the modeling units, tones included, generally do not exceed 500 in number. This keeps the neural network model small and easily deployable on embedded equipment such as a mobile phone, a camera, etc. The aforementioned network according to the present application can be embodied as a simple fully connected feedforward neural network, or as a relatively complex network such as a time-delay neural network, a convolutional neural network, a recurrent neural network, etc., all of which fall within the protection scope of the present invention.
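The keyword-to-phoneme mapping can be sketched with a toy pronunciation dictionary; the entries and romanization below are invented for illustration, and a real system would load a full lexicon file.

```python
# Toy pronunciation dictionary mapping candidate keywords to toned-Pinyin
# phoneme sequences. Entries are illustrative, not from a real lexicon.
PRONUNCIATION_DICT = {
    "xiaohuoxiaohuo": ["xiao3", "huo3", "xiao3", "huo3"],
    "nihao": ["n", "i3", "h", "ao3"],
}

def keyword_to_phonemes(keyword):
    """Map a candidate keyword to its basic-phoneme sequence."""
    return PRONUNCIATION_DICT[keyword]

print(keyword_to_phonemes("xiaohuoxiaohuo"))  # ['xiao3', 'huo3', 'xiao3', 'huo3']
```

Because the mapping is a dictionary lookup, adding a keyword is just a new entry; no network retraining is involved.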
[0054] Detection of the speech keyword by means of the well-trained neural network model includes the following steps.
[0055] Step 4 - receiving speech information to be detected as input by a user, and extracting a speech feature of the speech.
[0056] Step 5 - inputting the speech feature by frames into the neural network model trained in the foregoing step, and outputting corresponding phonemes. For each frame, the neural network produces a vector whose size equals the number of network output nodes. Suppose the speech has N frames altogether and there are M network output nodes; then a phoneme distribution matrix sized NxM is obtained.
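The per-frame outputs can be normalized into the NxM phoneme distribution matrix as follows; the logits are invented toy values, and a real model would produce them from the speech features.

```python
import math

def softmax(logits):
    """Normalize one frame's raw outputs into a phoneme distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy raw network outputs for N=3 frames over M=4 phoneme nodes.
frame_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 1.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 2.5],
]
matrix = [softmax(frame) for frame in frame_logits]  # the N x M matrix

# Each row is a probability distribution over the M basic phonemes.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in matrix)
```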
[0057] For different target languages, the values of N and M differ.
[0058] Step 6 - calculating the score of each candidate keyword, namely the score of the aforementioned NxM matrix being each candidate keyword. Each candidate keyword is mapped to a phoneme sequence via its pronunciation dictionary; since each phoneme corresponds to one node output from the network, the score of the phoneme sequence of the candidate keyword in the NxM matrix can be calculated. Such scoring modes include, but are not limited to, dynamic programming, constrained maximal-sequence scoring, or optimal-path scoring after brute-force exhaustive search in the NxM matrix space. To facilitate discussion, all score calculating methods possibly used in this process are collectively referred to as "score calculating policies".
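Of the policies named above, dynamic programming is the easiest to sketch. The recurrence below is one plausible formulation, not the patent's own: each frame emits one phoneme of the keyword's sequence, the phoneme index may stay put or advance by one, and the score is the best attainable sum of per-frame log-probabilities. `matrix` is an NxM frame-by-phoneme probability matrix and `phoneme_ids` maps the keyword's phonemes to column indices.

```python
import math

def dp_keyword_score(matrix, phoneme_ids):
    """Best monotone alignment of a keyword's phoneme sequence against an
    N x M frame-by-phoneme probability matrix (sum of log-probabilities)."""
    n, k = len(matrix), len(phoneme_ids)
    neg_inf = float("-inf")
    dp = [[neg_inf] * k for _ in range(n)]
    dp[0][0] = math.log(matrix[0][phoneme_ids[0]])
    for t in range(1, n):
        for j in range(k):
            prev = dp[t - 1][j]                      # stay on phoneme j
            if j > 0:
                prev = max(prev, dp[t - 1][j - 1])   # advance from j-1
            if prev > neg_inf:
                dp[t][j] = prev + math.log(matrix[t][phoneme_ids[j]])
    return dp[n - 1][k - 1]  # all frames consumed, last phoneme reached

# Toy 3-frame distribution over 4 phonemes; the keyword's two phonemes
# occupy columns 0 and 2.
matrix = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
print(round(dp_keyword_score(matrix, [0, 2]), 3))  # -1.224
```

The DP table is N x K (K = keyword length), so scoring stays cheap even on embedded hardware.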
[0059] The present invention makes it possible to train plural neural networks for score calculation. For one candidate keyword, plural scores are obtained by employing different "score calculating policies" in different score calculating neural networks, and these scores can be merged by different methods, such as weighted averaging, so as to obtain a better score.
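The merging step can be as simple as a weighted average of the per-network scores for one candidate keyword; the weights below are invented for illustration, and in practice would be tuned on held-out data.

```python
def merge_scores(scores, weights):
    """Weighted average of the scores from plural networks/policies."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# Two networks scored the same keyword 90 and 80; weight the first higher.
print(merge_scores([90.0, 80.0], [0.6, 0.4]))
```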
[0060] As should be noted, since any candidate keyword can certainly be mapped to a phoneme sequence, a score can certainly be calculated for any candidate keyword in step 6, and it is not required to train the neural network again. In addition, because only the phoneme sequence of the candidate keyword is considered here, candidate keywords identical in pronunciation but different in characters are regarded as equivalent.
[0061] Step 7 - judging whether any candidate keyword is activated. The candidate keyword with the highest score is selected from the collection of candidate keywords; if its score exceeds the predefined threshold for the candidate keywords, this candidate keyword is activated; otherwise, the candidate keyword with the second highest score is checked against the predefined threshold, and so on in this order. Once a candidate keyword is activated, control information concerning its activation is returned, thus completing recognition of the sentence. If the scores of all candidate keywords are lower than the threshold, information is returned to the effect that no candidate keyword is activated, and the whole process ends. Taking for example an app for financial payment on a mobile phone: after the app has been enabled, the user speaks "show the collection code" or "show the payment code", whereupon the system judges from the user's speech that a specific keyword has been received, and thereafter automatically shows the corresponding 2D code for use by the user.
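The activation check in step 7 can be sketched as follows, mirroring the payment-app example (the scores and thresholds are the illustrative values from the text):

```python
def activated_keyword(scores, thresholds):
    """Return the first keyword, in descending score order, whose score
    exceeds its predefined threshold, or None if no keyword is activated."""
    for keyword, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score > thresholds[keyword]:
            return keyword
    return None

scores = {"show collection code": 90, "show payment code": 40}
thresholds = {"show collection code": 80, "show payment code": 80}
print(activated_keyword(scores, thresholds))  # show collection code
```

Checking in descending order means judging stops at the first activated keyword, so lower-scoring candidates are never examined.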
[0062] Since this embodiment is based on a scenario of the Chinese language, a certain amount of Chinese language corpus is collected in advance, and it is very easy to find online a well-marked Chinese language corpus of over 500 hours. An open-source tool is employed to train a Chinese language GMM-HMM model, and the well-trained model is used to forcefully align the Chinese language corpus to obtain Chinese language phonemes, i.e., markings at the Chinese Pinyin level, namely the phoneme information of each frame.
[0063] Subsequently, the phoneme-level markings and corpus are used to train one or more neural network(s), which can be a fully connected feedforward neural network or a time-delay neural network, with the network output nodes numbering precisely the total number of phonemes. Training of the neural network(s) is then considered complete. The neural network resources are stored offline, packaged together with the mobile phone app, deployed on the mobile phone, and loaded into the memory of the mobile phone when the app is enabled. The app also stores the speech feature extraction policy and the collection of candidate keywords such as "show collection code" and "show payment code", etc.
[0064] When the user completes a sentence, such as "please show my collection code", the microphone of the mobile phone collects the sampling points of this sentence, feature extraction is performed, the features are fed to the neural network in memory, a phoneme distribution matrix is obtained as output, and scores of the phoneme distribution matrix of this sentence are then calculated with respect to the different candidate keywords. For outputs of plural neural networks, a more precise score is obtained through merging by a certain policy, such as weighted averaging. For instance, suppose that, as obtained through calculation, "please show my collection code" spoken by the user has a score of 90 with respect to the candidate keyword "show collection code" and a score of 40 with respect to the candidate keyword "show payment code", and the thresholds of the candidate keywords are all 80; inspecting, in descending order of the keyword scores, whether each keyword score exceeds its predefined threshold, the score of the candidate keyword "show collection code" is found to exceed the threshold, so this keyword is activated, and it suffices to use this activated keyword to perform subsequent operations.
[0065] Specifically speaking, all keywords supported by the application are written in one file and read into system memory as required. When it is necessary to modify or add keywords, it is not required to collect speech again or train the model again; it is merely required to modify the file. Common keyword policies require speech for the modified or newly added keywords in order to retrain the model, whereas such operation is dispensed with in the present invention, thereby greatly saving both cost and time.
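The keyword file described above could be a plain text file with one keyword per line, so that adding or changing keywords is a file edit rather than a retraining run. The file format is an assumption; the patent only says the keywords are written in one file.

```python
import os
import tempfile

def load_keywords(path):
    """Read the candidate-keyword collection, one keyword per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demonstrate with a temporary file standing in for the app's keyword file;
# adding a keyword later is just another line in this file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("show collection code\nshow payment code\n")
    path = f.name
print(load_keywords(path))  # ['show collection code', 'show payment code']
os.remove(path)
```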
[0066] Embodiment 1
[0067] Corresponding to the description above, Embodiment 1 of the present application discloses a method of detecting a speech keyword based on a neural network, as shown in Fig. 2, the method comprises the following steps.
[0068] S21 - receiving a speech to be detected, and extracting a speech feature of the speech.
[0069] S22 - inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds.
[0070] This step specifically includes:
[0071] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, obtaining at least one basic phoneme to which each frame of the speech feature corresponds, and outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0072] S23 - mapping each preset candidate keyword to corresponding basic phonemes.
Mapping can be effected via a pronunciation dictionary in this step.
[0073] S24 - calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword.
[0074] This step specifically includes:
[0075] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0076] Preferably, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0077] S25 - judging whether any keyword is activated according to the scores.
[0078] Specifically, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords can be sequentially judged according to a descending order of the scores, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and judging is stopped after the candidate keyword is activated.
[0079] The neural network model can be obtained through the following steps:

[0080] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
[0081] extracting a sample speech feature of the sample speech; and
[0082] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
[0083] Embodiment 2
[0084] Corresponding to the aforementioned method, Embodiment 2 of the present application further discloses a device for detecting a speech keyword based on a neural network, as shown in Fig. 3, the device comprises the following.
[0085] A speech feature extracting unit 31 is employed for receiving a speech to be detected, and extracting a speech feature of the speech.
[0086] A basic phoneme predicting unit 32 is employed for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds.
[0087] Specifically, the basic phoneme predicting unit 32 is employed for:
[0088] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, obtaining at least one basic phoneme to which each frame of the speech feature corresponds, and outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0089] A candidate word mapping unit 33 is employed for mapping each candidate keyword to corresponding basic phonemes. Mapping can be specifically effected via a pronunciation dictionary.
[0090] A score calculating unit 34 is employed for calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword.
[0091] Specifically, the score calculating unit 34 is employed for obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0092] Wherein, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0093] A judging unit 35 is employed for judging whether any keyword is activated according to the scores.
[0094] Specifically, the judging unit 35 is employed for sequentially judging, according to a descending order of the scores, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and stopping judging after the candidate keyword is activated.
[0095] Embodiment 3
[0096] Corresponding to the aforementioned method, Embodiment 3 of the present invention discloses a computer system that comprises:
[0097] one or more processor(s); and
[0098] a memory, associated with the one or more processor(s) and storing a program instruction that, when read and executed by the one or more processor(s), enables the processor(s) to execute the method as recited above.
[0099] Embodiment 4 of the present application provides a computer system that comprises:
[0100] one or more processor(s); and
[0101] a memory, associated with the one or more processor(s) and storing a program instruction that executes the following operations when it is read and executed by the one or more processor(s):
[0102] receiving a speech to be detected, and extracting a speech feature of the speech;
[0103] inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0104] mapping each preset candidate keyword to corresponding basic phonemes;
[0105] calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and
[0106] judging whether any keyword is activated according to the scores.
[0107] Preferably, the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0108] outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0109] Preferably, the neural network model is obtained through the following steps:
[0110] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;

[0111] extracting a sample speech feature of the sample speech; and
[0112] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
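The training setup of paragraphs [0110] to [0112] pairs each frame of the sample speech feature with its basic-phoneme marking. A minimal sketch of assembling such framewise (input, target) pairs, with invented feature vectors and labels:

```python
# Invented sample data: one feature vector per frame, and one basic-phoneme
# mark per frame (e.g. as produced by GMM-HMM forced alignment).
sample_speech_feature = [[0.1, 0.3], [0.2, 0.4], [0.5, 0.1]]
phoneme_marking = [0, 0, 1]

def framewise_pairs(features, labels):
    """Pair each frame's feature vector (network input) with its
    basic-phoneme mark (network target) for supervised training."""
    assert len(features) == len(labels), "one mark per frame"
    return list(zip(features, labels))

pairs = framewise_pairs(sample_speech_feature, phoneme_marking)
print(len(pairs))  # 3
```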
[0113] Preferably, the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0114] inputting the speech feature, frame by frame, into a previously well-trained GMM-HMM model of the target language to force-align the speech feature, and obtaining at least one basic phoneme corresponding to each frame of the speech feature.
[0115] Preferably, the step of calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
[0116] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0117] Preferably, the score calculating policies include at least two of: dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force enumeration over the N x M matrix space.
[0118] Preferably, the step of judging whether any keyword is activated according to the scores includes:
[0119] sequentially comparing, in descending order of the scores, each candidate keyword's score against the scoring threshold predefined for the candidate keywords, until a candidate keyword's score is judged to be greater than its predefined scoring threshold; that candidate keyword is then activated, and the judging stops.

[0120] Fig. 4 exemplarily illustrates the framework of a computer system that can specifically include a processor 1510, a video display adapter 1511, a magnetic disk driver 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 can be communicably connected with one another via a communication bus 1530.
[0121] The processor 1510 can be embodied as a general CPU (Central Processing Unit), a microprocessor, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuit(s) for executing relevant program(s) to realize the technical solutions provided by the present application.
[0122] The memory 1520 can be embodied in such a form as a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, or a dynamic storage device.
The memory 1520 can store an operating system 1521 for controlling the running of a computer system 1500, and a basic input/output system (BIOS) for controlling lower-level operations of the computer system 1500. In addition, the memory 1520 can also store a web browser 1523, a data storage administration system 1524, an icon font processing system 1525, etc. The icon font processing system 1525 can be an application program that specifically realizes the aforementioned step operations in the embodiments of the present application. In summary, when the technical solutions provided by the present application are realized via software or firmware, the relevant program codes are stored in the memory 1520, and are invoked and executed by the processor 1510.
[0123] The input/output interface 1513 is employed to connect with an input/output module to realize input and output of information. The input/output module can be equipped in the device as a component part (not shown in the drawings), or can be externally connected to the device to provide corresponding functions. The input means can include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output means can include a display screen, a loudspeaker, a vibrator, an indicator light, etc.
[0124] The network interface 1514 is employed to connect to a communication module (not shown in the drawings) to realize intercommunication between the current device and other devices. The communication module can realize communication in a wired mode (e.g., via USB or network cable) or in a wireless mode (e.g., via mobile network, Wi-Fi, or Bluetooth).
[0125] The bus 1530 includes a passageway transmitting information between various component parts of the device (such as the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
[0126] Additionally, the computer system 1500 may further obtain information of specific collection conditions from a virtual resource object collection condition information database 1541 for judgment on conditions, and so on.
[0127] As should be noted, although merely the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, the memory 1520, and the bus 1530 are illustrated for the aforementioned device, the device may further include other component parts necessary for normal operation during specific implementation. In addition, as can be understood by persons skilled in the art, the aforementioned device may as well include only the component parts necessary for realizing the solutions of the present application, rather than the entirety of the component parts illustrated.

[0128] As can be seen from the description of the aforementioned embodiments, persons skilled in the art will clearly understand that the present application can be realized through software plus a general hardware platform. Based on such understanding, the technical solutions of the present application, or the contributions made thereby over the state of the art, can essentially be embodied in the form of a software product; such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes a plurality of instructions enabling computer equipment (such as a personal computer, a server, or a network device) to execute the methods described in the various embodiments, or in some sections of the embodiments, of the present application.
[0129] The various embodiments are described progressively in the Description; identical or similar sections among the various embodiments can be inferred from one another, and each embodiment stresses what differs from the other embodiments.
Particularly, with respect to the system or device embodiment, since it is essentially similar to the method embodiment, its description is relatively simple, and the relevant sections can be inferred from the corresponding sections of the method embodiment. The system or device embodiment described above is merely exemplary in nature: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is to say, they can be located in a single site or distributed over a plurality of network units. Some or all of the modules can be selected according to practical requirements to realize the objectives of the embodied solutions, which is understandable and implementable by persons ordinarily skilled in the art without creative effort.
[0130] The method of detecting a speech keyword based on a neural network, and the corresponding device and system provided by the present application, are described in detail above. Specific examples are used herein to expound the principles and modes of execution of the present application, and the descriptions of the aforementioned embodiments are merely meant to help understand the method and the core concept of the present application; at the same time, for persons ordinarily skilled in the art, there may be variations in both the specific modes of execution and the range of application based on the conception of the present application. Accordingly, the contents of the current Description shall not be understood to restrict the present application. In summary, the present invention employs a very simple mode to achieve the same functions: it changes the requirement in the traditional technology of plural neural network models for plural keywords into the requirement of only one neural network model for plural keywords, whereby the neural network can be made extremely small, as a model of 10M suffices to achieve very good performance, so that it is adapted for deployment in embedded equipment and completes its functions with very low resource occupation. In addition, the keywords are freely configurable, and there is no need to collect data and retrain the model for specific keywords; likewise, it is not required to retrain the model when keywords are modified, thus dispensing with the troublesome step of collecting a specific keyword corpus and saving the time required to train the model again.
[0131] What the above describes is merely directed to preferred embodiments of the present invention, and the patent scope of the present invention is not restricted thereby. Any equivalent structural change made by employing the contents of the Description and drawings of the present invention under its conception, or any direct or indirect application to other related technical fields, shall all be covered within the patent protection scope of the present invention.


Claims (10)

What is claimed is:
1. A method of detecting a speech keyword based on a neural network, characterized in comprising the following steps:
receiving a speech to be detected, and extracting a speech feature of the speech;
inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
mapping each preset candidate keyword to corresponding basic phonemes;
calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and judging whether any keyword is activated according to the scores.
2. The method according to Claim 1, characterized in that the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
3. The method according to Claim 1, characterized in that the neural network model is obtained through the following steps:
obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
extracting a sample speech feature of the sample speech; and taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
4. The method according to Claim 1, characterized in that the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to force-align the speech feature, and obtaining at least one basic phoneme to which each frame of the speech feature corresponds.
5. The method according to Claim 1, characterized in that the step of calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
6. The method according to Claim 5, characterized in that the score calculating policies include at least two of: dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force enumeration in NxM matrix space.
7. The method according to any one of Claims 1 to 6, characterized in that the step of judging whether any keyword is activated according to the scores includes:
sequentially comparing, in descending order of the scores, each candidate keyword's score against the scoring threshold predefined for the candidate keywords, until a candidate keyword's score is judged to be greater than its predefined scoring threshold, and stopping the judging after that candidate keyword is activated.
8. A device for detecting a speech keyword based on a neural network, characterized in that the device comprises:
a speech feature extracting unit, for receiving a speech to be detected, and extracting a speech feature of the speech;

a basic phoneme predicting unit, for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
a candidate word mapping unit, for mapping each candidate keyword to corresponding basic phonemes;
a score calculating unit, for calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword; and a judging unit, for judging whether any keyword is activated according to the scores.
9. The device according to Claim 8, characterized in that the basic phoneme predicting unit is employed for outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M
equals the number of basic phonemes of the target language.
10. A computer system, characterized in comprising:
one or more processor(s); and a memory, associated with the one or more processor(s) and storing a program instruction that executes the method as recited in any one of Claims 1 to 7 when it is read and executed by the one or more processor(s).
CA3162745A 2019-11-26 2020-08-28 Method of detecting speech keyword based on neutral network, device and system Pending CA3162745A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201911173619.1A CN110992929A (en) 2019-11-26 2019-11-26 Voice keyword detection method, device and system based on neural network
CN201911173619.1 2019-11-26
PCT/CN2020/111940 WO2021103712A1 (en) 2019-11-26 2020-08-28 Neural network-based voice keyword detection method and device, and system

Publications (1)

Publication Number Publication Date
CA3162745A1 true CA3162745A1 (en) 2021-06-03

Family

ID=70087106

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3162745A Pending CA3162745A1 (en) 2019-11-26 2020-08-28 Method of detecting speech keyword based on neutral network, device and system

Country Status (3)

Country Link
CN (1) CN110992929A (en)
CA (1) CA3162745A1 (en)
WO (1) WO2021103712A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992929A (en) * 2019-11-26 2020-04-10 苏宁云计算有限公司 Voice keyword detection method, device and system based on neural network
CN111489737B (en) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111797607B (en) * 2020-06-04 2024-03-29 语联网(武汉)信息技术有限公司 Sparse noun alignment method and system
CN111933124B (en) * 2020-09-18 2021-04-30 电子科技大学 Keyword detection method capable of supporting self-defined awakening words
CN113506584B (en) * 2021-07-06 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Data processing method and device
CN113724710A (en) * 2021-10-19 2021-11-30 广东优碧胜科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN114978866B (en) * 2022-05-25 2024-02-20 北京天融信网络安全技术有限公司 Detection method, detection device and electronic equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8321218B2 (en) * 2009-06-19 2012-11-27 L.N.T.S. Linguistech Solutions Ltd Searching in audio speech
CN103971678B (en) * 2013-01-29 2015-08-12 腾讯科技(深圳)有限公司 Keyword spotting method and apparatus
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN106297776B (en) * 2015-05-22 2019-07-09 中国科学院声学研究所 A kind of voice keyword retrieval method based on audio template
JP6679898B2 (en) * 2015-11-24 2020-04-15 富士通株式会社 KEYWORD DETECTION DEVICE, KEYWORD DETECTION METHOD, AND KEYWORD DETECTION COMPUTER PROGRAM
US10199037B1 (en) * 2016-06-29 2019-02-05 Amazon Technologies, Inc. Adaptive beam pruning for automatic speech recognition
US10403268B2 (en) * 2016-09-08 2019-09-03 Intel IP Corporation Method and system of automatic speech recognition using posterior confidence scores
CN108615525B (en) * 2016-12-09 2020-10-09 中国移动通信有限公司研究院 Voice recognition method and device
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN108182937B (en) * 2018-01-17 2021-04-13 出门问问创新科技有限公司 Keyword recognition method, device, equipment and storage medium
CN110444195B (en) * 2018-01-31 2021-12-14 腾讯科技(深圳)有限公司 Method and device for recognizing voice keywords
CN108932943A (en) * 2018-07-12 2018-12-04 广州视源电子科技股份有限公司 Command word sound detection method, device, equipment and storage medium
CN109243460A (en) * 2018-08-15 2019-01-18 浙江讯飞智能科技有限公司 A method of automatically generating news or interrogation record based on the local dialect
CN110223673B (en) * 2019-06-21 2020-01-17 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device, storage medium and electronic equipment
CN110428809B (en) * 2019-06-28 2022-04-26 腾讯科技(深圳)有限公司 Speech phoneme recognition method and device, storage medium and electronic device
CN110992929A (en) * 2019-11-26 2020-04-10 苏宁云计算有限公司 Voice keyword detection method, device and system based on neural network

Also Published As

Publication number Publication date
WO2021103712A1 (en) 2021-06-03
CN110992929A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CA3162745A1 (en) Method of detecting speech keyword based on neutral network, device and system
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
US11367434B2 (en) Electronic device, method for determining utterance intention of user thereof, and non-transitory computer-readable recording medium
KR102596446B1 (en) Modality learning on mobile devices
US20190005947A1 (en) Speech recognition method and apparatus therefor
JP5901001B1 (en) Method and device for acoustic language model training
US11967315B2 (en) System and method for multi-spoken language detection
US20220108080A1 (en) Reinforcement Learning Techniques for Dialogue Management
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
EP3707703A1 (en) Training and/or using a language selection model for automatically determining language for speech recognition of spoken utterance
CN110263218B (en) Video description text generation method, device, equipment and medium
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
KR102469712B1 (en) Electronic device and Method for generating Natural Language thereof
CN112331229A (en) Voice detection method, device, medium and computing equipment
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
CN112329433A (en) Text smoothness detection method, device and equipment and computer readable storage medium
CN114547252A (en) Text recognition method and device, electronic equipment and medium
CN111862963A (en) Voice wake-up method, device and equipment
US20120253804A1 (en) Voice processor and voice processing method
CN112559725A (en) Text matching method, device, terminal and storage medium
JP2002297181A (en) Method of registering and deciding voice recognition vocabulary and voice recognizing device
CN114758649B (en) Voice recognition method, device, equipment and medium
KR20200140171A (en) Electronic device and Method for controlling the electronic device thereof
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN113707178B (en) Audio evaluation method and device and non-transient storage medium

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
