CA3162745A1 - Method of detecting speech keyword based on neural network, device and system - Google Patents

Method of detecting speech keyword based on neural network, device and system

Info

Publication number
CA3162745A1
Authority
CA
Canada
Prior art keywords
speech
keyword
speech feature
basic
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3162745A
Other languages
French (fr)
Inventor
Sukui XU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3162745A1 publication Critical patent/CA3162745A1/en
Pending legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/144 - Training of HMMs
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A neural network-based voice keyword detection method and device, and a system. Said method comprises the following steps: receiving a voice to be detected, and extracting voice features of the voice (S21); inputting the voice features into a pre-trained neural network model of a target language by frames, and outputting a basic phoneme corresponding to each frame of voice feature (S22); mapping each preset candidate keyword to the corresponding basic phoneme (S23); according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, calculating the score of the voice being each candidate keyword (S24); and determining whether a keyword is activated according to the score (S25). The voice keyword detection method saves system resources, and reduces the time and costs required for retraining a model.

Description

METHOD OF DETECTING SPEECH KEYWORD BASED ON NEURAL NETWORK, DEVICE AND SYSTEM
BACKGROUND OF THE INVENTION
Technical Field [0001] The present invention pertains to the field of computer speech recognition technology, and more particularly, relates to a method of detecting a speech keyword based on a neural network, and corresponding device and system.
Description of Related Art
[0002] In a traditional approach to the task of detecting speech keywords, a complete speech recognition decoder is introduced to decode the input speech in which keywords are to be detected, generating plural candidate results that are stored in a certain form, such as a Lattice structure; an inverted index is then generated, from which it can be quickly retrieved whether the speech to be detected contains the designated keyword. Such a Lattice-based keyword policy usually has a very high recall rate, because the plural candidates can all be expressed in the Lattice. Its deficiency is undue complexity: the entire recognition system must be introduced, the complicated Lattice must be processed, and generation of the inverted index usually further requires operations related to finite-state transducers (FSTs); all of this is very difficult to manipulate, and deployment is likewise complicated.
[0003] Under the newest framework of keyword detection based on a neural network, it is usual to establish a neural network for each keyword, and each neural network judges whether the keyword is activated through the summation of scores output from each frame.

Date Recue/Date Received 2022-05-24

However, establishing a neural network for each keyword to judge whether it is activated requires, on the one hand, a great deal of speech containing that keyword to train the model, and collecting such data is extremely troublesome; on the other hand, when keywords are added or modified, data must be collected again and the model trained again, making the whole process very complicated. Moreover, the false-alarm rate of such a model is also very high, and the system is inadvertently activated many times when this is not desired.
SUMMARY OF THE INVENTION
[0004] In view of the deficiency of undue complexity in the prior art, the present invention proposes a method of detecting a speech keyword based on a neural network, and a corresponding device and system. The present invention reduces the network model resources required by the keyword retrieval system; besides, there is no need to train the model again when a keyword is modified, which saves the time and cost required for retraining the model.
[0005] According to one aspect, the present invention discloses a method of detecting a speech keyword based on a neural network, and the method comprises the following steps:
[0006] receiving a speech to be detected, and extracting a speech feature of the speech;
[0007] inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0008] mapping each preset candidate keyword to corresponding basic phonemes;
[0009] calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and
[0010] judging whether any keyword is activated according to the scores.

[0011] Preferably, the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0012] outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0013] Preferably, the neural network model is obtained through the following steps:
[0014] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
[0015] extracting a sample speech feature of the sample speech; and
[0016] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
[0017] Preferably, the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0018] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, and obtaining at least one basic phoneme to which each frame of the speech feature corresponds.
[0019] Preferably, the step of calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
[0020] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0021] Preferably, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0022] Preferably, the step of judging whether any keyword is activated according to the scores includes:
[0023] sequentially judging, according to a descending order of the scores, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and stopping judging after the candidate keyword is activated.
[0024] According to another aspect, the present invention discloses a device for detecting a speech keyword based on a neural network, and the device comprises:
[0025] a speech feature extracting unit, for receiving a speech to be detected, and extracting a speech feature of the speech;
[0026] a basic phoneme predicting unit, for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0027] a candidate word mapping unit, for mapping each candidate keyword to corresponding basic phonemes;
[0028] a score calculating unit, for calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword; and
[0029] a judging unit, for judging whether any keyword is activated according to the scores.
[0030] Preferably, the basic phoneme predicting unit is employed for outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N
equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.

[0031] According to still another aspect, the present invention discloses a computer system that comprises:
[0032] one or more processor(s); and
[0033] a memory, associated with the one or more processor(s) and storing a program instruction that executes a terminal when it is read and executed by the one or more processor(s), wherein the terminal includes a memory and a processor, of which the processor reads a computer program instruction stored in the memory, so that the processor is enabled to execute the method as recited above.
[0034] According to the specific embodiments of the present application, the present application discloses the following technical effects.
[0035] With respect to different keywords, there is no need to train different neural network models, as it suffices to complete the detection of all the keywords by a single model alone. Under the traditional policy, one keyword requires one specific neural network model, whereby lots of resources are occupied.
[0036] When the keyword is modified, it is also not required to train the model again, as it suffices to modify the corresponding phoneme sequence. Under the traditional policy, when a keyword is modified, the model is necessarily retrained with specific speeches.
However, it is merely required in the present invention to train the network once with speech covering all phonemes of the target language, whereby the cost of retraining the model is greatly reduced, operation is made simple, and deployment is rendered convenient.
[0037] It suffices for the aforementioned product of the present invention to achieve one of the aforementioned effects.
[0038] Through detailed description with reference to the following accompanying drawings and specific modes of execution of the present invention, the characteristics and advantages of the present invention will be made clear.
BRIEF DESCRIPTION OF THE DRAWINGS
[0039] Fig. 1 is a flowchart illustrating the method of detecting a speech keyword according to the present invention;
[0040] Fig. 2 is a flowchart illustrating the method of Embodiment 1 in the present invention;
[0041] Fig. 3 is a view illustrating the structure of the device in Embodiment 2 of the present invention; and
[0042] Fig. 4 is a view illustrating the structure of the computer system according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0043] In order to make the technical solutions of the present invention more lucid and clear, the present invention is described in greater detail below in conjunction with accompanying drawings. As should be understood, the specific embodiments described here are merely meant to explain the present invention, rather than to restrict the present invention.
[0044] The present invention addresses the task of detecting speech keywords based on a neural network. Specifically, the modeling units of the neural network according to the present invention are not complete keywords or single characters in the keywords, but the basic phoneme units of the language to which the keywords pertain. Taking the Chinese language for example, the output nodes of the neural network according to the present invention model all the initial consonants and vowels of Chinese Pinyin (the Chinese phonetic alphabet), and the desired keywords are assembled according to sequences of the initial consonants and vowels.
[0045] In addition, since the neural network according to the present invention is relatively small, scores of the same speech obtained through plural neural networks can be further merged to further enhance the performance, so that the scores better reflect the confidence of the keywords, to enhance the recall rate of the keyword detection system and to lower the false-alarm rate.
[0046] Fig. 1 is a flowchart illustrating the method of detecting a speech keyword according to the present invention. As shown in Fig. 1, the method of detecting a speech keyword according to the present invention can be divided into two portions, one is to train a neural network model, and the other is to utilize the well-trained neural network model to detect speech keywords.
[0047] Training the neural network model includes the following steps.
[0048] Step 1 - obtaining a sample training set, which includes sample speech for training and a sample basic-phoneme marking result of the speech. For speech of a target language, a certain quantity of marked speech is collected, preferably forming a speech training set of at least 500 hours.
[0049] Step 2 - extracting a sample speech feature.
[0050] Step 3 - training a neural network model. A GMM-HMM model required for speech recognition is trained by means of sample speeches having basic phoneme marking results, and the speeches are forcefully aligned with this model to obtain information as to which basic phoneme or phonemes of the target language each frame of the feature-extracted speech belongs (if a frame belongs to plural basic phonemes, the probabilities of those phonemes sum to 1). In actual operation, the phoneme information to which one sentence corresponds can be obtained through mapping of resources of an existent dictionary, but the phoneme information of a particular frame cannot be determined that way, so it is required to train a GMM-HMM model by which the phoneme information of each frame can further be obtained.
[0051] Output nodes of the neural network indicate the basic phonemes of the target language, so the number of output nodes of the neural network can simply equal the number of basic phonemes of the target language. Taking the Chinese language for example, it can be the sum of the numbers of all initial consonants and vowels; taking the English language for example, it can be the number of International Phonetic Alphabet symbols. As an extension, for a tonal language such as Chinese, the vowels can carry tones, and there are altogether 5 tones (the four basic tones plus the neutral tone), so the total number of nodes is the number of initial consonants plus the number of vowels multiplied by 5. Moreover, it is further possible to add some additional nodes to account for those parts of the speech that do not pertain to any phoneme, such as noises, abnormal sounds, coughing sounds, etc.
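As a concrete illustration of the node-count arithmetic above, the following sketch computes the output-layer size for a toned Mandarin phone set. The inventory sizes are assumptions for illustration; the patent does not fix exact counts.

```python
# Illustrative Mandarin inventory sizes; the exact counts depend on the
# phone set chosen and are NOT specified by the patent.
NUM_INITIALS = 23   # initial consonants (shengmu)
NUM_VOWELS = 39     # vowels/finals (yunmu), before tone marking
NUM_TONES = 5       # four basic tones plus the neutral tone
NUM_EXTRA = 3       # optional nodes for noise, abnormal sounds, coughs

# Per the text: initials, plus vowels multiplied by 5, plus extra nodes.
output_nodes = NUM_INITIALS + NUM_VOWELS * NUM_TONES + NUM_EXTRA
print(output_nodes)  # 221
```

Even with tones, the output layer stays in the low hundreds, which is what keeps the model small enough for embedded deployment.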
[0052] The neural network model according to the present invention is not directed to complete keywords or single characters in the keywords, but to the basic phoneme units of the language to which the keywords pertain. Taking the Chinese language for example, the output nodes of the neural network according to the present invention model all the initial consonants and vowels of Chinese Pinyin, and the desired keywords are assembled according to sequences of the initial consonants and vowels.
[0053] By way of example, if the keyword is "xiaohuoxiaohuo" (the Chinese phonetic transliteration of 'young man young man'), then the combination of its corresponding initial-consonant and vowel sequences is "xiao3 huo3 xiao3 huo3". The basic phoneme units of a common language do not exceed 100 in number; even for a tonal language such as Chinese, the modeling units, tones included, generally do not exceed 500 in number. This keeps the neural network model small and easily deployable on embedded equipment such as a mobile phone, a camera, etc. The aforementioned network according to the present application can be embodied as a simple fully connected feedforward neural network, or as a relatively complex network such as a time-delay neural network, a convolutional neural network, a recurrent neural network, etc., all of which fall within the protection scope of the present invention.
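The keyword-to-phoneme mapping can be sketched with a toy pronunciation dictionary; the entries and romanization below are invented for illustration, and a real system would load a full lexicon file.

```python
# Toy pronunciation dictionary mapping candidate keywords to toned-Pinyin
# phoneme sequences. Entries are illustrative, not from a real lexicon.
PRONUNCIATION_DICT = {
    "xiaohuoxiaohuo": ["xiao3", "huo3", "xiao3", "huo3"],
    "nihao": ["n", "i3", "h", "ao3"],
}

def keyword_to_phonemes(keyword):
    """Map a candidate keyword to its basic-phoneme sequence."""
    return PRONUNCIATION_DICT[keyword]

print(keyword_to_phonemes("xiaohuoxiaohuo"))  # ['xiao3', 'huo3', 'xiao3', 'huo3']
```

Because the mapping is a dictionary lookup, adding a keyword is just a new entry; no network retraining is involved.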
[0054] Detection of the speech keyword by means of the well-trained neural network model includes the following steps.
[0055] Step 4 - receiving speech information to be detected as input by a user, and extracting a speech feature of the speech.
[0056] Step 5 - inputting the speech feature by frames into the neural network model trained in the foregoing step, and outputting corresponding phonemes. For each frame, the neural network produces a vector whose size equals the number of network output nodes. Suppose the speech has N frames altogether and there are M network output nodes; then a phoneme distribution matrix sized NxM is obtained.
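The per-frame outputs can be normalized into the NxM phoneme distribution matrix as follows; the logits are invented toy values, and a real model would produce them from the speech features.

```python
import math

def softmax(logits):
    """Normalize one frame's raw outputs into a phoneme distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy raw network outputs for N=3 frames over M=4 phoneme nodes.
frame_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 1.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 2.5],
]
matrix = [softmax(frame) for frame in frame_logits]  # the N x M matrix

# Each row is a probability distribution over the M basic phonemes.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in matrix)
```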
[0057] For different target languages, the values of N and M differ.
[0058] Step 6 - calculating the score of each candidate keyword, namely the score of the aforementioned NxM matrix being each candidate keyword. Each candidate keyword is mapped to a phoneme sequence via its pronunciation dictionary; since each phoneme corresponds to one node output from the network, the score of the phoneme sequence of the candidate keyword in the NxM matrix can be calculated. Such scoring modes include, but are not limited to, dynamic programming, constrained maximal-sequence scoring, or optimal-path scoring after brute-force exhaustive search in the NxM matrix space. To facilitate discussion, all score calculating methods possibly used in this process are collectively referred to as "score calculating policies".
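Of the policies named above, dynamic programming is the easiest to sketch. The recurrence below is one plausible formulation, not the patent's own: each frame emits one phoneme of the keyword's sequence, the phoneme index may stay put or advance by one, and the score is the best attainable sum of per-frame log-probabilities. `matrix` is an NxM frame-by-phoneme probability matrix and `phoneme_ids` maps the keyword's phonemes to column indices.

```python
import math

def dp_keyword_score(matrix, phoneme_ids):
    """Best monotone alignment of a keyword's phoneme sequence against an
    N x M frame-by-phoneme probability matrix (sum of log-probabilities)."""
    n, k = len(matrix), len(phoneme_ids)
    neg_inf = float("-inf")
    dp = [[neg_inf] * k for _ in range(n)]
    dp[0][0] = math.log(matrix[0][phoneme_ids[0]])
    for t in range(1, n):
        for j in range(k):
            prev = dp[t - 1][j]                      # stay on phoneme j
            if j > 0:
                prev = max(prev, dp[t - 1][j - 1])   # advance from j-1
            if prev > neg_inf:
                dp[t][j] = prev + math.log(matrix[t][phoneme_ids[j]])
    return dp[n - 1][k - 1]  # all frames consumed, last phoneme reached

# Toy 3-frame distribution over 4 phonemes; the keyword's two phonemes
# occupy columns 0 and 2.
matrix = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
print(round(dp_keyword_score(matrix, [0, 2]), 3))  # -1.224
```

The DP table is N x K (K = keyword length), so scoring stays cheap even on embedded hardware.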
[0059] The present invention makes it possible to train plural neural networks for score calculation. For one candidate keyword, plural scores are obtained by employing different "score calculating policies" in different score calculating neural networks, and these scores can be merged by different methods, such as weighted averaging, so as to obtain a better score.
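The merging step can be as simple as a weighted average of the per-network scores for one candidate keyword; the weights below are invented for illustration, and in practice would be tuned on held-out data.

```python
def merge_scores(scores, weights):
    """Weighted average of the scores from plural networks/policies."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# Two networks scored the same keyword 90 and 80; weight the first higher.
print(merge_scores([90.0, 80.0], [0.6, 0.4]))
```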
[0060] As should be noted, since any candidate keyword can certainly be mapped to a phoneme sequence, a score can certainly be calculated for any candidate keyword in step 6, and it is not required to train the neural network again. In addition, because only the phoneme sequence of the candidate keyword is considered here, candidate keywords identical in pronunciation but different in characters are regarded as equivalent.
[0061] Step 7 - judging whether any candidate keyword is activated. The candidate keyword with the highest score is selected from the collection of candidate keywords; if its score exceeds the predefined threshold for the candidate keywords, this candidate keyword is activated; otherwise, the candidate keyword with the second highest score is checked against the predefined threshold, and so on in this order. Once a candidate keyword is activated, control information concerning its activation is returned, thus completing recognition of the sentence. If the scores of all candidate keywords are lower than the threshold, information is returned to the effect that no candidate keyword is activated, and the whole process ends. Taking for example an app for financial payment on a mobile phone: after the app has been enabled, the user speaks "show the collection code" or "show the payment code", whereupon the system judges from the user's speech that a specific keyword has been received, and thereafter automatically shows the corresponding 2D code for use by the user.
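The activation check in step 7 can be sketched as follows, mirroring the payment-app example (the scores and thresholds are the illustrative values from the text):

```python
def activated_keyword(scores, thresholds):
    """Return the first keyword, in descending score order, whose score
    exceeds its predefined threshold, or None if no keyword is activated."""
    for keyword, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score > thresholds[keyword]:
            return keyword
    return None

scores = {"show collection code": 90, "show payment code": 40}
thresholds = {"show collection code": 80, "show payment code": 80}
print(activated_keyword(scores, thresholds))  # show collection code
```

Checking in descending order means judging stops at the first activated keyword, so lower-scoring candidates are never examined.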
[0062] Since this embodiment is based on a scenario of the Chinese language, a certain amount of Chinese language corpus is collected in advance, and it is very easy to find online a well-marked Chinese language corpus of over 500 hours. An open-source tool is employed to train a Chinese language GMM-HMM model, and the well-trained model is used to forcefully align the Chinese language corpus to obtain Chinese language phonemes, i.e., markings at the Chinese Pinyin level, namely the phoneme information of each frame.
[0063] Subsequently, the phoneme-level markings and corpus are used to train one or more neural network(s), which can be a fully connected feedforward neural network or a time-delay neural network, with the network output nodes numbering precisely the total number of phonemes. Training of the neural network(s) is then considered complete. The neural network resources are stored offline, packaged together with the mobile phone app, deployed on the mobile phone, and loaded into the memory of the mobile phone when the app is enabled. The app also stores the speech feature extraction policy and the collection of candidate keywords such as "show collection code" and "show payment code", etc.
[0064] When the user completes a sentence, such as "please show my collection code", the microphone of the mobile phone collects the sampling points of this sentence, feature extraction is performed, the features are fed to the neural network in memory, a phoneme distribution matrix is obtained as output, and scores of the phoneme distribution matrix of this sentence are then calculated with respect to the different candidate keywords. For outputs of plural neural networks, a more precise score is obtained through merging by a certain policy, such as weighted averaging. For instance, suppose that, as obtained through calculation, "please show my collection code" spoken by the user has a score of 90 with respect to the candidate keyword "show collection code" and a score of 40 with respect to the candidate keyword "show payment code", and the thresholds of the candidate keywords are all 80; inspecting, in descending order of the keyword scores, whether each keyword score exceeds its predefined threshold, the score of the candidate keyword "show collection code" is found to exceed the threshold, so this keyword is activated, and it suffices to use this activated keyword to perform subsequent operations.
[0065] Specifically speaking, all keywords supported by the application are written in one file and read into system memory as required. When it is necessary to modify or add keywords, it is not required to collect speech again or train the model again; it is merely required to modify the file. Common keyword policies require speech for the modified or newly added keywords in order to retrain the model, whereas such operation is dispensed with in the present invention, thereby greatly saving both cost and time.
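The keyword file described above could be a plain text file with one keyword per line, so that adding or changing keywords is a file edit rather than a retraining run. The file format is an assumption; the patent only says the keywords are written in one file.

```python
import os
import tempfile

def load_keywords(path):
    """Read the candidate-keyword collection, one keyword per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demonstrate with a temporary file standing in for the app's keyword file;
# adding a keyword later is just another line in this file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("show collection code\nshow payment code\n")
    path = f.name
print(load_keywords(path))  # ['show collection code', 'show payment code']
os.remove(path)
```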
[0066] Embodiment 1
[0067] Corresponding to the description above, Embodiment 1 of the present application discloses a method of detecting a speech keyword based on a neural network, as shown in Fig. 2, the method comprises the following steps.
[0068] S21 - receiving a speech to be detected, and extracting a speech feature of the speech.
[0069] S22 - inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds.
[0070] This step specifically includes:
[0071] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, obtaining at least one basic phoneme to which each frame of the speech feature corresponds, and outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0072] S23 - mapping each preset candidate keyword to corresponding basic phonemes.
Mapping can be effected via a pronunciation dictionary in this step.
[0073] S24 - calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword.
[0074] This step specifically includes:
[0075] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0076] Preferably, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0077] S25 - judging whether any keyword is activated according to the scores.
[0078] Specifically, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords can be sequentially judged according to a descending order of the scores, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and judging is stopped after the candidate keyword is activated.
[0079] The neural network model can be obtained through the following steps:

[0080] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
[0081] extracting a sample speech feature of the sample speech; and
[0082] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
[0083] Embodiment 2
[0084] Corresponding to the aforementioned method, Embodiment 2 of the present application further discloses a device for detecting a speech keyword based on a neural network, as shown in Fig. 3, the device comprises the following.
[0085] A speech feature extracting unit 31 is employed for receiving a speech to be detected, and extracting a speech feature of the speech.
[0086] A basic phoneme predicting unit 32 is employed for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds.
[0087] Specifically, the basic phoneme predicting unit 32 is employed for:
[0088] inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to forcefully align the speech feature, obtaining at least one basic phoneme to which each frame of the speech feature corresponds, and outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0089] A candidate word mapping unit 33 is employed for mapping each candidate keyword to corresponding basic phonemes. Mapping can be specifically effected via a pronunciation dictionary.
[0090] A score calculating unit 34 is employed for calculating a score of each candidate keyword graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword.
[0091] Specifically, the score calculating unit 34 is employed for obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0092] Wherein, the score calculating policies include at least two of dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force exhaustive search in the N x M matrix space.
[0093] A judging unit 35 is employed for judging whether any keyword is activated according to the scores.
[0094] Specifically, the judging unit 35 is employed for sequentially judging, according to a descending order of the scores, relations between candidate keyword scores and a scoring threshold predefined for the candidate keywords, until it is judged there is a score of the candidate keyword greater than the scoring threshold predefined for the candidate keywords, and stopping judging after the candidate keyword is activated.
[0095] Embodiment 3
[0096] Corresponding to the aforementioned method, Embodiment 3 of the present invention discloses a computer system that comprises:
[0097] one or more processor(s); and
[0098] a memory, associated with the one or more processor(s) and storing a program instruction that, when read and executed by the one or more processor(s), enables the processor(s) to execute the method as recited above.
[0099] Embodiment 4 of the present application provides a computer system that comprises:
[0100] one or more processor(s); and
[0101] a memory, associated with the one or more processor(s) and storing a program instruction that executes the following operations when it is read and executed by the one or more processor(s):
[0102] receiving a speech to be detected, and extracting a speech feature of the speech;
[0103] inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
[0104] mapping each preset candidate keyword to corresponding basic phonemes;
[0105] calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and
[0106] judging whether any keyword is activated according to the scores.
[0107] Preferably, the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0108] outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
[0109] Preferably, the neural network model is obtained through the following steps:
[0110] obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;

[0111] extracting a sample speech feature of the sample speech; and
[0112] taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
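The training setup of paragraphs [0110] to [0112] pairs each frame of the sample speech feature with its basic-phoneme marking. A minimal sketch of assembling such framewise (input, target) pairs, with invented feature vectors and labels:

```python
# Invented sample data: one feature vector per frame, and one basic-phoneme
# mark per frame (e.g. as produced by GMM-HMM forced alignment).
sample_speech_feature = [[0.1, 0.3], [0.2, 0.4], [0.5, 0.1]]
phoneme_marking = [0, 0, 1]

def framewise_pairs(features, labels):
    """Pair each frame's feature vector (network input) with its
    basic-phoneme mark (network target) for supervised training."""
    assert len(features) == len(labels), "one mark per frame"
    return list(zip(features, labels))

pairs = framewise_pairs(sample_speech_feature, phoneme_marking)
print(len(pairs))  # 3
```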
[0113] Preferably, the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
[0114] inputting the speech feature, frame by frame, into a previously well-trained GMM-HMM model of the target language to force-align the speech feature, and obtaining at least one basic phoneme corresponding to each frame of the speech feature.
[0115] Preferably, the step of calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
[0116] obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
[0117] Preferably, the score calculating policies include at least two of: dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force enumeration over the N x M matrix space.
[0118] Preferably, the step of judging whether any keyword is activated according to the scores includes:
[0119] sequentially comparing, in descending order of the scores, each candidate keyword's score against the scoring threshold predefined for the candidate keywords, until a candidate keyword's score is judged to be greater than its predefined scoring threshold; that candidate keyword is then activated, and the judging stops.

[0120] Fig. 4 exemplarily illustrates the framework of a computer system that can specifically include a processor 1510, a video display adapter 1511, a magnetic disk driver 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 can be communicably connected with one another via a communication bus 1530.
[0121] The processor 1510 can be embodied as a general CPU (Central Processing Unit), a microprocessor, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuit(s) for executing relevant program(s) to realize the technical solutions provided by the present application.
[0122] The memory 1520 can be embodied in such a form as a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, or a dynamic storage device.
The memory 1520 can store an operating system 1521 for controlling the running of a computer system 1500, and a basic input/output system (BIOS) for controlling lower-level operations of the computer system 1500. In addition, the memory 1520 can also store a web browser 1523, a data storage administration system 1524, an icon font processing system 1525, etc. The icon font processing system 1525 can be an application program that specifically realizes the aforementioned step operations in the embodiments of the present application. In summary, when the technical solutions provided by the present application are realized via software or firmware, the relevant program codes are stored in the memory 1520, and are invoked and executed by the processor 1510.
[0123] The input/output interface 1513 is employed to connect with an input/output module to realize input and output of information. The input/output module can be equipped in the device as a component part (not shown in the drawings), or can be externally connected to the device to provide corresponding functions. The input means can include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output means can include a display screen, a loudspeaker, a vibrator, an indicator light, etc.
[0124] The network interface 1514 is employed to connect to a communication module (not shown in the drawings) to realize intercommunication between the current device and other devices. The communication module can realize communication in a wired mode (e.g., via USB or network cable) or in a wireless mode (e.g., via mobile network, Wi-Fi, or Bluetooth).
[0125] The bus 1530 includes a passageway transmitting information between various component parts of the device (such as the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
[0126] Additionally, the computer system 1500 may further obtain information of specific collection conditions from a virtual resource object collection condition information database 1541 for judgment on conditions, and so on.
[0127] As should be noted, although merely the processor 1510, the video display adapter 1511, the magnetic disk driver 1512, the input/output interface 1513, the network interface 1514, the memory 1520, and the bus 1530 are illustrated for the aforementioned device, the device may further include other component parts necessary for normal operation during specific implementation. In addition, as can be understood by persons skilled in the art, the aforementioned device may as well include only the component parts necessary for realizing the solutions of the present application, rather than the entirety of the component parts illustrated.

[0128] As can be seen from the description of the aforementioned embodiments, persons skilled in the art will clearly understand that the present application can be realized through software plus a general hardware platform. Based on such understanding, the technical solutions of the present application, or the contributions made thereby over the state of the art, can essentially be embodied in the form of a software product; such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes a plurality of instructions enabling computer equipment (such as a personal computer, a server, or a network device) to execute the methods described in the various embodiments, or in some sections of the embodiments, of the present application.
[0129] The various embodiments are described progressively in the Description; identical or similar sections among the various embodiments can be inferred from one another, and each embodiment stresses what differs from the other embodiments.
Particularly, with respect to the system or device embodiment, since it is essentially similar to the method embodiment, its description is relatively simple, and the relevant sections can be inferred from the corresponding sections of the method embodiment. The system or device embodiment described above is merely exemplary in nature: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is to say, they can be located in a single site or distributed over a plurality of network units. Some or all of the modules can be selected according to practical requirements to realize the objectives of the embodied solutions, which is understandable and implementable by persons ordinarily skilled in the art without creative effort.
[0130] The method of detecting a speech keyword based on a neural network, and the corresponding device and system provided by the present application, are described in detail above. Specific examples are used herein to expound the principles and modes of execution of the present application, and the descriptions of the aforementioned embodiments are merely meant to help understand the method and the core concept of the present application; at the same time, for persons ordinarily skilled in the art, there may be variations in both the specific modes of execution and the range of application based on the conception of the present application. Accordingly, the contents of the current Description shall not be understood to restrict the present application. In summary, the present invention employs a very simple mode to achieve the same functions: it changes the requirement in the traditional technology of plural neural network models for plural keywords into the requirement of only one neural network model for plural keywords, whereby the neural network can be made extremely small, as a model of 10M suffices to achieve very good performance, so that it is adapted for deployment in embedded equipment and completes its functions with very low resource occupation. In addition, the keywords are freely configurable, and there is no need to collect data and retrain the model for specific keywords; likewise, it is not required to retrain the model when keywords are modified, thus dispensing with the troublesome step of collecting a specific keyword corpus and saving the time required to train the model again.
[0131] What the above describes is merely directed to preferred embodiments of the present invention, and the patent scope of the present invention is not restricted thereby. Any equivalent structural change made by employing the contents of the Description and drawings of the present invention under its conception, or any direct or indirect application to other related technical fields, shall all be covered within the patent protection scope of the present invention.


Claims (10)

What is claimed is:
1. A method of detecting a speech keyword based on a neural network, characterized in comprising the following steps:
receiving a speech to be detected, and extracting a speech feature of the speech;
inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
mapping each preset candidate keyword to corresponding basic phonemes;
calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword;
and judging whether any keyword is activated according to the scores.
2. The method according to Claim 1, characterized in that the step of outputting basic phonemes to which each frame of the speech feature corresponds includes:
outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M equals the number of basic phonemes of the target language.
3. The method according to Claim 1, characterized in that the neural network model is obtained through the following steps:
obtaining a sample dataset for training, wherein the sample dataset includes a sample speech and a sample basic phoneme marking result corresponding to the sample speech;
extracting a sample speech feature of the sample speech; and taking the sample speech feature as input, and taking the sample basic phoneme marking result to which the sample speech corresponds as output to train the neural network model.
4. The method according to Claim 1, characterized in that the step of inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds includes:
inputting the speech feature by frames into a previously well-trained GMM-HMM model of a target language to force-align the speech feature, and obtaining at least one basic phoneme to which each frame of the speech feature corresponds.
5. The method according to Claim 1, characterized in that the step of calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword includes:
obtaining a plurality of scores calculated by a plurality of score calculating policies according to basic phonemes of the speech feature and basic phonemes of the candidate keyword, and merging the plurality of scores to obtain a final score.
6. The method according to Claim 5, characterized in that the score calculating policies include at least two of: dynamic programming, constrained maximal-sequence scoring, and optimal-path scoring after brute-force enumeration in NxM matrix space.
7. The method according to any one of Claims 1 to 6, characterized in that the step of judging whether any keyword is activated according to the scores includes:
sequentially comparing, in descending order of the scores, each candidate keyword's score against the scoring threshold predefined for the candidate keywords, until a candidate keyword's score is judged to be greater than its predefined scoring threshold, and stopping the judging after that candidate keyword is activated.
8. A device for detecting a speech keyword based on a neural network, characterized in that the device comprises:
a speech feature extracting unit, for receiving a speech to be detected, and extracting a speech feature of the speech;

a basic phoneme predicting unit, for inputting the speech feature by frames into a previously well-trained neural network model of a target language, and outputting basic phonemes to which each frame of the speech feature corresponds;
a candidate word mapping unit, for mapping each candidate keyword to corresponding basic phonemes;
a score calculating unit, for calculating a score of each candidate keyword as graded by the speech according to basic phonemes of the speech feature and basic phonemes of the candidate keyword; and a judging unit, for judging whether any keyword is activated according to the scores.
9. The device according to Claim 8, characterized in that the basic phoneme predicting unit is employed for outputting basic phonemes to which each frame of the speech feature corresponds in a matrix of NxM, where N equals the number of frames of the speech, and M
equals the number of basic phonemes of the target language.
10. A computer system, characterized in comprising:
one or more processor(s); and a memory, associated with the one or more processor(s) and storing a program instruction that executes the method as recited in any one of Claims 1 to 7 when it is read and executed by the one or more processor(s).
CA3162745A 2019-11-26 2020-08-28 Method of detecting speech keyword based on neutral network, device and system Pending CA3162745A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201911173619.1A CN110992929A (en) 2019-11-26 2019-11-26 Voice keyword detection method, device and system based on neural network
CN201911173619.1 2019-11-26
PCT/CN2020/111940 WO2021103712A1 (en) 2019-11-26 2020-08-28 Neural network-based voice keyword detection method and device, and system

Publications (1)

Publication Number Publication Date
CA3162745A1 true CA3162745A1 (en) 2021-06-03

Family

ID=70087106

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3162745A Pending CA3162745A1 (en) 2019-11-26 2020-08-28 Method of detecting speech keyword based on neutral network, device and system

Country Status (3)

Country Link
CN (1) CN110992929A (en)
CA (1) CA3162745A1 (en)
WO (1) WO2021103712A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992929A (en) * 2019-11-26 2020-04-10 苏宁云计算有限公司 Voice keyword detection method, device and system based on neural network
CN111489737B (en) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 Voice command recognition method and device, storage medium and computer equipment
CN111797607B (en) * 2020-06-04 2024-03-29 语联网(武汉)信息技术有限公司 Sparse noun alignment method and system
CN111933124B (en) * 2020-09-18 2021-04-30 电子科技大学 Keyword detection method capable of supporting self-defined awakening words
CN113506584B (en) * 2021-07-06 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Data processing method and device
CN113724710A (en) * 2021-10-19 2021-11-30 广东优碧胜科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN114978866B (en) * 2022-05-25 2024-02-20 北京天融信网络安全技术有限公司 Detection method, detection device and electronic equipment

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8321218B2 (en) * 2009-06-19 2012-11-27 L.N.T.S. Linguistech Solutions Ltd Searching in audio speech
CN103971678B (en) * 2013-01-29 2015-08-12 腾讯科技(深圳)有限公司 Keyword spotting method and apparatus
CN105374352B (en) * 2014-08-22 2019-06-18 中国科学院声学研究所 A kind of voice activated method and system
CN106297776B (en) * 2015-05-22 2019-07-09 中国科学院声学研究所 A kind of voice keyword retrieval method based on audio template
JP6679898B2 (en) * 2015-11-24 2020-04-15 富士通株式会社 KEYWORD DETECTION DEVICE, KEYWORD DETECTION METHOD, AND KEYWORD DETECTION COMPUTER PROGRAM
US10199037B1 (en) * 2016-06-29 2019-02-05 Amazon Technologies, Inc. Adaptive beam pruning for automatic speech recognition
US10403268B2 (en) * 2016-09-08 2019-09-03 Intel IP Corporation Method and system of automatic speech recognition using posterior confidence scores
CN108615525B (en) * 2016-12-09 2020-10-09 中国移动通信有限公司研究院 Voice recognition method and device
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN108182937B (en) * 2018-01-17 2021-04-13 出门问问创新科技有限公司 Keyword recognition method, device, equipment and storage medium
CN110444195B (en) * 2018-01-31 2021-12-14 腾讯科技(深圳)有限公司 Method and device for recognizing voice keywords
CN108932943A (en) * 2018-07-12 2018-12-04 广州视源电子科技股份有限公司 Command word sound detection method, device, equipment and storage medium
CN109243460A (en) * 2018-08-15 2019-01-18 浙江讯飞智能科技有限公司 A method of automatically generating news or interrogation record based on the local dialect
CN110223673B (en) * 2019-06-21 2020-01-17 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device, storage medium and electronic equipment
CN110428809B (en) * 2019-06-28 2022-04-26 腾讯科技(深圳)有限公司 Speech phoneme recognition method and device, storage medium and electronic device
CN110992929A (en) * 2019-11-26 2020-04-10 苏宁云计算有限公司 Voice keyword detection method, device and system based on neural network

Also Published As

Publication number Publication date
WO2021103712A1 (en) 2021-06-03
CN110992929A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CA3162745A1 (en) Method of detecting speech keyword based on neutral network, device and system
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
US11367434B2 (en) Electronic device, method for determining utterance intention of user thereof, and non-transitory computer-readable recording medium
KR102596446B1 (en) Modality learning on mobile devices
US20190005947A1 (en) Speech recognition method and apparatus therefor
JP5901001B1 (en) Method and device for acoustic language model training
US11967315B2 (en) System and method for multi-spoken language detection
US20220108080A1 (en) Reinforcement Learning Techniques for Dialogue Management
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
EP3707703A1 (en) Training and/or using a language selection model for automatically determining language for speech recognition of spoken utterance
CN110263218B (en) Video description text generation method, device, equipment and medium
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
KR102469712B1 (en) Electronic device and Method for generating Natural Language thereof
CN112331229A (en) Voice detection method, device, medium and computing equipment
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
CN112329433A (en) Text smoothness detection method, device and equipment and computer readable storage medium
CN114547252A (en) Text recognition method and device, electronic equipment and medium
CN111862963A (en) Voice wake-up method, device and equipment
US20120253804A1 (en) Voice processor and voice processing method
CN112559725A (en) Text matching method, device, terminal and storage medium
JP2002297181A (en) Method of registering and deciding voice recognition vocabulary and voice recognizing device
CN114758649B (en) Voice recognition method, device, equipment and medium
KR20200140171A (en) Electronic device and Method for controlling the electronic device thereof
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN113707178B (en) Audio evaluation method and device and non-transient storage medium

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
