CN112151015B - Keyword detection method, keyword detection device, electronic equipment and storage medium - Google Patents

Keyword detection method, keyword detection device, electronic equipment and storage medium

Info

Publication number
CN112151015B
Authority
CN
China
Prior art keywords
voice
recognized
word
wake
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010915963.XA
Other languages
Chinese (zh)
Other versions
CN112151015A (en)
Inventor
吕志强
黄申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010915963.XA
Publication of CN112151015A
Application granted
Publication of CN112151015B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a keyword detection method, a keyword detection device, electronic equipment and a storage medium. The keyword detection method includes the following steps: performing feature extraction on the voice to be recognized to obtain voice features; detecting wake-up words in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, segmenting the voice features based on the voice state of the voice to be recognized; detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.

Description

Keyword detection method, keyword detection device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a keyword detection method, a keyword detection device, an electronic device, and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science; by studying the design principles and implementation methods of various intelligent machines, it enables machines to perceive, reason and make decisions. Artificial intelligence is a wide-ranging discipline that involves many fields, such as natural language processing and machine learning/deep learning; as the technology develops, it is expected to be applied in more and more fields and to deliver increasingly important value.
However, current keyword detection often fails to meet users' requirements, and problems such as delay and misrecognition frequently occur during keyword detection.
Disclosure of Invention
The application provides a keyword detection method, a keyword detection device, electronic equipment and a storage medium, which can improve keyword detection efficiency while ensuring keyword detection accuracy.
The application provides a keyword detection method, which comprises the following steps:
extracting features of the voice to be recognized to obtain voice features;
detecting wake-up words in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result;
when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, segmenting the voice features based on the voice state of the voice to be recognized;
detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain keyword detection results;
and fusing the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
Correspondingly, the application also provides a keyword detection device, which comprises:
the extraction module is used for extracting the characteristics of the voice to be recognized to obtain the voice characteristics;
the first detection module is used for detecting wake-up words in the voice to be recognized according to a preset acoustic model and the voice characteristics to obtain a wake-up word detection result;
the segmentation module is used for segmenting the voice characteristics based on the voice state of the voice to be recognized when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word;
The second detection module is used for detecting the keywords in the voice to be recognized based on the segmented voice characteristics, the preset acoustic model and the preset keywords to obtain a keyword detection result;
and the fusion module is used for fusing the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
Optionally, in some embodiments of the present application, the second detection module includes:
the first acquisition unit is used for acquiring phoneme information of a word to be recognized in the voice to be recognized according to the preset acoustic model;
and the detection unit is used for detecting the keywords in the voice to be recognized based on the phoneme information, the segmented voice characteristics and the preset keywords to obtain a keyword detection result.
Optionally, in some embodiments of the present application, the detection unit includes:
a translation subunit, configured to perform text translation on the speech to be recognized according to the segmented speech features, so as to obtain a speech text of the speech to be recognized;
a determining subunit, configured to determine a word to be identified corresponding to the phoneme information from the voice text;
And the selecting subunit is used for selecting the word to be identified, which is matched with the preset keyword, from the determined words to be identified, so as to obtain a target keyword set.
Optionally, in some embodiments of the present application, the selecting subunit is specifically configured to:
calculating the keyword similarity between the determined words to be identified and the preset keywords;
and determining the word to be identified with the keyword similarity larger than a preset value as a target keyword to obtain a target keyword set.
Optionally, in some embodiments of the present application, the translation subunit is specifically configured to:
generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information;
and identifying the phoneme sequence by using a preset language model to obtain a voice text of the voice to be identified.
Optionally, in some embodiments of the present application, the first detection module includes:
the second acquisition unit is used for acquiring a preset acoustic model and a preset phoneme library;
the recognition unit is used for recognizing the voice characteristics by adopting the preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized;
the selecting unit is used for selecting phonemes matched with the phoneme information from a preset phoneme library to obtain target phonemes;
And the generating unit is used for generating a wake-up word set based on the obtained target phonemes.
Optionally, in some embodiments of the present application, the selecting unit is specifically configured to:
extracting phonemes to be recognized corresponding to each word to be recognized in the voice to be recognized from the phoneme information;
calculating the similarity between the extracted phonemes to be recognized and each candidate phoneme of a preset phoneme library to obtain the phoneme similarity;
and obtaining the target phonemes by using the candidate phonemes with the phoneme similarity larger than the preset phoneme similarity.
Optionally, in some embodiments of the present application, the segmentation module is specifically configured to:
collecting voice frame information of each frame in the voice to be recognized, wherein the voice frame information comprises collected voice frames and time stamps corresponding to the voice frames;
detecting the voice state of each frame in the voice to be recognized;
determining a voice frame with the voice state being an activated state as a target voice frame;
and dividing the voice features according to the determined target voice frame and the corresponding time stamp thereof.
According to the method and the device, after feature extraction is performed on the voice to be recognized to obtain the voice features, the wake-up words in the voice to be recognized are detected according to the preset acoustic model and the voice features to obtain the wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the voice features are segmented based on the voice state of the voice to be recognized. The keywords in the voice to be recognized are then detected based on the segmented voice features, the preset acoustic model and the preset keywords to obtain the keyword detection result. Finally, the wake-up word detection result and the keyword detection result are fused to obtain the keyword recognition result of the voice to be recognized. The method and the device can therefore improve keyword detection efficiency while guaranteeing keyword detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1a is a schematic view of a keyword detection method provided in the present application;
FIG. 1b is a schematic flow chart of a keyword detection method provided in the present application;
FIG. 2a is a schematic flow chart of a keyword detection method provided in the present application;
FIG. 2b is a flow chart of the keyword detection system provided herein;
FIG. 2c illustrates neural network training schemes used in the keyword detection method provided herein;
fig. 3 is a schematic structural diagram of a keyword detection apparatus provided in the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The application provides a keyword detection method, a keyword detection device, electronic equipment and a storage medium.
The keyword detection device may be integrated in a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. The device may also cooperate with a terminal, which may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited herein.
For example, referring to fig. 1a, the present application provides a keyword detection system that includes a user 10, a terminal 20 and a server 30. After receiving the voice to be recognized input by the user 10 through the microphone of the terminal 20, the terminal 20 transmits the voice to be recognized to the server 30. The server 30 performs feature extraction on the voice to be recognized to obtain the voice features, and then detects the wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the server 30 segments the voice features based on the voice state of the voice to be recognized, and then detects the keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result. Finally, the server 30 fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
When detecting keywords in the voice to be recognized, the keyword detection method provided by the application makes use of the phoneme information of the wake-up word, which improves the efficiency of recognizing keywords in the voice to be recognized; at the same time, the two different types of key terms in the voice to be recognized (the keywords and the wake-up words) are recognized in two separate stages, so that keyword detection accuracy is guaranteed while keyword detection efficiency is improved.
The following will describe in detail. It should be noted that the following description order of embodiments is not a limitation of the priority order of embodiments.
A keyword detection method comprising: extracting features of the voice to be recognized to obtain voice features, detecting wake-up words in the voice to be recognized according to the voice features to obtain wake-up word detection results, dividing the voice features based on the voice state of the voice to be recognized when the wake-up word detection results indicate that the voice to be recognized contains the wake-up words, detecting keywords in the voice to be recognized based on the divided voice features, a preset acoustic model and preset keywords to obtain keyword detection results, and fusing the wake-up word detection results and the keyword detection results to obtain keyword recognition results of the voice to be recognized.
Referring to fig. 1b, fig. 1b is a flow chart of a keyword detection method provided in the present application. The specific flow of the keyword detection method can be as follows:
101. Feature extraction is performed on the voice to be recognized to obtain the voice features.
Specifically, the voice to be recognized may be collected through a microphone of a terminal device (such as a mobile phone or a notebook computer), downloaded from a network database over a wired or wireless connection, or obtained by accessing a local database, as the actual situation requires. After the voice to be recognized is obtained, a deep learning network may be used to extract its features.
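As an illustration only, since the application does not fix a particular feature type or toolkit, the following Python sketch extracts frame-level log-Mel features, a common input to deep acoustic models; the librosa front-end and all parameter values are assumptions rather than the application's actual configuration:

```python
import numpy as np
import librosa  # assumed dependency; any feature front-end would do

def extract_features(wav_path, sr=16000, n_mels=40):
    """Extract frame-level log-Mel features from the voice to be recognized."""
    audio, _ = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop, a typical ASR framing
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T  # shape: (num_frames, n_mels)
```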
102. Wake-up words in the voice to be recognized are detected according to the preset acoustic model and the voice features to obtain a wake-up word detection result.
It should be noted that in this application the wake-up word may be a single character or a word, and the wake-up word detection result includes the detected wake-up word and its phoneme information. The phoneme is the smallest basic unit of speech and is the basis on which humans distinguish one word from another. Phonemes make up syllables, and syllables in turn make up different words and phrases. Phonemes are divided into vowels and consonants. A vowel is a sound produced when the airflow passes through the mouth without obstruction during articulation; different vowels arise from different mouth shapes (vowels are closely related to formants). Sounds formed when the airflow is obstructed in the mouth or pharynx are called consonants; different consonants arise from different articulation positions and articulation methods. The literature further distinguishes unvoiced and voiced sounds, and strictly speaking the extraction of many features requires separating the two. When the airflow passes through the glottis and the area of some part of the vocal tract is small, the airflow passing through that part at high speed becomes turbulent; when the ratio of airflow velocity to cross-sectional area exceeds a critical value, friction noise is produced, which is an unvoiced sound. In short, the vocal cords do not vibrate during an unvoiced sound, so unvoiced sounds are aperiodic; they are produced by air friction and are treated like noise in analytical research. Sounds produced by vocal-cord vibration during pronunciation are called voiced sounds, and voiced sounds are periodic. Consonants may be unvoiced or voiced, while vowels in most languages are voiced.
For example, the wake-up words in the voice to be recognized may be detected by using a preset acoustic model and a preset phoneme library. Specifically, the voice features are processed with the preset acoustic model to obtain the phoneme information corresponding to each word to be recognized in the voice to be recognized, and phonemes matching the phoneme information are then selected from the preset phoneme library to obtain the wake-up word detection result. That is, optionally, in some embodiments, the step of "detecting the wake-up words in the voice to be recognized according to the voice features to obtain the wake-up word detection result" may specifically include:
(11) Acquiring a preset acoustic model and a preset phoneme library;
(12) Identifying the voice characteristics by adopting a preset acoustic model to obtain phoneme information corresponding to each word to be identified in the voice to be identified;
(13) Selecting a phoneme matched with the phoneme information from a preset phoneme library to obtain a target phoneme;
(14) And generating a wake-up word set based on the obtained target phonemes.
The acoustic model is constructed in advance and can be obtained through joint training of a Long Short-Term Memory network (LSTM) and Connectionist Temporal Classification (CTC). In conventional acoustic model training, each frame of data can be trained effectively only if its corresponding label is known, so the speech must be aligned in a preprocessing step before training. The alignment process must be iterated many times to be accurate, which makes the whole training process very time-consuming. Compared with conventional acoustic model training, training with CTC as the loss function is fully end-to-end: the data does not need to be aligned in advance, and only an input sequence and an output sequence are needed for training. No data alignment or one-to-one labeling is required, and CTC directly outputs the probabilities of sequence predictions without external post-processing. LSTM is a recurrent neural network (RNN) variant specially designed to solve the long-term dependency problem of ordinary RNNs; training LSTM jointly with CTC avoids the problems of vanishing or exploding gradients.
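A minimal PyTorch sketch of the joint LSTM + CTC training just described; the layer sizes, phoneme inventory and dummy data are assumptions, not the configuration actually used in this application:

```python
import torch
import torch.nn as nn

class LstmCtcAcousticModel(nn.Module):
    """LSTM acoustic model trained end-to-end with a CTC loss, so no
    frame-level alignment of the training data is required."""
    def __init__(self, feat_dim=40, hidden=256, num_phonemes=100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, num_phonemes + 1)  # +1 for the CTC blank

    def forward(self, feats):              # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)
        return self.proj(out).log_softmax(dim=-1)

model = LstmCtcAcousticModel()
ctc = nn.CTCLoss(blank=0)                  # sequence-level loss, no pre-alignment
feats = torch.randn(8, 200, 40)            # dummy batch: 8 utterances, 200 frames
targets = torch.randint(1, 101, (8, 20))   # dummy phoneme label sequences
log_probs = model(feats).transpose(0, 1)   # CTCLoss expects (frames, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 200, dtype=torch.long),
           target_lengths=torch.full((8,), 20, dtype=torch.long))
loss.backward()
```

Because CTC is a sequence-level loss, the dummy targets above need no frame-level alignment, which is exactly the advantage over conventional training that the preceding paragraph describes.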
The step of selecting a phoneme matched with the phoneme information from the preset phoneme library to obtain a target phoneme may specifically include:
(21) Extracting phonemes to be recognized corresponding to each word to be recognized in the voice to be recognized from the phoneme information;
(22) Calculating the similarity between the extracted phonemes to be recognized and each phoneme in a preset phoneme library to obtain the phoneme similarity;
(23) And determining a phoneme with the phoneme similarity larger than the preset phoneme similarity as a target phoneme.
It should be noted that the preset phoneme similarity may be set according to actual requirements, for example, may be set to 60%, 80% or 90%.
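A sketch of steps (21)-(23) under two stated assumptions: phonemes are represented as vectors, and cosine similarity stands in for the unspecified similarity measure:

```python
import numpy as np

def phoneme_similarity(a, b):
    """Cosine similarity between two phoneme representations; the application
    does not fix a metric, so this is an assumed stand-in."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_target_phonemes(phonemes_to_recognize, phoneme_library, threshold=0.6):
    """Steps (21)-(23): keep the library candidates whose phoneme similarity
    to an extracted phoneme exceeds the preset phoneme similarity."""
    targets = []
    for recognized in phonemes_to_recognize:
        for name, candidate in phoneme_library.items():
            if phoneme_similarity(recognized, candidate) > threshold:
                targets.append(name)
    return targets
```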
103. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the voice features are segmented based on the voice state of the voice to be recognized.
In this application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme: keyword detection is performed only when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word.
It should be noted that in keyword detection, interference from background noise often causes a mismatch between the actual speech and the training conditions; this is a root cause of the poor robustness of keyword detection systems (the other main cause is the inability to handle unexpected input) and leads to detection errors and degraded performance. Even two speech signals with identical content may differ in speaking rate, so the durations of the signals and the time gaps between phonemes differ; and for a time-varying, non-stationary speech signal, the features differ as well. There are gaps between phonemes, and gaps between silence and the speech itself. In a quiet environment without much background noise, the main error of a keyword detection system comes from inaccurate endpoint detection. Therefore, to improve the accuracy of subsequent keyword detection, in this application the voice features may be segmented using Voice Activity Detection (VAD). Because speech contains long stretches of silence, the main task of VAD is to accurately locate the start and end points of speech in noisy audio, i.e. to separate silence from actual speech. As a primary processing step on raw speech data, VAD is one of the key techniques in the speech signal processing pipeline. Here the voice state includes an active state (speech present) and a silent state (speech absent).
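The application names VAD but not a specific algorithm, so as a hedged illustration the following sketch labels frames by short-time energy; the frame layout and threshold value are assumptions:

```python
import numpy as np

def detect_voice_states(frames, energy_threshold=1e-3):
    """Label each frame of the voice to be recognized as active (speech
    present) or silent (speech absent). frames: 2-D array, one row per frame."""
    states = []
    for frame in np.asarray(frames, dtype=float):
        energy = float(np.mean(frame ** 2))
        states.append("active" if energy > energy_threshold else "silent")
    return states
```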
For example, the voice state of each voice frame in the voice information may be determined, and then the voice feature is segmented according to the voice frame in the active state, that is, optionally, in some embodiments, the step of "segmenting the voice feature based on the voice state of the voice to be recognized" may specifically include:
(31) Collecting voice frame information of each frame in voice to be recognized;
(32) Detecting the voice state of each frame in the voice to be recognized;
(33) Determining a voice frame with the voice state being an activated state as a target voice frame;
(34) And dividing the voice characteristics according to the determined target voice frame and the corresponding time stamp thereof.
The voice frame information includes the collected voice frames and the timestamp corresponding to each voice frame. For example, suppose 100 voice frames of the voice to be recognized are collected and are continuous in time, where frames 10-20, frames 25-28 and frames 59-79 are in the active state. The voice frames whose state is active are determined to be target voice frames, and the voice features are then segmented based on the timestamps corresponding to the target voice frames, yielding the voice features corresponding to frames 10-20, frames 25-28 and frames 59-79, with the remaining voice features corresponding to frames in the silent state.
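The segmentation of step (34) can then be sketched as grouping contiguous active frames, mirroring the 100-frame example above; the data layout is an assumption:

```python
def segment_voice_features(features, states, timestamps):
    """Group contiguous active frames into segments; with the example above,
    frames 10-20, 25-28 and 59-79 each yield one segment of voice features."""
    segments, current = [], []
    for feature, state, ts in zip(features, states, timestamps):
        if state == "active":
            current.append((ts, feature))
        elif current:          # a silent frame closes the open segment
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments            # each segment is a list of (timestamp, feature)
```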
To cascade the wake-up word detection scheme and the keyword detection scheme, the feature extraction and acoustic modeling parts can be combined with the wake-up module so that the computation of the two tasks is multiplexed. It should be noted that the acoustic model for keyword detection usually adopts a discriminative training objective function, while the acoustic model for wake-up word detection usually uses a Cross Entropy objective function, so there are slight differences between the two training objectives. To realize the cascade scheme of this application, the model parts can be multiplexed with the three strategies shown in fig. 2c: schemes (a) and (b) train the model with a single objective function, so that the wake-up word detection task and the keyword detection task share one model; scheme (c) adopts two different objective functions only in the last layer, which preserves the respective training advantages of the two tasks while sharing the computation of the model body to the greatest extent. After the wake-up result is confirmed twice in this way, in a complex game live-broadcast scenario the F1 score of keyword detection rises by roughly 1%, to about 73%. While performance is improved, the overall real-time rate of the service is estimated to be reduced by approximately 17%.
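A PyTorch sketch of multiplexing strategy (c), in which the two tasks share the whole encoder and differ only in the last layer; the module sizes and head shapes are assumptions:

```python
import torch.nn as nn

class SharedEncoderTwoHeads(nn.Module):
    """Strategy (c): wake-up word detection and keyword detection share all
    encoder computation and keep separate last layers / training objectives."""
    def __init__(self, feat_dim=40, hidden=256, num_phonemes=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.wake_head = nn.Linear(hidden, num_phonemes)         # cross-entropy objective
        self.keyword_head = nn.Linear(hidden, num_phonemes + 1)  # discriminative (e.g. CTC) objective

    def forward(self, feats):
        shared, _ = self.encoder(feats)   # computed once, reused by both tasks
        return self.wake_head(shared), self.keyword_head(shared)
```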
104. Keywords in the voice to be recognized are detected based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result.
In this application, the keyword detection scheme multiplexes the acoustic model part of the wake-up word detection scheme; that is, a part for segmenting the voice features and a part for keyword detection are added on top of the wake-up word detection scheme, and the phoneme information generated during wake-up word detection is reused. Specifically, in some embodiments, the phoneme information of the words to be recognized in the voice to be recognized may be obtained according to the acoustic model, and the keywords in the voice to be recognized are then detected based on the phoneme information, the segmented voice features and the preset keywords to obtain the keyword detection result.
Further, in order to convert the human language into a language that can be recognized by the machine, text translation may be performed on the speech to be recognized based on the segmented speech features, and then keyword detection may be performed on the basis of the translation result, that is, optionally, in some embodiments, the step of detecting keywords in the speech to be recognized based on the phoneme information, the segmented speech features, and the preset keywords to obtain a keyword detection result may specifically include:
(41) Text translation is carried out on the voice to be recognized according to the segmented voice characteristics, so that a voice text of the voice to be recognized is obtained;
(42) Determining words to be recognized corresponding to the phoneme information from the voice text;
(43) And selecting the word to be identified, which is matched with the preset keyword, from the determined words to be identified, so as to obtain a target keyword set.
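A minimal sketch of steps (42)-(43); the character-set overlap used here is an assumed stand-in for whatever keyword similarity measure an implementation chooses:

```python
def select_target_keywords(words_to_recognize, preset_keywords, threshold=0.8):
    """Keep words whose keyword similarity to some preset keyword exceeds
    the preset value, yielding the target keyword set."""
    targets = set()
    for word in words_to_recognize:
        for keyword in preset_keywords:
            union = set(word) | set(keyword)
            overlap = len(set(word) & set(keyword)) / len(union) if union else 0.0
            if overlap > threshold:
                targets.add(word)
    return targets
```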
It should be noted that if text translation were performed directly on the voice to be recognized from the unsegmented voice features, the resulting voice text would inevitably contain text corresponding to noise. In this application, translating the voice to be recognized based on the segmented voice features therefore improves the accuracy of the translation and, in turn, the accuracy of subsequent keyword detection.
Text translation of the voice to be recognized is performed with a language model. Language models are a basic part of many systems that try to solve natural language processing tasks such as machine translation and speech recognition. The language model may be an N-gram model, which represents the distribution of the language over a discrete space by counting statistics; it may also be a neural network model. A neural network language model represents words in a distributed manner, i.e. as the commonly mentioned word vectors, and maps words into a continuous space, which effectively alleviates the data sparseness problem; moreover, neural networks have strong pattern recognition capability.
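To make the N-gram idea concrete, here is a toy count-based bigram model; the corpus and interface are illustrative only:

```python
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Toy count-based bigram model of the N-gram kind described above:
    P(w2 | w1) is estimated from counts over a text corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

prob = train_bigram_lm(["open the music player", "open the door"])
print(prob("open", "the"))  # 1.0 in this toy corpus
```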
Specifically, a preset language model may be obtained, and the phoneme sequence is identified by using the preset language model to obtain a speech text of the speech to be identified, that is, optionally, in some embodiments, the step of "text translating the speech to be identified according to the segmented speech features to obtain the speech text of the speech to be identified" may specifically include:
(51) Generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information;
(52) And identifying the phoneme sequence by using a preset language model to obtain a voice text of the voice to be identified.
The language model is constructed in advance; by training on a large amount of text, the probabilities with which individual characters or words are associated with one another can be obtained. When the language model is used to recognize the second type of key terms (the keywords) in the voice to be recognized, the preset language model first recognizes the phoneme sequence corresponding to each segmented voice feature to obtain the voice text corresponding to the segmented voice features, thereby realizing the text translation of the segmented voice features.
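A sketch of steps (51)-(52); `language_model.decode` is a hypothetical interface standing in for whichever n-gram or neural decoder is actually used:

```python
def phonemes_to_text(segment_phoneme_sequences, language_model):
    """Decode the phoneme sequence of each segmented voice feature into text
    with a preset language model, then join the pieces into the voice text."""
    segment_texts = [language_model.decode(seq) for seq in segment_phoneme_sequences]
    return " ".join(segment_texts)
```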
After the voice text is obtained, the keywords in it can be recognized according to the preset keywords to obtain the keyword set. In particular, the voice text can be segmented into words so that words matching the preset keywords can subsequently be selected from it to obtain the target keyword set. Word segmentation belongs to natural language understanding and is the first step of semantic understanding: it is the technique of correctly separating the words in a sentence, and it underlies text classification, information retrieval, machine translation, and speech input and output of text. Chinese word segmentation is a particular difficulty because of the complexity of Chinese itself and its writing conventions. Word segmentation methods include dictionary-based, statistics-based and rule-based methods. The dictionary-based method, also called the mechanical word segmentation method, matches the character string to be analyzed against the entries of a machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds. By scanning direction, it divides into forward matching and reverse matching; by match length, into maximum matching and minimum matching. There are also many statistics-based word segmentation algorithms, common ones being probabilistic algorithms based on mutual information and Chinese word segmentation decision algorithms based on the degree of association.
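Of the dictionary-based methods just described, forward maximum matching is easy to sketch; the dictionary, maximum word length and example below are illustrative only:

```python
def forward_max_match(text, dictionary, max_len=5):
    """Dictionary-based ('mechanical') word segmentation by forward maximum
    matching: at each position take the longest dictionary entry that matches,
    falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                words.append(text[i:i + j])
                i += j
                break
    return words

# e.g. forward_max_match("打开音乐播放器", {"打开", "音乐", "播放器"})
# -> ['打开', '音乐', '播放器']
```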
Further, the keyword similarity between the determined to-be-identified words and the preset keywords can be calculated, then the to-be-identified words with the keyword similarity larger than the preset value are determined to be target keywords, and the target keyword set is obtained.
105. The wake-up word detection result and the keyword detection result are fused to obtain the keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded behind the wake-up word detection scheme, the recall rate of the first-level detection (wake-up word detection) can be increased by lowering the wake-up word detection threshold, and the threshold is then raised when the wake-up word detection result and the keyword detection result are fused to generate the keyword recognition result of the voice to be recognized. For example, phonemes matching the phoneme information are first selected from the preset phoneme library to obtain target phonemes, and the wake-up word set is generated based on the target phonemes; in this stage the matching value (the wake-up word detection threshold) may be set to 60%, and the phoneme information is considered a match when its matching value against the preset phoneme library exceeds 60%. Further screening is then performed in the fusion stage, where the matching value may be set to 80%: wake-up words whose matching value exceeds 80% are selected from the wake-up word set, and the selected wake-up words are added to the keyword set of the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
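A sketch of the two-threshold cascade just described, with the 60% and 80% values taken from the example; the (word, match score) interface is an assumption:

```python
def detect_wake_candidates(scored_words, low_threshold=0.60):
    """First-level detection: keep anything above a deliberately low threshold,
    trading precision for recall (precision is recovered at fusion time)."""
    return [(word, score) for word, score in scored_words if score > low_threshold]

def fuse_results(wake_candidates, keyword_set, high_threshold=0.80):
    """Fusion: re-screen the wake-up word candidates at a higher threshold and
    merge the survivors into the keyword detection result."""
    confirmed = {word for word, score in wake_candidates if score > high_threshold}
    return keyword_set | confirmed  # keyword recognition result
```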
According to the method, after the voice features are obtained, the wake-up words in the voice to be recognized are detected according to the voice features to obtain the wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the voice features are segmented based on the voice state of the voice to be recognized, and the acoustic model that was used for wake-up word detection on the voice to be recognized is invoked. The keywords in the voice to be recognized are then detected based on the segmented voice features, the preset acoustic model and the preset keywords to obtain the keyword detection result. Finally, the wake-up word detection result and the keyword detection result are fused to obtain the keyword recognition result of the voice to be recognized.
The method according to the embodiment will be described in further detail by way of example.
In this embodiment, the keyword detection apparatus will be described by taking the specific integration of the keyword detection apparatus in a server as an example.
Referring to fig. 2a, a keyword detection method may specifically include the following steps:
201. The server performs feature extraction on the voice to be recognized to obtain the voice features.
Specifically, the server may receive the voice to be recognized collected by the microphone of a terminal device (such as a mobile phone or a notebook computer), download the voice to be recognized from a network database over a wired or wireless connection, or access a local database to obtain it, as the actual situation requires. After obtaining the voice to be recognized, the server may use a deep learning network to extract its features.
202. The server detects the wake-up words in the voice to be recognized according to the voice features to obtain a wake-up word detection result.
The wake-up word detection result includes the detected wake-up words and their phoneme information. The server processes the voice features with the preset acoustic model to obtain the phoneme information corresponding to each word to be recognized in the voice to be recognized, and then selects phonemes matching the phoneme information from the preset phoneme library to obtain the wake-up word detection result.
203. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the server segments voice features based on the voice state of the voice to be recognized.
In this application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme, and keyword detection is performed only when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word. For example, the server may determine the voice state of each voice frame in the voice information and then segment the voice features according to the frames in the active state. Further, the server may collect the voice frame information of each frame in the voice to be recognized, detect the voice state of each frame, determine the frames whose state is active as target voice frames, and finally segment the voice features according to the determined target voice frames and their corresponding timestamps.
204. The server detects the keywords in the voice to be recognized based on the segmented voice features, a preset acoustic model and preset keywords, and a keyword detection result is obtained.
Specifically, the server may obtain the phoneme information of the words to be recognized in the voice to be recognized according to the acoustic model, and then detect the keywords in the voice to be recognized based on the phoneme information, the segmented voice features and the preset keywords to obtain the keyword detection result.
205. The server fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded behind the wake-up word detection scheme, the recall rate of the first-level detection (wake-up word detection) can be increased by lowering the wake-up word detection threshold, and the threshold is raised in the subsequent fusion of the wake-up word detection result and the keyword detection result to generate the keyword recognition result of the voice to be recognized.
To facilitate further understanding of the keyword detection scheme of this application, please refer to fig. 2b, a flowchart of the keyword detection system. The keyword detection system includes a wake-up word detection module, a keyword detection module and a word fusion module; the wake-up word detection module includes a feature extraction unit, an acoustic recognition unit and a decoding unit, and the keyword detection module includes a segmentation unit and a decoding unit. In the wake-up word detection task, the feature extraction unit performs feature extraction on the voice to be recognized and sends the voice features to the acoustic recognition unit, which learns the mapping relationship between voice features and phoneme information; the decoding unit in the wake-up word detection module then recognizes the pronunciation sequence of the wake-up word to obtain the wake-up word set. In addition, the segmentation unit segments the voice features based on the voice state of the voice to be recognized, and the decoding unit in the keyword detection module recognizes the keywords in the voice to be recognized according to the segmented voice features, the phoneme information and the preset keywords to obtain the keyword set. Finally, the word fusion module fuses the wake-up word set and the keyword set to obtain the keyword recognition result of the voice to be recognized.
In this application, after the server performs feature extraction on the voice to be recognized to obtain the voice features, the server detects the wake-up word in the voice to be recognized according to the voice features to obtain the wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the server segments the voice features based on the voice state of the voice to be recognized and invokes the acoustic model that was used for wake-up word detection on the voice to be recognized. The server then detects the keywords in the voice to be recognized based on the segmented voice features, the acoustic model and the preset keywords to obtain the keyword detection result. Finally, the server fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
In order to facilitate better implementation of the keyword detection method of the present application, the present application further provides a keyword detection device (abbreviated as a recognition device) based on the above method. The terms used here have the same meanings as in the keyword detection method above, and for specific implementation details reference may be made to the description of the method embodiments.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a keyword detection apparatus provided in the present application, where the recognition apparatus may include an extraction module 301, a first detection module 302, a segmentation module 303, a second detection module 304, and a fusion module 305, and may specifically be as follows:
the extracting module 301 is configured to perform feature extraction on the voice to be recognized, so as to obtain a voice feature.
For example, the extraction module 301 may employ a deep learning network to extract features of the speech to be recognized.
The first detection module 302 is configured to detect wake-up words in the voice to be recognized according to a preset acoustic model and voice features, and obtain a wake-up word detection result.
The wake-up word detection result includes the detected wake-up word and phoneme information of the wake-up word, the first detection module 302 processes the voice feature by using a preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized, and then selects phonemes matched with the phoneme information from a preset phoneme library to obtain the wake-up word detection result.
Optionally, in some embodiments, the first detection module 302 may specifically include:
the second acquisition unit is used for acquiring a preset acoustic model and a preset phoneme library;
the recognition unit is used for recognizing the voice characteristics by adopting a preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized;
the selecting unit is used for selecting phonemes matched with the phoneme information from a preset phoneme library to obtain target phonemes;
and the generating unit is used for generating a wake-up word set based on the obtained target phonemes.
Alternatively, in some embodiments, the selection unit may specifically be configured to: extracting a phoneme to be recognized corresponding to each word to be recognized in the voice to be recognized from the phoneme information, calculating the similarity between the extracted phoneme to be recognized and each candidate phoneme of the preset phoneme base to obtain the phoneme similarity, and obtaining a target phoneme by using the candidate phonemes with the phoneme similarity larger than the preset phoneme similarity.
The segmentation module 303 is configured to segment the voice feature based on the voice state of the voice to be recognized when the wake word detection result indicates that the voice to be recognized includes the wake word.
In this application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme, and keyword detection is performed when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word. For example, the segmentation module 303 may determine the voice state of each voice frame in the voice information and segment the voice features according to the frames in the active state. Further, the segmentation module 303 may collect the voice frame information of each frame in the voice to be recognized, detect the voice state of each frame, determine the frames whose state is active as target voice frames, and finally segment the voice features according to the determined target voice frames and their corresponding timestamps. That is, in some embodiments, the segmentation module 303 may specifically be configured to: collect the voice frame information of each frame in the voice to be recognized, detect the voice state of each frame, determine the voice frames whose state is active as target voice frames, and segment the voice features according to the determined target voice frames and their corresponding timestamps.
The voice frame information comprises collected voice frames and time stamps corresponding to the voice frames.
The second detection module 304 is configured to detect a keyword in the voice to be recognized based on the segmented voice feature, the preset acoustic model and the preset keyword, so as to obtain a keyword detection result.
For example, the second detection module 304 may obtain the phoneme information of the words to be recognized in the voice to be recognized according to the preset acoustic model, and then detect the keywords in the voice to be recognized based on the phoneme information, the segmented voice features and the preset keywords to obtain the keyword detection result.
That is, optionally, in some embodiments, the second detection module 304 may specifically include:
the first acquisition unit is used for acquiring phoneme information of a word to be recognized in the voice to be recognized according to a preset acoustic model;
and the detection unit is used for detecting the keywords in the voice to be recognized based on the phoneme information, the segmented voice characteristics and the preset keywords to obtain a keyword detection result.
Optionally, in some embodiments, the detection unit may specifically include:
a translation subunit, configured to perform text translation on the voice to be recognized according to the segmented voice features, so as to obtain a voice text of the voice to be recognized;
A determining subunit, configured to determine a word to be identified corresponding to the phoneme information from the voice text;
and the selecting subunit is used for selecting the word to be identified, which is matched with the preset keyword, from the determined words to be identified, so as to obtain a target keyword set.
Alternatively, in some embodiments, the selection subunit may be specifically configured to: and calculating the keyword similarity between the determined to-be-identified words and the preset keywords, and determining the to-be-identified words with the keyword similarity larger than the preset value as target keywords to obtain a target keyword set.
Alternatively, in some embodiments, the translation subunit may be specifically configured to: and generating a phoneme sequence corresponding to each segmented voice characteristic according to the phoneme information, and identifying the phoneme sequence by using a preset language model to obtain a voice text of the voice to be identified.
And the fusion module 305 is configured to fuse the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded to the wake-up word detection scheme, the recall rate of primary detection (wake-up word detection) can be increased by reducing the wake-up word detection threshold in the wake-up word detection scheme, and the wake-up word detection threshold is increased in the subsequent fusion of the wake-up word detection result and the keyword detection result so as to generate the keyword recognition result of the voice to be recognized.
In this application, after the extraction module 301 performs feature extraction on the voice to be recognized to obtain the voice features, the first detection module 302 detects the wake-up word in the voice to be recognized according to the voice features to obtain the wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the segmentation module 303 segments the voice features based on the voice state of the voice to be recognized; the second detection module 304 then detects the keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and the preset keywords to obtain the keyword detection result; and finally the fusion module 305 fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized. When detecting the keywords in the voice to be recognized, the keyword detection device provided by this application makes use of the phoneme information of the wake-up word, which improves the efficiency of recognizing keywords in the voice to be recognized; at the same time, the two different types of key terms in the voice to be recognized (the keywords and the wake-up words) are recognized in two separate stages, so keyword detection accuracy is guaranteed while keyword detection efficiency is improved.
In addition, the present application further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the present application, specifically:
the electronic device may include one or more processing cores 'processors 401, one or more computer-readable storage media's memory 402, power supply 403, and input unit 404, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the electronic device. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging and power-consumption management are handled by the power management system. The power supply 403 may also include one or more of a direct-current or alternating-current power supply, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The electronic device may further comprise an input unit 404, which may be used to receive input digital or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing the following functions:
extracting features of the voice to be recognized to obtain voice features; detecting wake-up words in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, segmenting the voice features based on the voice state of the voice to be recognized; detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model, and the preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments, and details are not repeated herein.
According to the method, after feature extraction is performed on the voice to be recognized to obtain voice features, the wake-up words in the voice to be recognized are detected according to the preset acoustic model and the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, the voice features are segmented based on the voice state of the voice to be recognized; then the keywords in the voice to be recognized are detected based on the segmented voice features, the preset acoustic model, and the preset keywords to obtain a keyword detection result; and finally the wake-up word detection result and the keyword detection result are fused to obtain the keyword recognition result of the voice to be recognized.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be completed by instructions, or by controlling associated hardware through instructions, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any one of the keyword detection methods provided herein. For example, the instructions may perform the steps of:
extracting features of the voice to be recognized to obtain voice features; detecting wake-up words in the voice to be recognized according to the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, segmenting the voice features based on the voice state of the voice to be recognized; detecting keywords in the voice to be recognized based on the segmented voice features, a preset acoustic model, and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments, and details are not repeated herein.
Wherein the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the storage medium may perform the steps of any keyword detection method provided in the present application, they can achieve the beneficial effects achievable by any keyword detection method provided in the present application, which are described in detail in the previous embodiments and are not repeated herein.
According to one aspect of the present application, there is provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations described above.
The keyword detection method, apparatus, electronic device, and storage medium provided in the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to aid understanding of the method and its core idea. Meanwhile, those skilled in the art may make variations to the specific embodiments and the application scope in light of the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (9)

1. A keyword recognition method, comprising:
extracting features of a voice to be recognized to obtain voice features;
detecting a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result;
wherein the detecting the wake-up word in the voice to be recognized according to the preset acoustic model and the voice features to obtain the wake-up word detection result comprises: acquiring the preset acoustic model and a preset phoneme library; recognizing the voice features by using the preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized; setting a matching value, and selecting phonemes matching the phoneme information from the preset phoneme library to obtain target phonemes; and generating a wake-up word set based on the obtained target phonemes;
when the wake-up word detection result indicates that the voice to be recognized contains a wake-up word, segmenting the voice features based on the voice state of the voice to be recognized;
detecting keywords in the voice to be recognized based on the phoneme information of the words to be recognized in the voice to be recognized obtained by the preset acoustic model, the segmented voice features, and preset keywords, to obtain a keyword detection result; wherein, when the preset acoustic model is constructed, the acoustic modeling part of a feature extraction and keyword detection module is combined with a wake-up word detection module so that the keyword detection module and the wake-up word detection module are multiplexed, and during training, two different objective functions are adopted at the last mapping layer, the keyword detection module corresponding to a discriminative function and the wake-up word detection module corresponding to a cross-entropy function; and
setting another matching value that is larger than the matching value set when the target phonemes were obtained, selecting a wake-up word from the wake-up word set according to the other matching value, and adding the selected wake-up word to the keyword detection result, the keyword detection result after the addition being the keyword recognition result of the voice to be recognized.
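As a rough illustration of the multiplexed acoustic model recited in this claim, the sketch below builds a shared encoder whose last mapping layer is split into two heads trained under two objectives. The use of PyTorch, the layer sizes, and the choice of a multi-class margin loss as a stand-in for the claim's "discriminative function" are all assumptions; the claim names the two objectives but does not fix their exact form.

import torch
import torch.nn as nn

class SharedAcousticModel(nn.Module):
    # Shared acoustic encoder multiplexed between wake-word detection
    # and keyword detection, with one task head per objective.
    def __init__(self, feat_dim=40, hidden=128, n_phones=100, n_keywords=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.wake_head = nn.Linear(hidden, n_phones)       # cross-entropy objective
        self.keyword_head = nn.Linear(hidden, n_keywords)  # discriminative objective

    def forward(self, x):
        h = self.encoder(x)
        return self.wake_head(h), self.keyword_head(h)

model = SharedAcousticModel()
ce_loss = nn.CrossEntropyLoss()     # wake-word detection objective
margin_loss = nn.MultiMarginLoss()  # assumed stand-in for the discriminative objective

feats = torch.randn(8, 40)                    # a batch of frame features
wake_targets = torch.randint(0, 100, (8,))    # phoneme labels
keyword_targets = torch.randint(0, 10, (8,))  # keyword labels
wake_logits, keyword_logits = model(feats)
loss = ce_loss(wake_logits, wake_targets) + margin_loss(keyword_logits, keyword_targets)
loss.backward()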
2. The method of claim 1, wherein the detecting keywords in the voice to be recognized based on the phoneme information of the words to be recognized in the voice to be recognized obtained by the preset acoustic model, the segmented voice features, and the preset keywords to obtain a keyword detection result comprises:
performing text translation on the voice to be recognized according to the segmented voice features to obtain a voice text of the voice to be recognized;
determining the words to be recognized corresponding to the phoneme information from the voice text;
and selecting, from the determined words to be recognized, the words to be recognized that match the preset keywords to obtain a target keyword set.
3. The method according to claim 2, wherein the selecting, from the determined words to be recognized, the words to be recognized that match the preset keywords to obtain the target keyword set comprises:
calculating the keyword similarity between the determined words to be recognized and the preset keywords;
and determining the words to be recognized whose keyword similarity is larger than a preset value as target keywords to obtain the target keyword set.
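A minimal sketch of the selection step in claims 2 and 3, assuming a string-level similarity computed with Python's standard difflib and an arbitrary preset value of 0.8; the claims fix neither the similarity measure nor the threshold.

from difflib import SequenceMatcher

def select_target_keywords(candidate_words, preset_keywords, preset_value=0.8):
    # Keep candidate words whose best similarity to any preset keyword
    # exceeds the preset value.
    targets = []
    for word in candidate_words:
        best = max(SequenceMatcher(None, word, kw).ratio() for kw in preset_keywords)
        if best > preset_value:
            targets.append(word)
    return targets

print(select_target_keywords(["volume", "volum", "hello"], ["volume", "pause"]))
# ['volume', 'volum']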
4. The method according to claim 2, wherein the performing text translation on the voice to be recognized according to the segmented voice features to obtain the voice text of the voice to be recognized comprises:
generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information;
and recognizing the phoneme sequences by using a preset language model to obtain the voice text of the voice to be recognized.
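Claim 4's decoding step might look like the toy sketch below, where both the lexicon mapping phoneme sequences to words and the unigram scores standing in for the preset language model are hypothetical placeholders; the claim does not specify the language model.

# Hypothetical lexicon and unigram "language model" scores.
LEXICON = {("h", "e", "l", "ou"): ["hello", "hollow"], ("p", "l", "ei"): ["play"]}
LM_SCORES = {"hello": 0.9, "hollow": 0.2, "play": 0.8}

def phoneme_sequences_to_text(phoneme_sequences):
    # Map each per-segment phoneme sequence to candidate words via the
    # lexicon, then keep the candidate the language model scores highest.
    words = []
    for seq in phoneme_sequences:
        candidates = LEXICON.get(tuple(seq), ["<unk>"])
        words.append(max(candidates, key=lambda w: LM_SCORES.get(w, 0.0)))
    return " ".join(words)

print(phoneme_sequences_to_text([["h", "e", "l", "ou"], ["p", "l", "ei"]]))
# hello play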
5. The method of claim 1, wherein the selecting phonemes matching the phoneme information from the preset phoneme library to obtain target phonemes comprises:
extracting, from the phoneme information, the phonemes to be recognized corresponding to each word to be recognized in the voice to be recognized;
calculating the similarity between each extracted phoneme to be recognized and each phoneme in the preset phoneme library to obtain phoneme similarities;
and determining the phonemes whose phoneme similarity is larger than a preset phoneme similarity as target phonemes.
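One plausible reading of claim 5 is that both the extracted phonemes and the library phonemes are represented as vectors (for example, posterior or embedding vectors) compared by cosine similarity. The sketch below makes that representation, and the preset phoneme similarity of 0.7, explicit as assumptions.

import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_target_phonemes(extracted, phoneme_library, preset_similarity=0.7):
    # Keep library phonemes whose similarity to any extracted phoneme
    # exceeds the preset phoneme similarity.
    targets = []
    for name, ref in phoneme_library.items():
        if any(cosine(vec, ref) > preset_similarity for vec in extracted):
            targets.append(name)
    return targets

library = {"ni": [1.0, 0.0], "hao": [0.0, 1.0]}
print(select_target_phonemes([[0.9, 0.1]], library))  # ['ni']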
6. The method according to any one of claims 1 to 5, wherein the segmenting the voice features based on the voice state of the voice to be recognized comprises:
collecting voice frame information of each frame in the voice to be recognized, wherein the voice frame information comprises the collected voice frames and the timestamps corresponding to the voice frames;
detecting the voice state of each frame in the voice to be recognized;
determining the voice frames whose voice state is the activated state as target voice frames;
and segmenting the voice features according to the determined target voice frames and their corresponding timestamps.
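A minimal sketch of claim 6, assuming the per-frame voice state is decided by a simple energy test (the claim only requires some per-frame voice-state decision, so the activation test and threshold are assumptions): consecutive activated frames form one segment, and the timestamps are kept so the voice features can be cut at the segment boundaries.

def segment_voice_features(frames, timestamps, energy_threshold=0.01):
    # Mark each frame activated/inactive, then cut the feature stream
    # at the boundaries of runs of activated (target) frames.
    segments, current = [], []
    for frame, ts in zip(frames, timestamps):
        energy = sum(v * v for v in frame) / len(frame)
        if energy > energy_threshold:   # activated state: a target frame
            current.append((frame, ts))
        elif current:                   # an active run just ended
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

frames = [[0.5, 0.4], [0.0, 0.0], [0.3, 0.2]]
print(len(segment_voice_features(frames, [0.00, 0.01, 0.02])))  # 2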
7. A keyword detection apparatus, comprising:
the extraction module is used for extracting features of a voice to be recognized to obtain voice features;
the first detection module is used for detecting a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result;
the first detection module is specifically configured to: acquire the preset acoustic model and a preset phoneme library; recognize the voice features by using the preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized; set a matching value, and select phonemes matching the phoneme information from the preset phoneme library to obtain target phonemes; and generate a wake-up word set based on the obtained target phonemes;
the segmentation module is used for segmenting the voice features based on the voice state of the voice to be recognized when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word;
the calling module is used for calling an acoustic model corresponding to the voice to be recognized when wake-up word detection is performed on the voice to be recognized;
the second detection module is used for detecting keywords in the voice to be recognized based on the phoneme information of the words to be recognized in the voice to be recognized obtained by the preset acoustic model, the segmented voice features, and preset keywords, to obtain a keyword detection result;
wherein, when the preset acoustic model is constructed, the acoustic modeling part of a feature extraction and keyword detection module is combined with the wake-up word detection module so that the keyword detection module and the wake-up word detection module are multiplexed, and during training, two different objective functions are adopted at the last mapping layer, the keyword detection module corresponding to a discriminative function and the wake-up word detection module corresponding to a cross-entropy function; and
the fusion module is used for setting another matching value that is larger than the matching value set when the target phonemes were obtained, selecting a wake-up word from the wake-up word set according to the other matching value, and adding the selected wake-up word to the keyword detection result, the keyword detection result after the addition being the keyword recognition result of the voice to be recognized.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the keyword detection method of any one of claims 1-6 when executing the program.
9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the keyword detection method of any one of claims 1-6.
CN202010915963.XA 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium Active CN112151015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010915963.XA CN112151015B (en) 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112151015A (en) 2020-12-29
CN112151015B (en) 2024-03-12

Family

ID=73891042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010915963.XA Active CN112151015B (en) 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112151015B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113641795A (en) * 2021-08-20 2021-11-12 上海明略人工智能(集团)有限公司 Method and device for dialectical statistics, electronic equipment and storage medium
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN113888846B (en) * 2021-09-27 2023-01-24 深圳市研色科技有限公司 Method and device for reminding driving in advance
CN114299997A (en) * 2021-12-15 2022-04-08 北京声智科技有限公司 Audio data processing method and device, electronic equipment, storage medium and product
CN114596840B (en) * 2022-03-04 2024-06-18 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN115910045B (en) * 2023-03-10 2023-06-06 北京建筑大学 Model training method and recognition method for voice wake-up word


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107134279B (en) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 Voice awakening method, device, terminal and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473536A (en) * 2019-08-20 2019-11-19 北京声智科技有限公司 A kind of awakening method, device and smart machine
CN110517670A (en) * 2019-08-28 2019-11-29 苏州思必驰信息科技有限公司 Promote the method and apparatus for waking up performance
CN110570840A (en) * 2019-09-12 2019-12-13 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111083065A (en) * 2019-12-23 2020-04-28 珠海格力电器股份有限公司 Method for preventing input command from being blocked, storage medium and computer equipment
CN111312222A (en) * 2020-02-13 2020-06-19 北京声智科技有限公司 Awakening and voice recognition model training method and device
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40036255; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant