CN112151015A - Keyword detection method and device, electronic equipment and storage medium - Google Patents

Keyword detection method and device, electronic equipment and storage medium

Info

Publication number
CN112151015A
CN112151015A
Authority
CN
China
Prior art keywords
voice
recognized
keyword
phoneme
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010915963.XA
Other languages
Chinese (zh)
Other versions
CN112151015B (en)
Inventor
吕志强
黄申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010915963.XA priority Critical patent/CN112151015B/en
Publication of CN112151015A publication Critical patent/CN112151015A/en
Application granted granted Critical
Publication of CN112151015B publication Critical patent/CN112151015B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a keyword detection method and device, an electronic device and a storage medium. The keyword detection method comprises the following steps: performing feature extraction on the voice to be recognized to obtain voice features; detecting a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, segmenting the voice features based on the voice state of the voice to be recognized; detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.

Description

Keyword detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a keyword detection method and device, electronic equipment and a storage medium.
Background
Artificial Intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines, so that machines can perceive, reason and make decisions. AI is a broad discipline spanning many fields, such as natural language processing and machine learning/deep learning; as the technology develops, it will be applied in ever more fields and play an increasingly important role.
However, current keyword detection often fails to meet user requirements, and problems such as delay and misrecognition frequently occur during the detection process.
Disclosure of Invention
The application provides a keyword detection method and device, an electronic device and a storage medium, which can improve keyword detection efficiency while ensuring keyword detection accuracy.
The application provides a keyword detection method, which comprises the following steps:
performing feature extraction on the voice to be recognized to obtain voice features;
detecting a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result;
when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, segmenting the voice features based on the voice state of the voice to be recognized;
detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result;
and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
Correspondingly, the application further provides a keyword detection device, which includes:
the extraction module is used for performing feature extraction on the voice to be recognized to obtain voice features;
the first detection module is used for detecting a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result;
the segmentation module is used for segmenting the voice features based on the voice state of the voice to be recognized when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word;
the second detection module is used for detecting keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result;
and the fusion module is used for fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
Optionally, in some embodiments of the present application, the second detection module includes:
the first obtaining unit is used for obtaining phoneme information of a word to be recognized in the voice to be recognized according to the preset acoustic model;
and the detection unit is used for detecting the keywords in the voice to be recognized based on the phoneme information, the segmented voice characteristics and the preset keywords to obtain a keyword detection result.
Optionally, in some embodiments of the present application, the detection unit includes:
the translation subunit is used for performing text translation on the voice to be recognized according to the segmented voice features to obtain a voice text of the voice to be recognized;
a determining subunit, configured to determine a word to be recognized corresponding to the phoneme information from the speech text;
and the selecting subunit is used for selecting the words to be recognized matched with the preset keywords from the determined words to be recognized to obtain a target keyword set.
Optionally, in some embodiments of the present application, the selecting subunit is specifically configured to:
calculating keyword similarity between the determined word to be recognized and a preset keyword;
and determining the words to be recognized with the similarity of the keywords larger than a preset value as target keywords to obtain a target keyword set.
Optionally, in some embodiments of the present application, the translation subunit is specifically configured to:
generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information;
and recognizing the phoneme sequence by using a preset language model to obtain a voice text of the voice to be recognized.
Optionally, in some embodiments of the present application, the first detection module includes:
the second acquisition unit is used for acquiring a preset acoustic model and a preset phoneme library;
the recognition unit is used for recognizing the voice features by adopting the preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized;
the selecting unit is used for selecting a phoneme matched with the phoneme information from a preset phoneme library to obtain a target phoneme;
and the generating unit is used for generating a wake-up word set based on the obtained target phoneme.
Optionally, in some embodiments of the present application, the selecting unit is specifically configured to:
extracting the phonemes to be recognized corresponding to each word to be recognized in the voice to be recognized from the phoneme information;
calculating the similarity between the extracted phonemes to be recognized and each candidate phoneme in the preset phoneme library to obtain the phoneme similarity;
and taking the candidate phonemes whose phoneme similarity is greater than the preset phoneme similarity as the target phonemes.
Optionally, in some embodiments of the present application, the segmentation module is specifically configured to:
collecting voice frame information of each frame in the voice to be recognized, wherein the voice frame information comprises collected voice frames and timestamps corresponding to the voice frames;
detecting the voice state of each frame in the voice to be recognized;
determining a voice frame with a voice state being an activated state as a target voice frame;
and segmenting the voice features according to the determined target voice frame and the corresponding time stamp.
In the application, feature extraction is performed on the voice to be recognized to obtain voice features, and the wake-up word in the voice to be recognized is detected according to the preset acoustic model and the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the voice features are segmented based on the voice state of the voice to be recognized; keywords in the voice to be recognized are then detected based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result; finally, the wake-up word detection result and the keyword detection result are fused to obtain the keyword recognition result of the voice to be recognized. The keyword detection method and device of the application can therefore improve keyword detection efficiency while ensuring keyword detection accuracy.
Drawings
In order to illustrate the technical solutions in the application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic view of a scenario of a keyword detection method provided in the present application;
FIG. 1b is a schematic flow chart of a keyword detection method provided in the present application;
FIG. 2a is another schematic flow chart of a keyword detection method provided in the present application;
FIG. 2b is a flow diagram of a keyword detection system provided herein;
FIG. 2c is a schematic diagram of neural network training strategies in the keyword detection method provided herein;
FIG. 3 is a schematic structural diagram of a keyword detection apparatus provided in the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the invention.
The application provides a keyword detection method, a keyword detection device, electronic equipment and a storage medium.
The keyword detection device may be integrated in a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the application.
For example, referring to FIG. 1a, the application provides a keyword detection system including a user 10, a terminal 20 and a server 30. After receiving the voice to be recognized input by the user 10 through its microphone, the terminal 20 transmits the voice to the server 30. The server 30 performs feature extraction on the voice to be recognized to obtain voice features, and then detects a wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the server 30 segments the voice features based on the voice state of the voice to be recognized, and then detects keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and preset keywords to obtain a keyword detection result. Finally, the server 30 fuses the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
In the keyword detection method, the phoneme information of the wake-up word is reused when detecting keywords in the voice to be recognized, which improves the efficiency of recognizing keywords in the voice to be recognized; meanwhile, the different types of words (keywords and wake-up words) in the voice to be recognized are recognized in two separate stages, so keyword detection accuracy is ensured while keyword detection efficiency is improved.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
A keyword detection method includes: performing feature extraction on the voice to be recognized to obtain voice features; detecting a wake-up word in the voice to be recognized according to the voice features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, segmenting the voice features based on the voice state of the voice to be recognized; detecting keywords in the voice to be recognized based on the segmented voice features, a preset acoustic model and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
Referring to fig. 1b, fig. 1b is a schematic flow chart of a keyword detection method provided in the present application. The specific process of the keyword detection method can be as follows:
101. Perform feature extraction on the voice to be recognized to obtain voice features.
For example, the voice to be recognized may be collected through a microphone of a terminal device (such as a mobile phone or a laptop computer), downloaded from a network database through wired or wireless communication, or obtained by accessing a local database; the choice may be made according to the actual situation. After the voice to be recognized is obtained, its features may be extracted using a deep learning network.
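As a minimal illustration (not part of the claimed method), the following Python sketch extracts MFCC features frame by frame with the librosa library; the sampling rate, window length, hop size and feature dimension are assumptions chosen for this example, and any deep-learning feature extractor could stand in its place:

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Load one utterance and return a (num_frames, n_mfcc) feature matrix."""
    audio, _ = librosa.load(wav_path, sr=sr)   # resample to the assumed 16 kHz
    mfcc = librosa.feature.mfcc(
        y=audio, sr=sr, n_mfcc=n_mfcc,
        n_fft=400, hop_length=160,             # 25 ms windows with a 10 ms hop
    )
    return mfcc.T                              # one feature vector per frame
```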
102. Detect the wake-up word in the voice to be recognized according to the preset acoustic model and the voice features to obtain a wake-up word detection result.
It should be noted that, in the application, the wake-up word may be a single character or word, and the wake-up word detection result includes the detected wake-up word and the phoneme information of the wake-up word. The phoneme is the smallest basic unit of speech and is the basis on which humans distinguish one word from another; phonemes form syllables, and syllables in turn form different words and phrases. Phonemes are divided into vowels and consonants. A vowel, also called a vocalic sound, is produced when the airflow passes through the mouth unobstructed; different vowels arise from different mouth shapes (vowels are closely related to formants). A consonant is produced when the airflow is obstructed in the mouth or pharynx; different consonants arise from different articulation places and articulation methods. The literature also frequently involves the concepts of unvoiced and voiced sounds; strictly speaking, many features must be extracted to distinguish them. When the airflow passes through the glottis and rushes at high speed through a vocal tract of small cross-sectional area, turbulence is produced, and when the ratio of airflow velocity to cross-sectional area exceeds a certain critical value, fricatives, i.e., unvoiced sounds, are produced. In short, the vocal cords do not vibrate when unvoiced sounds are emitted, so unvoiced sounds are aperiodic; they are generated by air friction and are treated as noise in analytical studies. In phonetics, sounds produced by vocal cord vibration during pronunciation are called voiced sounds, and voiced sounds are periodic. Consonants may be unvoiced or voiced, while the vowels of most languages are voiced.
For example, a preset acoustic model and a preset phoneme library may be used to detect the wake-up word in the voice to be recognized: the preset acoustic model processes the voice features to obtain the phoneme information corresponding to each word to be recognized in the voice to be recognized, and phonemes matching the phoneme information are then selected from the preset phoneme library to obtain the wake-up word detection result. That is, optionally, in some embodiments, the step of "detecting the wake-up word in the voice to be recognized according to the voice features to obtain a wake-up word detection result" may specifically include:
(11) acquiring a preset acoustic model and a preset phoneme library;
(12) recognizing the voice features by adopting a preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized;
(13) selecting a phoneme matched with the phoneme information from a preset phoneme library to obtain a target phoneme;
(14) generating a wake-up word set based on the obtained target phonemes.
The acoustic model is pre-constructed and may be obtained by jointly training a Long Short-Term Memory network (LSTM) with Connectionist Temporal Classification (CTC). In traditional acoustic model training, the speech alignment process must be iterated many times to become accurate, so the whole training process is time-consuming. By contrast, acoustic model training that uses CTC as the loss function is completely end to end: no data alignment is needed in advance, and training only requires one input sequence and one output sequence, so the data need not be aligned and labeled one by one, and CTC directly outputs sequence prediction probabilities without external post-processing. LSTM, for its part, is a recurrent network designed for the long-term dependence problem of the general Recurrent Neural Network (RNN); because of the special gating processing inside the LSTM, the joint LSTM-CTC training does not suffer from vanishing or exploding gradients.
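To make the joint LSTM-CTC training concrete, here is a minimal PyTorch sketch; the layer sizes, the phoneme inventory of 100 units and the dummy batch are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """LSTM encoder mapping frame features to per-frame phoneme posteriors."""
    def __init__(self, feat_dim: int = 40, hidden: int = 256, num_phonemes: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, num_phonemes + 1)   # +1 for the CTC blank

    def forward(self, feats):                       # feats: (batch, frames, feat_dim)
        out, _ = self.lstm(feats)
        return self.proj(out).log_softmax(dim=-1)   # log-probs, as CTCLoss expects

model = AcousticModel()
ctc = nn.CTCLoss(blank=100)                         # blank index = num_phonemes

feats = torch.randn(8, 200, 40)                     # dummy batch: 8 utterances, 200 frames
targets = torch.randint(0, 100, (8, 30))            # unaligned phoneme label sequences
in_lens = torch.full((8,), 200, dtype=torch.long)
tgt_lens = torch.full((8,), 30, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)            # CTCLoss wants (frames, batch, classes)
loss = ctc(log_probs, targets, in_lens, tgt_lens)   # no frame-level alignment required
loss.backward()
```

Note how the targets are plain label sequences with no frame alignment, which is exactly the end-to-end property described above.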
The step of selecting a phoneme matched with the phoneme information from the preset phoneme library to obtain the target phoneme may specifically include:
(21) extracting phonemes to be recognized corresponding to each word to be recognized in the speech to be recognized from the phoneme information;
(22) calculating the similarity between the extracted phoneme to be recognized and each phoneme in a preset phoneme library to obtain the phoneme similarity;
(23) and determining the phoneme with the phoneme similarity larger than the preset phoneme similarity as the target phoneme.
It should be noted that the preset phoneme similarity may be set according to actual requirements, and may be set to 60%, 80%, or 90%, for example.
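As an illustration of steps (21)-(23), the sketch below thresholds a similarity score against the preset phoneme similarity; the patent does not fix the similarity measure, so cosine similarity over hypothetical phoneme embedding vectors is assumed here:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_target_phonemes(phoneme_vec: np.ndarray,
                           phoneme_library: dict[str, np.ndarray],
                           threshold: float = 0.8) -> list[str]:
    """Return the candidate phonemes whose similarity to the phoneme to be
    recognized exceeds the preset phoneme similarity (here 80%)."""
    return [name for name, vec in phoneme_library.items()
            if cosine(phoneme_vec, vec) > threshold]
```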
103. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, segment the voice features based on the voice state of the voice to be recognized.
In the application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme, and keyword detection is performed only when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word.
It should be noted that, during keyword detection, interference from background noise often causes the actual speech to mismatch the training data; this is one root cause of the poor robustness of keyword detection systems (another main cause is the inability to handle unexpected input), and it leads to detection errors and performance degradation. Even if two speech segments are identical in content, they differ in duration because of different speaking rates, and the time gaps between phonemes differ as well, so the features of these time-varying, non-stationary speech signals can be completely different. Gaps exist between phonemes and between silence and the speech itself, and if the environment is quiet with little background noise, the main error of a keyword detection system comes from inaccurate endpoint detection. Therefore, to improve the accuracy of subsequent keyword detection, the application may segment the voice features using Voice Activity Detection (VAD). The main task of VAD is to accurately locate the start and end points of speech in noisy audio, because speech contains long silences and the silence must be separated from the actual speech; as an original processing step on the speech data, VAD is one of the key technologies in speech signal processing. Here, the voice state includes an activated state (speech present) and a silent state (speech absent).
For example, the voice state of each voice frame in the voice information may be determined, and the voice features are then segmented according to the voice frames in the activated state. That is, optionally, in some embodiments, the step of "segmenting the voice features based on the voice state of the voice to be recognized" may specifically include:
(31) collecting voice frame information of each frame in the voice to be recognized;
(32) detecting the voice state of each frame in the voice to be recognized;
(33) determining a voice frame with a voice state being an activated state as a target voice frame;
(34) and segmenting the voice features according to the determined target voice frame and the corresponding timestamp.
The voice frame information includes the collected voice frames and the timestamp corresponding to each voice frame. For example, suppose 100 temporally continuous voice frames of the voice to be recognized are collected, in which the 10th-20th, 25th-28th and 59th-79th frames are in the activated state. The voice frames in the activated state are determined as target voice frames, and the voice features are then segmented based on the timestamps of the target voice frames, yielding the voice features corresponding to the 10th-20th frames, the 25th-28th frames and the 59th-79th frames, plus the voice features corresponding to the remaining frames, which are in the silent state.
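The sketch below illustrates this segmentation; the energy gate that decides the activated/silent state is an assumption standing in for whatever VAD the system uses, and the frame indices follow the example above:

```python
import numpy as np

def segment_features(features: np.ndarray, frame_energy: np.ndarray,
                     energy_threshold: float = 0.01) -> list[np.ndarray]:
    """Split a (num_frames, dim) feature matrix into runs of activated frames."""
    active = frame_energy > energy_threshold        # per-frame voice state
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                               # a segment begins (e.g. frame 10)
        elif not is_active and start is not None:
            segments.append(features[start:i])      # a segment ends (exclusive index)
            start = None
    if start is not None:                           # speech runs to the last frame
        segments.append(features[start:])
    return segments
```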
To implement the cascade between the wake-up word detection scheme and the keyword detection scheme, the feature extraction and acoustic modeling parts may be combined with the wake-up module so that the two computing tasks are multiplexed. It should be noted that acoustic models for keyword detection generally use a discriminative training objective function, while acoustic models for wake-up word detection generally use a Cross Entropy objective function, so the two training objectives differ slightly. To realize the cascade scheme of the application, the model part may be multiplexed using the following three strategies, as shown in FIG. 2c: schemes (a) and (b) train the model with only one objective function to multiplex the wake-up word detection and keyword detection tasks, while scheme (c) uses two different objective functions only in the last mapping layer, which preserves the training advantages of both tasks while maximally sharing the computation of the model. After secondary confirmation of the wake-up result with this technical scheme, using only a small amount of training data in a complex live game scenario, the keyword detection F1 score can be improved from 64% to 73%, and accuracy is improved by 22% with the recall rate unchanged. Alongside this performance improvement, the real-time rate over the whole service is estimated to decrease by approximately 17%.
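A minimal sketch of strategy (c) follows: a single shared encoder with two task-specific mapping layers, here assuming CTC stands in for the discriminative keyword objective and cross-entropy for the wake-up decision; all sizes and the dummy batch are illustrative:

```python
import torch
import torch.nn as nn

class SharedTwoHeadModel(nn.Module):
    """Strategy (c): one shared encoder, two objective-specific mapping layers."""
    def __init__(self, feat_dim=40, hidden=256, num_phonemes=100, wake_classes=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.keyword_head = nn.Linear(hidden, num_phonemes + 1)  # CTC head (+blank)
        self.wake_head = nn.Linear(hidden, wake_classes)         # cross-entropy head

    def forward(self, feats):
        enc, _ = self.encoder(feats)                   # shared computation
        kw_logp = self.keyword_head(enc).log_softmax(-1)
        wake_logits = self.wake_head(enc.mean(dim=1))  # utterance-level wake decision
        return kw_logp, wake_logits

model = SharedTwoHeadModel()
ctc_loss, ce_loss = nn.CTCLoss(blank=100), nn.CrossEntropyLoss()

feats = torch.randn(4, 200, 40)                        # dummy batch
kw_logp, wake_logits = model(feats)
loss = (ctc_loss(kw_logp.transpose(0, 1),              # (frames, batch, classes)
                 torch.randint(0, 100, (4, 30)),
                 torch.full((4,), 200, dtype=torch.long),
                 torch.full((4,), 30, dtype=torch.long))
        + ce_loss(wake_logits, torch.randint(0, 2, (4,))))
loss.backward()
```

Only the two mapping layers differ between the tasks; everything up to the encoder output is computed once and shared.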
104. Detect keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result.
In the application, the keyword detection scheme multiplexes the acoustic model part of the wake-up word detection scheme; that is, a voice feature segmentation part and a keyword detection part are added on the basis of the wake-up word detection scheme, and the keyword detection part still uses the phoneme information generated in the wake-up word detection scheme.
Further, in order to convert human language into a form that a machine can recognize, text translation may be performed on the voice to be recognized based on the segmented voice features, and keyword detection is then performed on the translation result. That is, optionally, in some embodiments, the step of "detecting keywords in the voice to be recognized based on the phoneme information, the segmented voice features and preset keywords to obtain a keyword detection result" may specifically include:
(41) performing text translation on the voice to be recognized according to the segmented voice features to obtain a voice text of the voice to be recognized;
(42) determining words to be recognized corresponding to the phoneme information from the voice text;
(43) and selecting the words to be recognized matched with the preset keywords from the determined words to be recognized to obtain a target keyword set.
It should be noted that if text translation were performed on the voice to be recognized directly from the unsegmented voice features, the resulting voice text would inevitably include text corresponding to noise. Therefore, in the application, performing text translation based on the segmented voice features improves the accuracy of the text translation of the voice to be recognized, and thus the accuracy of the subsequent keyword detection.
A language model may be used to perform the text translation on the voice to be recognized. Language models are a basic component of many systems that attempt natural language processing tasks such as machine translation and language recognition. The language model may be an N-gram model, which represents the distribution of a language in a discrete space through counting statistics; it may also be a neural network model, which represents words in a distributed manner, namely as word vectors, mapping them into a continuous space and thereby effectively alleviating the data sparsity problem, while the neural network also has strong pattern recognition capability.
Specifically, a preset language model may be obtained and used to recognize the phoneme sequence, giving the voice text of the voice to be recognized. That is, optionally, in some embodiments, the step of "performing text translation on the voice to be recognized according to the segmented voice features to obtain the voice text of the voice to be recognized" may specifically include:
(51) generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information;
(52) and recognizing the phoneme sequence by using a preset language model to obtain a voice text of the voice to be recognized.
In the process of using the language model to recognize the second type of words (the keywords) in the voice to be recognized, the preset language model first recognizes the phoneme sequence corresponding to each segment, thereby obtaining the voice text corresponding to the segmented voice features and realizing the text translation of the segmented voice features.
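Full language-model decoding is involved, so the sketch below substitutes a toy greedy longest match over a hypothetical phoneme-to-word lexicon just to show the shape of steps (51)-(52): a per-segment phoneme sequence goes in, text comes out. The lexicon entries are invented for illustration:

```python
# hypothetical lexicon: phoneme sequence -> word (illustration only)
LEXICON = {
    ("k", "ai", "sh", "i"): "开始",    # "start"
    ("y", "ou", "x", "i"): "游戏",     # "game"
}

def decode_segment(phonemes: list[str], max_word_len: int = 4) -> list[str]:
    """Greedy longest-match decoding of one segment's phoneme sequence."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(min(max_word_len, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in LEXICON:
                words.append(LEXICON[candidate])
                i += length
                break
        else:
            i += 1   # no lexicon match at this position; skip the phoneme
    return words

print(decode_segment(["k", "ai", "sh", "i", "y", "ou", "x", "i"]))  # ['开始', '游戏']
```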
After the voice text is obtained, keywords in the voice text can be identified according to the preset keywords to obtain a keyword set. Specifically, the voice text may first be segmented into words, so that words to be recognized matching the preset keywords can subsequently be selected from it to obtain the target keyword set. Word segmentation belongs to natural language understanding and is a primary step of semantic understanding: the technique of precisely separating the words in a sentence. It underlies text classification, information retrieval, machine translation, speech input and text-to-speech, among other fields. Owing to the complexity and writing habits of Chinese, Chinese word segmentation is a difficult point of the technique. Word segmentation methods include dictionary-based, statistics-based and rule-based methods. The dictionary-based method, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against entries in a machine dictionary according to a certain strategy; if a character string is found in the dictionary, the match succeeds. By scanning direction, such methods divide into forward matching and reverse matching; by match length, into maximum matching and minimum matching. There are also many statistics-based word segmentation algorithms; common ones include the probability statistics algorithm based on mutual information and the Chinese word segmentation decision algorithm based on the degree of association, as illustrated in the sketch below.
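As an example of the dictionary-based ("mechanical") approach, here is a forward maximum matching sketch; the dictionary and sentence are invented for illustration:

```python
def forward_max_match(sentence: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: try the longest dictionary entry first."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:   # single characters pass through
                words.append(piece)
                i += length
                break
    return words

print(forward_max_match("开始游戏", {"开始", "游戏"}))   # ['开始', '游戏']
```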
Further, the keyword similarity between each determined word to be recognized and the preset keywords may be calculated, and the words to be recognized whose keyword similarity is greater than the preset value are determined as target keywords to obtain the target keyword set.
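A sketch of this matching step; the patent leaves the keyword similarity measure open, so a character-level ratio from Python's standard library is assumed:

```python
from difflib import SequenceMatcher

def keyword_similarity(word: str, keyword: str) -> float:
    """Similarity in [0, 1]; the concrete measure is an assumption."""
    return SequenceMatcher(None, word, keyword).ratio()

def select_target_keywords(words_to_recognize: list[str],
                           preset_keywords: list[str],
                           threshold: float = 0.8) -> set[str]:
    """Keep the words whose keyword similarity exceeds the preset value."""
    return {w for w in words_to_recognize
            if any(keyword_similarity(w, k) > threshold for k in preset_keywords)}
```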
105. Fuse the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded behind the wake-up word detection scheme, the recall of the first-stage detection (wake-up word detection) can be increased by lowering the wake-up word detection threshold, and the threshold is then raised when the wake-up word detection result and the keyword detection result are subsequently fused to generate the keyword recognition result of the voice to be recognized. For example, phonemes matching the phoneme information are first selected from the preset phoneme library to obtain target phonemes, and the wake-up word set is generated based on the obtained target phonemes; in this stage the matching value (the wake-up word detection threshold) may be set to 60%, i.e., a phoneme in the preset phoneme library and a target phoneme are considered matched when their matching value exceeds 60%. In the subsequent fusion, the matching value is set to, for example, 80%: wake-up words whose matching value is greater than 80% are selected from the wake-up word set and added to the keyword set corresponding to the keyword detection result, yielding the keyword recognition result of the voice to be recognized.
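The two-threshold fusion can be sketched as follows; the scores, the 60%/80% thresholds and the example words are illustrative, matching the numbers used above:

```python
def fuse_results(wake_word_scores: dict[str, float], keyword_set: set[str],
                 fusion_threshold: float = 0.80) -> set[str]:
    """First-stage detection admitted wake-up words at a low threshold (0.60)
    to raise recall; fusion re-checks them at the higher threshold and merges
    the survivors into the keyword set."""
    confirmed = {w for w, s in wake_word_scores.items() if s > fusion_threshold}
    return keyword_set | confirmed

# usage: stage-one wake-up words with matching values, stage-two keywords
result = fuse_results({"小腾小腾": 0.86, "小腾": 0.65}, {"开始游戏"})
print(result)   # {'小腾小腾', '开始游戏'}; the 0.65 candidate is filtered out
```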
In the application, feature extraction is performed on the voice to be recognized to obtain voice features, and the wake-up word in the voice to be recognized is detected according to the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the voice features are segmented based on the voice state of the voice to be recognized, the acoustic model used for wake-up word detection is called, and keywords in the voice to be recognized are detected based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result; the wake-up word detection result and the keyword detection result are then fused to obtain the keyword recognition result of the voice to be recognized. Because the phoneme information of the wake-up word is reused when detecting keywords, the efficiency of recognizing keywords in the voice to be recognized is improved; meanwhile, the two types of words (keywords and wake-up words) are recognized in two separate stages, so keyword detection accuracy is ensured while keyword detection efficiency is improved.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the keyword detection apparatus is specifically integrated in a server as an example.
Referring to fig. 2a, a keyword detection method may specifically include the following processes:
201. The server performs feature extraction on the voice to be recognized to obtain voice features.
For example, the server may receive the voice to be recognized collected by a microphone of a terminal device (such as a mobile phone or a laptop computer), download it from a network database through wired or wireless communication, or obtain it by accessing a local database, depending on the actual situation. After obtaining the voice to be recognized, the server may extract its features using a deep learning network.
202. The server detects the wake-up word in the voice to be recognized according to the voice features to obtain a wake-up word detection result.
The server processes the voice characteristics by using a preset acoustic model to obtain the phoneme information corresponding to each word to be recognized in the voice to be recognized, and then selects phonemes matched with the phoneme information from a preset phoneme library to obtain a wake-up word detection result.
203. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the server segments the voice features based on the voice state of the voice to be recognized.
In the application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme, and keyword detection is performed only when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word. For example, the server may determine the voice state of each voice frame in the voice information and then segment the voice features according to the voice frames in the activated state. Specifically, the server may collect the voice frame information of each frame in the voice to be recognized, detect the voice state of each frame, determine the voice frames whose voice state is the activated state as target voice frames, and finally segment the voice features according to the determined target voice frames and their corresponding timestamps.
204. The server detects keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result.
For example, the server may obtain phoneme information of a word to be recognized in the voice to be recognized according to the acoustic model, and then detect keywords in the voice to be recognized based on the phoneme information, the segmented voice features and the preset keywords to obtain a keyword detection result.
205. The server fuses the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded behind the wake-up word detection scheme, the recall of the first-stage detection (wake-up word detection) can be increased by lowering the wake-up word detection threshold, and the threshold is then raised in the subsequent fusion of the wake-up word detection result and the keyword detection result to generate the keyword recognition result of the voice to be recognized.
To facilitate further understanding of the keyword detection scheme of the application, the application provides a keyword detection system; please refer to FIG. 2b, which is a flowchart of the keyword detection system. The keyword detection system includes a wake-up word detection module, a keyword recognition module and a word fusion module. The wake-up word detection module includes a feature extraction unit, an acoustic recognition unit and a decoding unit, and the keyword recognition module includes a silence detection unit and a decoding unit. In the keyword detection task, after the feature extraction unit extracts the features of the voice to be recognized, the voice features are sent to the acoustic recognition unit, which learns the mapping relation between the voice features and the phoneme information; the decoding unit in the wake-up word detection module then recognizes the pronunciation sequence of the wake-up word to obtain the wake-up word set. In addition, the silence detection unit segments the voice features based on the voice state of the voice to be recognized, the decoding unit in the keyword recognition module recognizes keywords in the voice to be recognized according to the segmented voice features, the phoneme information and the preset keywords to obtain the keyword set, and finally the word fusion module fuses the wake-up word set and the keyword set to obtain the keyword recognition result of the voice to be recognized.
In this embodiment, after the server extracts features from the voice to be recognized to obtain voice features, it detects the wake-up word in the voice to be recognized according to the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the server segments the voice features based on the voice state of the voice to be recognized, calls the acoustic model used for wake-up word detection, and detects keywords in the voice to be recognized based on the segmented voice features, the acoustic model and the preset keywords to obtain a keyword detection result. Finally, the server fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized. Because the server reuses the phoneme information of the wake-up word when detecting keywords, the efficiency of recognizing keywords in the voice to be recognized is improved; meanwhile, the two types of words (keywords and wake-up words) are recognized in two separate stages, so keyword detection accuracy is ensured while keyword detection efficiency is improved.
In order to better implement the keyword detection method of the application, the application further provides a keyword detection device (recognition device for short) based on the above keyword detection method. The meanings of the terms are the same as in the keyword detection method above, and for implementation details reference may be made to the description in the method embodiments.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of the keyword detection device provided in the application. The recognition device may include an extraction module 301, a first detection module 302, a segmentation module 303, a second detection module 304 and a fusion module 305, which may specifically be as follows:
the extraction module 301 is configured to perform feature extraction on the speech to be recognized to obtain speech features.
For example, the extraction module 301 may employ a deep learning network to extract features of the speech to be recognized.
The first detection module 302 is configured to detect the wake-up word in the voice to be recognized according to a preset acoustic model and the voice features to obtain a wake-up word detection result.
The wake-up word detection result includes the detected wake-up word and the phoneme information of the wake-up word. For example, the first detection module 302 processes the voice features with the preset acoustic model to obtain the phoneme information corresponding to each word to be recognized in the voice to be recognized, and then selects phonemes matching the phoneme information from a preset phoneme library to obtain the wake-up word detection result.
Optionally, in some embodiments, the first detecting module 302 may specifically include:
the second acquisition unit is used for acquiring a preset acoustic model and a preset phoneme library;
the recognition unit is used for recognizing the voice features by adopting a preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the voice to be recognized;
the selecting unit is used for selecting phonemes matched with the phoneme information from a preset phoneme library to obtain target phonemes;
and the generating unit is used for generating a wake-up word set based on the obtained target phoneme.
Optionally, in some embodiments, the selection unit may specifically be configured to: extracting the phonemes to be recognized corresponding to each word to be recognized in the speech to be recognized from the phoneme information, calculating the similarity between the extracted phonemes to be recognized and each candidate phoneme of the preset phoneme library to obtain phoneme similarity, and obtaining the target phoneme from the candidate phonemes with the phoneme similarity larger than the preset phoneme similarity.
The segmentation module 303 is configured to segment the voice features based on the voice state of the voice to be recognized when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word.
In the application, in order to improve the efficiency of subsequent keyword detection, the keyword detection scheme is cascaded with the wake-up word detection scheme, and keyword detection is performed only when the wake-up word detection result indicates that the voice to be recognized contains the wake-up word. For example, the segmentation module 303 may determine the voice state of each voice frame in the voice information and then segment the voice features according to the voice frames in the activated state. That is, optionally, in some embodiments, the segmentation module 303 may be specifically configured to: collect the voice frame information of each frame in the voice to be recognized, detect the voice state of each frame in the voice to be recognized, determine the voice frames whose voice state is the activated state as target voice frames, and segment the voice features according to the determined target voice frames and their corresponding timestamps.
The voice frame information comprises a collected voice frame and a time stamp corresponding to the voice frame.
The second detection module 304 is configured to detect a keyword in the speech to be recognized based on the segmented speech feature, the preset acoustic model, and the preset keyword, so as to obtain a keyword detection result.
For example, the second detection module 304 may obtain phoneme information of a word to be recognized in the voice to be recognized according to a preset acoustic model, and then the second detection module 304 detects keywords in the voice to be recognized based on the phoneme information, the segmented voice features and the preset keywords to obtain a keyword detection result.
That is, optionally, in some embodiments, the second detection module 304 may specifically include:
the first acquisition unit is used for acquiring the phoneme information of a word to be recognized in the voice to be recognized according to a preset acoustic model;
and the detection unit is used for detecting the keywords in the speech to be recognized based on the phoneme information, the segmented speech features and the preset keywords to obtain a keyword detection result.
Optionally, in some embodiments, the detection unit may specifically include:
the translation subunit is used for performing text translation on the voice to be recognized according to the segmented voice features to obtain a voice text of the voice to be recognized;
the determining subunit is used for determining a word to be recognized corresponding to the phoneme information from the voice text;
and the selecting subunit is used for selecting the words to be recognized matched with the preset keywords from the determined words to be recognized to obtain a target keyword set.
Optionally, in some embodiments, the selecting subunit may specifically be configured to: and calculating the similarity of the keywords between the determined word to be recognized and the preset keywords, and determining the word to be recognized with the similarity of the keywords larger than the preset value as the target keywords to obtain a target keyword set.
Optionally, in some embodiments, the translation subunit may be specifically configured to: and generating a phoneme sequence corresponding to each segmented voice feature according to the phoneme information, and identifying the phoneme sequence by using a preset language model to obtain a voice text of the voice to be identified.
And the fusion module 305 is configured to fuse the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the voice to be recognized.
After the keyword detection scheme is cascaded behind the wake-up word detection scheme, the recall of the first-stage detection (wake-up word detection) can be increased by lowering the wake-up word detection threshold, and the threshold is then raised in the subsequent fusion of the wake-up word detection result and the keyword detection result to generate the keyword recognition result of the voice to be recognized.
After the extraction module 301 extracts features from the voice to be recognized to obtain voice features, the first detection module 302 detects the wake-up word in the voice to be recognized according to the voice features to obtain a wake-up word detection result. When the wake-up word detection result indicates that the voice to be recognized contains the wake-up word, the segmentation module 303 segments the voice features based on the voice state of the voice to be recognized, and the second detection module 304 then detects keywords in the voice to be recognized based on the segmented voice features, the preset acoustic model and the preset keywords to obtain a keyword detection result. Finally, the fusion module 305 fuses the wake-up word detection result and the keyword detection result to obtain the keyword recognition result of the voice to be recognized. Because the keyword detection device provided by the application reuses the phoneme information of the wake-up word when detecting keywords in the voice to be recognized, the efficiency of recognizing keywords is improved; meanwhile, the two types of words (keywords and wake-up words) are recognized in two separate stages, so keyword detection accuracy is ensured while keyword detection efficiency is improved.
In addition, the present application also provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device related to the present application, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, which may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail here. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing the following functions:
performing feature extraction on the speech to be recognized to obtain speech features; detecting the wake-up word in the speech to be recognized according to a preset acoustic model and the speech features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the speech to be recognized contains the wake-up word, segmenting the speech features based on the speech state of the speech to be recognized; detecting the keywords in the speech to be recognized based on the segmented speech features, the preset acoustic model, and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the speech to be recognized.
The above operations can be implemented as described in the foregoing embodiments and are not detailed here.
In summary, feature extraction is performed on the speech to be recognized to obtain speech features; the wake-up word in the speech to be recognized is detected according to a preset acoustic model and the speech features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the speech to be recognized contains the wake-up word, the speech features are segmented based on the speech state of the speech to be recognized; the keywords in the speech to be recognized are detected based on the segmented speech features, the preset acoustic model, and preset keywords to obtain a keyword detection result; and finally the wake-up word detection result and the keyword detection result are fused to obtain a keyword recognition result of the speech to be recognized. Because the phoneme information of the wake-up word is reused when detecting the keywords, the efficiency of recognizing keywords in the speech to be recognized can be improved; meanwhile, the different types of words in the speech to be recognized (keywords and wake-up words) are recognized in two separate stages, so the accuracy of keyword detection is guaranteed while its efficiency is improved.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by associated hardware controlled by instructions, where the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium in which a plurality of instructions are stored, where the instructions can be loaded by a processor to perform the steps of any of the keyword detection methods provided by the present application. For example, the instructions may perform the following steps:
performing feature extraction on the speech to be recognized to obtain speech features; detecting the wake-up word in the speech to be recognized according to the speech features to obtain a wake-up word detection result; when the wake-up word detection result indicates that the speech to be recognized contains the wake-up word, segmenting the speech features based on the speech state of the speech to be recognized; detecting the keywords in the speech to be recognized based on the segmented speech features, a preset acoustic model, and preset keywords to obtain a keyword detection result; and fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the speech to be recognized.
The above operations can be implemented as described in the foregoing embodiments and are not detailed here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
Since the instructions stored in the storage medium can execute the steps of any keyword detection method provided by the present application, they can achieve the beneficial effects achievable by any keyword detection method provided by the present application; see the foregoing embodiments for details, which are not repeated here.
According to an aspect of the application, a computer program product or a computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations described above.
The keyword detection method and apparatus, electronic device, and storage medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. A keyword detection method, comprising:
performing feature extraction on speech to be recognized to obtain speech features;
detecting a wake-up word in the speech to be recognized according to a preset acoustic model and the speech features to obtain a wake-up word detection result;
when the wake-up word detection result indicates that the speech to be recognized contains the wake-up word, segmenting the speech features based on a speech state of the speech to be recognized;
detecting keywords in the speech to be recognized based on the segmented speech features, the preset acoustic model, and preset keywords to obtain a keyword detection result; and
fusing the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the speech to be recognized.
2. The method according to claim 1, wherein the detecting keywords in the speech to be recognized based on the segmented speech features, the preset acoustic model, and preset keywords to obtain a keyword detection result comprises:
acquiring phoneme information of words to be recognized in the speech to be recognized according to the preset acoustic model; and
detecting the keywords in the speech to be recognized based on the phoneme information, the segmented speech features, and the preset keywords to obtain the keyword detection result.
3. The method according to claim 2, wherein the detecting the keywords in the speech to be recognized based on the phoneme information, the segmented speech features, and the preset keywords to obtain the keyword detection result comprises:
performing text transcription on the speech to be recognized according to the segmented speech features to obtain a speech text of the speech to be recognized;
determining, from the speech text, the words to be recognized corresponding to the phoneme information; and
selecting, from the determined words to be recognized, the words to be recognized that match the preset keywords to obtain a target keyword set.
4. The method according to claim 3, wherein the selecting, from the determined words to be recognized, the words to be recognized that match the preset keywords to obtain a target keyword set comprises:
calculating a keyword similarity between each determined word to be recognized and the preset keywords; and
determining the words to be recognized whose keyword similarity is greater than a preset value as target keywords to obtain the target keyword set.
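By way of illustration only, the following is a minimal sketch of the similarity-and-threshold selection described in claim 4. String similarity (difflib's SequenceMatcher) stands in for whatever similarity measure the method actually uses, and the 0.8 threshold is an assumption, not the claimed preset value.

```python
from difflib import SequenceMatcher

def keyword_similarity(word, keyword):
    # Ratio in [0, 1]; 1.0 means the word matches the keyword exactly.
    return SequenceMatcher(None, word, keyword).ratio()

def select_target_keywords(words, keywords, threshold=0.8):
    # Keep words whose similarity to any preset keyword passes the threshold.
    targets = set()
    for word in words:
        for kw in keywords:
            if keyword_similarity(word, kw) > threshold:
                targets.add(word)
    return targets

print(select_target_keywords(["weathr", "music"], ["weather"]))
# {'weathr'} -- a near match passes the preset similarity threshold
```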
5. The method according to claim 3, wherein the performing text transcription on the speech to be recognized according to the segmented speech features to obtain the speech text of the speech to be recognized comprises:
generating, according to the phoneme information, a phoneme sequence corresponding to each segmented speech feature; and
recognizing the phoneme sequences by using a preset language model to obtain the speech text of the speech to be recognized.
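As an illustrative sketch of the phoneme-to-text step in claim 5, the toy example below maps phoneme sequences to candidate words through a hand-made lexicon and scores the candidates with a tiny unigram "language model". Both data structures are invented for illustration; a real system would decode with a trained language model over a lattice of hypotheses.

```python
# Assumed toy lexicon: phoneme tuples -> candidate words (with homophones).
LEXICON = {
    ("HH", "EH", "L", "OW"): ["hello"],
    ("P", "L", "EY"): ["play", "plei"],
    ("M", "Y", "UW", "Z", "IH", "K"): ["music"],
}

# Assumed toy unigram language-model scores.
LM_SCORE = {"hello": 0.9, "play": 0.8, "plei": 0.1, "music": 0.85}

def decode(phoneme_sequences):
    # For each segment's phoneme sequence, pick the candidate word that
    # the language model scores highest.
    words = []
    for seq in phoneme_sequences:
        candidates = LEXICON.get(tuple(seq), [])
        if candidates:
            words.append(max(candidates, key=lambda w: LM_SCORE.get(w, 0.0)))
    return " ".join(words)

print(decode([["HH", "EH", "L", "OW"], ["P", "L", "EY"]]))  # "hello play"
```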
6. The method according to any one of claims 1 to 5, wherein the detecting the wake-up word in the speech to be recognized according to the speech features to obtain the wake-up word detection result comprises:
acquiring the preset acoustic model and a preset phoneme library;
recognizing the speech features by using the preset acoustic model to obtain phoneme information corresponding to each word to be recognized in the speech to be recognized;
selecting, from the preset phoneme library, phonemes that match the phoneme information to obtain target phonemes; and
generating a wake-up word set based on the obtained target phonemes.
7. The method according to claim 6, wherein the selecting, from the preset phoneme library, phonemes that match the phoneme information to obtain target phonemes comprises:
extracting, from the phoneme information, the phonemes to be recognized corresponding to each word to be recognized in the speech to be recognized;
calculating a similarity between each extracted phoneme to be recognized and each phoneme in the preset phoneme library to obtain phoneme similarities; and
determining the phonemes whose phoneme similarity is greater than a preset phoneme similarity as the target phonemes.
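For illustration, the following sketch treats the phoneme matching in claim 7 as cosine similarity between fixed-length embedding vectors. The phoneme library, the embedding vectors, and the 0.7 threshold are all assumptions made for the example; the claim does not prescribe a particular similarity measure.

```python
import math

# Assumed toy phoneme library: phoneme name -> embedding vector.
PHONEME_LIBRARY = {
    "AA": [0.9, 0.1, 0.0],
    "EH": [0.1, 0.8, 0.2],
    "SH": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_phonemes(phonemes_to_recognize, threshold=0.7):
    # Keep library phonemes whose similarity to an input phoneme vector
    # exceeds the preset phoneme similarity threshold.
    targets = []
    for vec in phonemes_to_recognize:
        for name, ref in PHONEME_LIBRARY.items():
            if cosine(vec, ref) > threshold:
                targets.append(name)
    return targets

print(match_phonemes([[0.85, 0.15, 0.05]]))  # ['AA']
```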
8. The method according to any one of claims 1 to 5, wherein the segmenting the speech features based on the speech state of the speech to be recognized comprises:
collecting speech frame information of each frame in the speech to be recognized, the speech frame information including the collected speech frames and the timestamps corresponding to the speech frames;
detecting the speech state of each frame in the speech to be recognized;
determining the speech frames whose speech state is an active state as target speech frames; and
segmenting the speech features according to the determined target speech frames and their corresponding timestamps.
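The following is an energy-based sketch of the segmentation in claim 8. Using frame energy as the "speech state" detector and the 0.01 activity threshold are illustrative assumptions, not the method's prescribed detector; the point is that consecutive active frames, together with their timestamps, form the segments used to cut the speech features.

```python
def frame_energy(frame):
    return sum(x * x for x in frame) / len(frame)

def segment_active_frames(frames, timestamps, threshold=0.01):
    # Group consecutive active frames into segments, each carrying the
    # timestamps needed to cut the corresponding speech features.
    segments, current = [], []
    for frame, ts in zip(frames, timestamps):
        if frame_energy(frame) > threshold:   # active state
            current.append((frame, ts))
        elif current:                         # state dropped: close segment
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

frames = [[0.0] * 4, [0.3, 0.2, 0.25, 0.3], [0.4] * 4, [0.0] * 4]
print(len(segment_active_frames(frames, [0.00, 0.01, 0.02, 0.03])))  # 1
```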
9. A keyword detection apparatus, comprising:
an extraction module configured to perform feature extraction on speech to be recognized to obtain speech features;
a first detection module configured to detect a wake-up word in the speech to be recognized according to the speech features to obtain a wake-up word detection result;
a segmentation module configured to segment the speech features based on a speech state of the speech to be recognized when the wake-up word detection result indicates that the speech to be recognized contains the wake-up word;
a calling module configured to call an acoustic model corresponding to the speech to be recognized during the wake-up word detection;
a second detection module configured to detect keywords in the speech to be recognized based on the segmented speech features, the preset acoustic model, and preset keywords to obtain a keyword detection result; and
a fusion module configured to fuse the wake-up word detection result and the keyword detection result to obtain a keyword recognition result of the speech to be recognized.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the keyword detection method according to any one of claims 1 to 8.
11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the keyword detection method according to any one of claims 1 to 8.
CN202010915963.XA 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium Active CN112151015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010915963.XA CN112151015B (en) 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112151015A true CN112151015A (en) 2020-12-29
CN112151015B CN112151015B (en) 2024-03-12

Family

ID=73891042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010915963.XA Active CN112151015B (en) 2020-09-03 2020-09-03 Keyword detection method, keyword detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112151015B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111083065B (en) * 2019-12-23 2022-05-20 珠海格力电器股份有限公司 Method for preventing input command from being blocked, storage medium and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005954A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd. Wake-on-voice method, terminal and storage medium
CN110473536A (en) * 2019-08-20 2019-11-19 北京声智科技有限公司 A kind of awakening method, device and smart machine
CN110517670A (en) * 2019-08-28 2019-11-29 苏州思必驰信息科技有限公司 Promote the method and apparatus for waking up performance
CN110570840A (en) * 2019-09-12 2019-12-13 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN111312222A (en) * 2020-02-13 2020-06-19 北京声智科技有限公司 Awakening and voice recognition model training method and device
CN111508493A (en) * 2020-04-20 2020-08-07 Oppo广东移动通信有限公司 Voice wake-up method and device, electronic equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113035231B (en) * 2021-03-18 2024-01-09 三星(中国)半导体有限公司 Keyword detection method and device
CN113641795A (en) * 2021-08-20 2021-11-12 上海明略人工智能(集团)有限公司 Method and device for dialectical statistics, electronic equipment and storage medium
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN113888846A (en) * 2021-09-27 2022-01-04 深圳市研色科技有限公司 Method and device for reminding driving in advance
CN113888846B (en) * 2021-09-27 2023-01-24 深圳市研色科技有限公司 Method and device for reminding driving in advance
CN114596840A (en) * 2022-03-04 2022-06-07 腾讯科技(深圳)有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN115910045A (en) * 2023-03-10 2023-04-04 北京建筑大学 Model training method and recognition method for voice awakening words

Also Published As

Publication number Publication date
CN112151015B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN107315737B (en) Semantic logic processing method and system
CN112151015B (en) Keyword detection method, keyword detection device, electronic equipment and storage medium
Gu et al. Speech intention classification with multimodal deep learning
CN109686383B (en) Voice analysis method, device and storage medium
CN111243599B (en) Speech recognition model construction method, device, medium and electronic equipment
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN118043885A (en) Contrast twin network for semi-supervised speech recognition
CN114120985A (en) Pacifying interaction method, system and equipment of intelligent voice terminal and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN105869622B (en) Chinese hot word detection method and device
JP2004094257A (en) Method and apparatus for generating question of decision tree for speech processing
CN113129895B (en) Voice detection processing system
CN113823265A (en) Voice recognition method and device and computer equipment
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN111968646A (en) Voice recognition method and device
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN112820274B (en) Voice information recognition correction method and system
WO2024076365A1 (en) Accelerating speaker diarization with multi-stage clustering
Wu et al. Interruption point detection of spontaneous speech using inter-syllable boundary-based prosodic features

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40036255

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant