WO2018219023A1

WO2018219023A1 - Speech keyword identification method and device, terminal and server

Info

Publication number: WO2018219023A1
Application number: PCT/CN2018/079769
Authority: WO
Inventors: 王珺; 黄志恒; 于蒙; 蒲松柏
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-05-27
Filing date: 2018-03-21
Publication date: 2018-12-06
Also published as: CN107230475A; CN110349572B; CN110349572A; CN110444199A; TWI690919B; CN110444199B; CN107230475B; TW201832221A

Abstract

A speech keyword identification method and device, a terminal and a server, the method comprising: selecting a frame from a first frame sequence which forms a first speech and determining the same to be a first target frame (S201); selecting a keyword from a keyword sequence comprised in speech keywords and determining the same to be a target keyword (S202); determining whether an implicit feature vector of the target frame is matched successfully with a keyword template corresponding to the target keyword (S203); and if determined that the implicit feature vector of the frame in the first speech is matched successfully with the keyword template corresponding to each keyword in the keyword sequence one by one, determining that the first speech comprises the speech keyword therein (S204). The described method effectively carries out identification of the speech keywords in the first speech, and furthermore, facilitates an electronic device which uses speech awakening technology to automatically activate a processing module corresponding to the speech keyword when identifying that the first speech comprises the speech keyword therein.

Description

Voice keyword recognition method, device, terminal and server

This application claims priority to Chinese patent application No. 201710391388.6, filed with the Chinese Patent Office on May 27, 2017 and entitled "a voice keyword recognition method, device, terminal and server", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of voice recognition technology, and in particular, to a voice keyword recognition method, device, terminal, and server.

Background technique

With the development of technology, voice wake-up technology is used more and more widely in electronic devices. It greatly facilitates the user's operation of an electronic device: without any manual interaction, the user can activate the corresponding processing module in the electronic device by speaking a voice keyword.

For example, Apple's mobile phones use the keyword "siri" as the voice keyword for activating the voice dialogue assistant function. When an Apple mobile phone detects that the voice input by the user includes the keyword "siri", it automatically activates the voice dialogue assistant function in the phone.

In view of the above, providing a voice keyword recognition method, device, terminal and server that can recognize a voice keyword in a voice is crucial for the development of voice wake-up technology.

Summary of the invention

In view of this, an embodiment of the present invention provides a voice keyword recognition method, apparatus, terminal, and server to implement voice keyword recognition in voice.

To achieve the above objective, the embodiment of the present invention provides the following technical solutions:

A voice keyword recognition method includes:

Selecting a first target frame from a first frame sequence constituting a first voice;

Selecting a keyword from a keyword sequence as a target keyword, wherein the keyword sequence belongs to a voice keyword;

If the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determining whether, for the keyword template corresponding to each keyword in the keyword sequence, a frame in the first voice whose hidden layer feature vector matches that template has been determined, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

If it is determined that, for the keyword templates corresponding to the keywords in the keyword sequence one by one, a frame in the first voice whose hidden layer feature vector is successfully matched has been determined, determining that the first voice includes the voice keyword.

A voice keyword recognition device includes:

a first target frame determining unit, configured to select a first target frame from a first frame sequence constituting the first voice;

a target keyword determining unit, configured to select a keyword from the keyword sequence as the target keyword, wherein the keyword sequence belongs to the voice keyword;

a matching unit, configured to determine, when the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, whether, for the keyword template corresponding to each keyword in the keyword sequence, a frame in the first voice whose hidden layer feature vector matches that template has been determined, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;

an identifying unit, configured to determine that the first voice includes the voice keyword if, for the keyword templates corresponding to the keywords in the keyword sequence one by one, a frame in the first voice whose hidden layer feature vector is successfully matched has been determined.

A terminal includes a memory for storing a program, and a processor calling the program, the program for:

A voice keyword recognition server includes a memory and a processor, the memory is used to store a program, and the processor calls the program, the program is used to:

A computer readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the first aspect.

A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.

The embodiments of the present invention disclose a voice keyword recognition method, device, terminal and server. A first target frame is determined from a first frame sequence constituting a first voice, and a target keyword is determined from a keyword sequence included in a voice keyword. When the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword (the keyword template indicating a hidden layer feature vector of a second target frame in a second voice including the target keyword), and a successfully matched frame in the first voice has been determined for the keyword template corresponding to each keyword in the keyword sequence one by one, it is determined that the first voice includes the voice keyword. In this way, the voice keyword in the first voice is effectively recognized. Further, this makes it convenient for an electronic device using voice wake-up technology to automatically activate the processing module corresponding to the voice keyword when it recognizes that the first voice includes the voice keyword.

Description of the drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without any creative work.

FIG. 1 is a schematic structural diagram of a voice keyword recognition server according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for identifying a voice keyword according to an embodiment of the present application;

FIG. 3 is a flowchart of another method for identifying a voice keyword according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for selecting a frame from a first frame sequence constituting a first voice to be determined as a first target frame according to an embodiment of the present disclosure;

FIG. 5 is a flowchart of a method for selecting a keyword from a keyword sequence included in a voice keyword to be determined as a target keyword according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of a method for generating a keyword template corresponding to a target keyword according to an embodiment of the present disclosure;

FIG. 7 is a flowchart of a method for selecting a frame with the highest degree of similarity with a target keyword as a second target frame from a second frame sequence based on a final layer feature vector corresponding to each frame according to an embodiment of the present application;

FIG. 8 is a flowchart of another voice keyword recognition method according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a keyword template generating unit according to an embodiment of the present disclosure;

FIG. 11 is a schematic structural diagram of a second target frame determining unit according to an embodiment of the present disclosure.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

Example:

The embodiment of the present application provides a voice keyword identification method, which is applied to a terminal or a server.

In the embodiment of the present application, optionally, the terminal is an electronic device, for example, a mobile terminal or a desktop computer. The above is only an optional form of the terminal provided by the embodiment of the present application. The inventor may set the specific form of the terminal as required, which is not limited herein.

Optionally, the function of the server to which the voice keyword recognition method provided by the embodiment of the present application is applied (referred to herein as a voice keyword recognition server) may be implemented by a single server or by a server cluster composed of multiple servers, which is not limited herein.

Taking a server as an example, a schematic diagram of a voice keyword recognition server provided by an embodiment of the present application is shown in FIG. 1 . The voice keyword recognition server includes a processor 11 and a memory 12.

The processor 11, the memory 12, and the communication interface 13 complete communication with each other via the communication bus 14.

Optionally, the communication interface 13 may be an interface of the communication module, such as an interface of a Global System for Mobile Communication (GSM) module. The processor 11 is configured to execute a program.

The processor 11 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.

The memory 12 is used to store a program.

The program can include program code, the program code including computer operating instructions. In the embodiment of the present invention, the program may include a program corresponding to the user interface editor described above.

The memory 12 may include a high-speed random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk memory.

Specifically, the program can be used to:

Select one frame from the first frame sequence constituting the first voice as the first target frame;

Select a keyword from the keyword sequence included in the voice keyword as the target keyword;

Determine whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, wherein the keyword template indicates the hidden layer feature vector of the second target frame in the second voice including the target keyword;

If the matching is successful, and a successfully matched frame in the first voice has been determined for the keyword template corresponding to each keyword in the keyword sequence one by one, determine that the first voice includes the voice keyword.

Correspondingly, the structure of a terminal provided by the embodiment of the present application includes at least the structure of the voice keyword recognition server shown in FIG. 1 above. For the structure of the terminal, refer to the above description of the structure of the voice keyword recognition server; details are not repeated here.

Correspondingly, the embodiment of the present application provides a flowchart of a voice keyword recognition method, which is shown in FIG. 2 .

As shown in Figure 2, the method includes:

S201. Select a frame from the first frame sequence constituting the first voice as the first target frame.

S202. Select a keyword from the keyword sequence included in the voice keyword as the target keyword.

S203. Determine whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, and the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; If the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, step S204 is performed.

Optionally, a voice model is preset, and a second voice including the target keyword (the second voice consisting of a second frame sequence) is input into the voice model to obtain the hidden layer feature vector of the second target frame in the second voice. The keyword template corresponding to the target keyword indicates the obtained hidden layer feature vector.

Optionally, the speech model is generated based on a Long Short-Term Memory (LSTM) and a Connectionist Temporal Classification (CTC).

The above is only an optional manner for generating a voice model provided by the embodiment of the present application. The inventor can arbitrarily set the specific generation process of the voice model according to his own needs, which is not limited herein.

Optionally, the first voice, which consists of the first frame sequence, is input into the voice model to obtain the hidden layer feature vector corresponding to the first target frame in the first voice.

Correspondingly, the hidden layer feature vector of the first target frame is matched against the keyword template corresponding to the target keyword to determine whether the two are successfully matched; if the matching is successful, step S204 is performed.

In the embodiment of the present application, optionally, determining whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword includes: calculating the cosine distance between the hidden layer feature vector of the first target frame and the keyword template corresponding to the target keyword; if the calculated cosine distance satisfies a preset value, determining that the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword; and if the calculated cosine distance does not satisfy the preset value, determining that the matching has failed.
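The matching test described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes that "the cosine distance satisfies the preset value" means that the cosine similarity of the two vectors is at least a threshold, and the function names and the value 0.8 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def matches_template(hidden_vector, keyword_template, min_similarity=0.8):
    """Assumed reading of the test: match when similarity >= preset value.

    min_similarity is a hypothetical preset value; the source gives none.
    """
    return cosine_similarity(hidden_vector, keyword_template) >= min_similarity
```

If the preset value is instead meant as an upper bound on a distance (for example, one minus the similarity), only the comparison in `matches_template` would change.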

S204. If, for the keyword template corresponding to each keyword in the keyword sequence one by one, a frame in the first voice whose hidden layer feature vector is successfully matched has been determined, determine that the first voice includes the voice keyword.

Optionally, if it is determined in step S203 that the matching is successful, it is then determined whether a successfully matched frame in the first voice has been determined, one by one, for the keyword template corresponding to each keyword in the keyword sequence; if so, it is determined that the first voice includes the voice keyword.

FIG. 3 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.

As shown in FIG. 3, the method includes:

S301. Select a first target frame from a first frame sequence that constitutes the first voice.

S302. Select a keyword from a sequence of keywords included in the voice keyword to determine the target keyword.

S303. Determine whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, and the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; If the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, step S304 is performed; if the matching is unsuccessful, the process returns to step S301;

S304. Determine whether, for the keyword template corresponding to each keyword in the keyword sequence one by one, a frame in the first voice whose hidden layer feature vector is successfully matched has been determined. If yes, step S305 is performed; otherwise, the process returns to step S301.

Optionally, the condition that a successfully matched frame in the first voice has been determined one by one for the keyword template corresponding to each keyword in the keyword sequence includes: for the keyword template corresponding to each keyword in the keyword sequence, a frame in the first voice whose hidden layer feature vector is successfully matched has been determined; and sorting the keywords whose keyword templates were successfully matched, in the order in which the matches succeeded, yields the keyword sequence.

S305. Determine that the first voice includes the voice keyword.
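The loop formed by steps S301 to S305 can be sketched as below. This is a simplified illustration under stated assumptions: frames are visited in order, the target keyword is always the one following the last matched keyword (the retry threshold of FIG. 5 is omitted), and `is_match` is a hypothetical caller-supplied function standing in for the test of step S303.

```python
def contains_voice_keyword(frame_vectors, templates, is_match):
    """Return True if every keyword template is matched, in order, by a frame.

    frame_vectors: hidden layer feature vectors of the first frame sequence.
    templates: keyword templates in keyword-sequence order.
    is_match: hypothetical callable(vector, template) -> bool (step S303).
    """
    if not templates:
        return False  # nothing to recognise
    target = 0  # index of the current target keyword (S302)
    for vector in frame_vectors:  # S301: next frame not yet examined
        if is_match(vector, templates[target]):  # S303
            target += 1
            if target == len(templates):  # S304: all templates matched
                return True  # S305
        # on a failed match, or when keywords remain, go back to S301
    return False


# Usage with plain numbers and equality as a stand-in matcher.
found = contains_voice_keyword([0, 1, 1, 2, 3], [1, 2, 3], lambda v, t: v == t)
```

Because the target index only moves forward after a successful match, the keywords are necessarily matched in the order of the keyword sequence, which is the ordering condition stated above.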

To facilitate understanding of the voice keyword recognition method provided by the embodiment of the present application, a flowchart of a method for determining a frame from the first frame sequence constituting the first voice as the first target frame is provided; please refer to FIG. 4.

As shown in FIG. 4, the method includes:

S401. Determine, in the first frame sequence constituting the first voice, the earliest frame that has not yet been determined as the first target frame.

S402. The determined frame is used as a first target frame determined from a first frame sequence constituting the first voice.

Optionally, the first voice consists of a first frame sequence, and the first frame sequence is composed of at least one frame arranged in order. Determining a frame from the first frame sequence constituting the first voice as the first target frame includes: selecting one frame from the first frame sequence as the first target frame, the first target frame being the earliest-ordered frame among the frames in the first frame sequence that have not yet been determined as the first target frame.

To facilitate understanding of the voice keyword recognition method provided by the embodiment of the present application, a flowchart of a method for selecting a keyword from the keyword sequence included in the voice keyword as the target keyword is provided; please refer to FIG. 5.

As shown in FIG. 5, the method includes:

S501. Determine, from a keyword sequence included in the voice keyword, a next keyword adjacent to the keyword corresponding to the keyword template that has been successfully matched last time;

Optionally, the keyword sequence is composed of multiple keywords that are sequentially sorted.

For example, if the keyword sequence included in the voice keyword is "Little Red Hello" and the keyword corresponding to the keyword template that was last successfully matched is "red", then the next keyword adjacent to that keyword in the keyword sequence is the keyword "you".

S502. Determine whether the number of times the next keyword has been consecutively determined as the target keyword reaches a preset threshold; if it has not reached the preset threshold, step S503 is performed; if it has reached the preset threshold, step S504 is performed.

Optionally, the preset threshold is 30 times. The foregoing is only an optional manner of the threshold provided by the embodiment of the present application. The inventor may arbitrarily set the specific content of the threshold according to his own needs, which is not limited herein.

S503. Determine the next keyword as the target keyword.

S504. Determine a first keyword in the keyword sequence as the target keyword.

For example, if the keyword sequence included in the voice keyword is "Little Red Hello", determining the first keyword in the keyword sequence as the target keyword means determining the first keyword, "Little", as the target keyword.

To facilitate understanding of the voice keyword recognition method provided by the embodiment of the present application, a flowchart of a method for generating the keyword template corresponding to the target keyword is provided; please refer to FIG. 6.
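Steps S501 to S504 can be sketched as a small pure function. This is an illustration only: keywords are referred to by 0-based index, a value of -1 for "no keyword matched yet" is an assumption, and the caller is assumed to keep the counter of consecutive selections; none of these details come from the source.

```python
def select_target_keyword(last_matched, attempts, threshold=30):
    """Return the index of the keyword to use as the target keyword.

    last_matched: index of the keyword whose template matched most recently,
        or -1 if none has matched yet (assumption).
    attempts: how many times in a row the next keyword has already been
        chosen as the target keyword (tracked by the caller, assumption).
    threshold: preset threshold of step S502 (30 in the optional example).
    """
    next_keyword = last_matched + 1  # S501: keyword after the last match
    if attempts < threshold:  # S502: threshold not yet reached
        return next_keyword  # S503
    return 0  # S504: fall back to the first keyword in the sequence


# With four keywords indexed 0..3 and the second one (index 1) matched last:
target = select_target_keyword(last_matched=1, attempts=5)
```

The fallback in S504 means that a partial match which is not continued within the threshold number of frames is abandoned, and matching restarts from the first keyword.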

As shown in FIG. 6, the method includes:

S601. Determine a second voice that includes a target keyword, where the second voice is composed of a second frame sequence.

Optionally, the process of generating a keyword template corresponding to the target keyword includes: determining a second voice that includes the target keyword, wherein the second voice is composed of a second frame sequence, and the second frame sequence is composed of at least one frame arranged in order.

S602. The second voice is used as the input information of the preset voice model, and the final layer feature vector corresponding to each frame in the second frame sequence is determined respectively.

Optionally, a voice model is pre-set, and the input information of the voice model is a voice (eg, a second voice)/frame, and the output information may include a hidden layer feature vector and a final layer feature vector respectively corresponding to each frame input.

In the embodiment of the present application, optionally, the second voice is used as the input information of the voice model, and the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice is obtained.

S603. Determine, according to a final layer feature vector corresponding to each frame, a second target frame from the second frame sequence.

Optionally, one frame is selected from the second voice as the second target frame according to the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice.

S604. Generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, which is obtained by using the second target frame as the input information of the voice model.

Optionally, the hidden layer feature vector corresponding to the second target frame may already be obtained in step S602: when the second voice is used as the input information of the preset voice model, both the final layer feature vector and the hidden layer feature vector corresponding to each frame in the second frame sequence are determined. Then, when step S604 is performed, the hidden layer feature vector corresponding to the second target frame is taken directly from the per-frame hidden layer feature vectors produced in step S602.

The above is only an optional manner of the embodiment of the present application. The inventor may arbitrarily set, according to his own needs, how the hidden layer feature vector corresponding to the second target frame is obtained by using the second target frame as the input information of the voice model, for example by obtaining it in step S602 as described above, which is not limited herein.

Optionally, there is at least one second voice, and generating the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame includes: determining, for each second voice, the hidden layer feature vector corresponding to the second target frame of that second voice; averaging the determined hidden layer feature vectors; and using the result as the keyword template corresponding to the target keyword.
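The averaging step can be sketched as follows. The function name is hypothetical, and the sketch assumes an element-wise arithmetic mean over vectors of equal length, which is the plain reading of "averaged" in the text.

```python
def build_keyword_template(hidden_vectors):
    """Element-wise mean of the second-target-frame hidden layer vectors.

    hidden_vectors: one hidden layer feature vector per second voice, each
        taken at that voice's second target frame.
    """
    if not hidden_vectors:
        raise ValueError("at least one second voice is required")
    dim = len(hidden_vectors[0])
    if any(len(v) != dim for v in hidden_vectors):
        raise ValueError("all hidden layer vectors must have the same length")
    count = len(hidden_vectors)
    return [sum(v[i] for v in hidden_vectors) / count for i in range(dim)]


# Two second voices containing the same target keyword:
template = build_keyword_template([[1.0, 2.0], [3.0, 4.0]])
```

With a single second voice the template is simply that voice's hidden layer feature vector.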

To facilitate understanding of the voice keyword recognition method provided by the embodiment of the present application, the method for determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame is described in detail below.

In the embodiment of the present application, optionally, the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in the text set preset in the voice model, and the target keyword is one character in the text set.

For example, if the text set is 5200 Chinese characters, the final layer feature vector corresponding to the frame includes: the similarity between the frame and each of the 5200 Chinese characters.

Determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame includes: selecting, from the second frame sequence and according to the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.

For ease of understanding, a flowchart of a method for selecting, from the second frame sequence and based on the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame is provided; please refer to FIG. 7.

As shown in Figure 7, the method includes:

S701. Determine at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of such characters is less than a preset value.

S702. Determine at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being every first candidate frame whose similarity to the target keyword is the greatest among the at least one first candidate frame.

S703. Determine the second target frame from the at least one second candidate frame, wherein, when the similarities between a frame and the characters are sorted from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities of the second target frame is higher than the corresponding rank for every other second candidate frame.

Further, to facilitate understanding of the method shown in FIG. 7, in which the frame with the highest degree of similarity to the target keyword is selected from the second frame sequence as the second target frame based on the final layer feature vector corresponding to each frame, an example is given below:

Suppose the second frame sequence included in the second voice includes four frames, namely frame 1, frame 2, frame 3 and frame 4, and the text set preset in the voice model includes four characters, namely character 1, character 2, character 3 and character 4, where character 3 is the target keyword.

The second voice is input to the voice model as its input information, yielding final layer feature vector 1 corresponding to frame 1, final layer feature vector 2 corresponding to frame 2, final layer feature vector 3 corresponding to frame 3, and final layer feature vector 4 corresponding to frame 4.

Final layer feature vector 1 includes similarity 11 between frame 1 and character 1, similarity 12 between frame 1 and character 2, similarity 13 between frame 1 and character 3, and similarity 14 between frame 1 and character 4, where similarity 11 is 20%, similarity 12 is 30%, similarity 13 is 15%, and similarity 14 is 50%;

Final layer feature vector 2 includes similarity 21 between frame 2 and character 1, similarity 22 between frame 2 and character 2, similarity 23 between frame 2 and character 3, and similarity 24 between frame 2 and character 4, where similarity 21 is 15%, similarity 22 is 5%, similarity 23 is 65%, and similarity 24 is 95%;

Final layer feature vector 3 includes similarity 31 between frame 3 and character 1, similarity 32 between frame 3 and character 2, similarity 33 between frame 3 and character 3, and similarity 34 between frame 3 and character 4, where similarity 31 is 10%, similarity 32 is 20%, similarity 33 is 65%, and similarity 34 is 30%;

Final layer feature vector 4 includes similarity 41 between frame 4 and character 1, similarity 42 between frame 4 and character 2, similarity 43 between frame 4 and character 3, and similarity 44 between frame 4 and character 4, where similarity 41 is 10%, similarity 42 is 20%, similarity 43 is 55%, and similarity 44 is 30%.

First, determining at least one first candidate frame from the second frame sequence, the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and the at least one character in the text set, and the number of the at least one character is less than The preset value, if the preset value is 3, indicates that at least one first candidate frame is determined from the second frame sequence, and specifically, the similarity between the first candidate frame and each character in the text set is from large to large The small order is arranged to obtain a sequence, and the similarity between the first candidate frame and the target keyword is within the first 3 digits of the sequence (the similarity between the first candidate frame and the target keyword is located in the first and second positions of the sequence) Bit or third place). At this time, at least one first candidate frame determined from the second frame sequence includes three, which are frame 2, frame 3, and frame 4.

At least one second candidate frame is then determined from the at least one first candidate frame. Here, similarity 23 and similarity 33 are equal, both being 65%, while similarity 43 is 55%. Therefore, two second candidate frames are determined from the at least one first candidate frame: frame 2 and frame 3.

A second target frame is then determined from the at least one second candidate frame. Similarity 33 ranks first among the similarities corresponding to frame 3, whereas similarity 23 ranks second among the similarities corresponding to frame 2. Therefore, frame 3, whose similarity to the target keyword ranks first, is selected as the second target frame.
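The three-stage selection in this example can be traced with a minimal Python sketch. It is an illustration under assumptions rather than the implementation of the application: text 3 is taken to be the target keyword (as the comparison of similarities 23, 33, and 43 implies), only frames 2 to 4 are included, and the names `final_layer`, `rank_of_target`, and `select_second_target_frame` are hypothetical.

```python
# Similarities (in percent) of frames 2-4 to texts 1-4, copied from the example.
final_layer = {
    2: [15, 5, 65, 95],
    3: [10, 20, 65, 30],
    4: [10, 20, 55, 30],
}
TARGET = 2  # zero-based index of text 3, assumed here to be the target keyword
PRESET = 3  # preset value used in the example


def rank_of_target(sims, target):
    """1-based rank of the target similarity when sims are sorted high to low."""
    return 1 + sum(1 for s in sims if s > sims[target])


def select_second_target_frame(vectors, target, preset):
    # Stage 1: keep frames where fewer than `preset` texts beat the target,
    # i.e. the target similarity is within the first `preset` positions.
    first = [f for f, sims in vectors.items()
             if rank_of_target(sims, target) <= preset]
    # Stage 2: keep the frames whose similarity to the target is the largest.
    best = max(vectors[f][target] for f in first)
    second = [f for f in first if vectors[f][target] == best]
    # Stage 3: pick the frame in which the target similarity ranks highest.
    return min(second, key=lambda f: rank_of_target(vectors[f], target))


print(select_second_target_frame(final_layer, TARGET, PRESET))  # prints 3
```

The sketch does not handle the case in which no frame passes the first stage; a real implementation would need to decide what to do then.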

The above example makes the voice keyword recognition method provided by the embodiment of the present application clearer and more complete, and easier for those skilled in the art to understand.

Further, in order to facilitate understanding of the voice keyword recognition method provided by the foregoing embodiment, the method is described in more detail below with reference to FIG. 8.

As shown in Figure 8, the method includes:

It should be noted that, in this method, each frame in the first frame sequence of the first voice is provided with a unique frame ID, wherein the sequence number of a frame in the first frame sequence is the frame ID of that frame. For example, the first frame sequence includes three frames in order: frame 1, frame 3, and frame 2. Then the sequence number of frame 1 is 1 and its frame ID is 1; the sequence number of frame 3 is 2 and its frame ID is 2; the sequence number of frame 2 is 3 and its frame ID is 3.

Optionally, each keyword in the keyword sequence included in the voice keyword is provided with a unique keyword ID, wherein the sequence number of a keyword in the keyword sequence is the keyword ID of that keyword. For example, the keyword sequence includes four keywords in order: keyword 1, keyword 3, keyword 2, and keyword 4. Then the sequence number of keyword 1 is 1 and its keyword ID is 1; the sequence number of keyword 3 is 2 and its keyword ID is 2; the sequence number of keyword 2 is 3 and its keyword ID is 3; the sequence number of keyword 4 is 4 and its keyword ID is 4.
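The ID convention above amounts to numbering items by position, whatever their labels. A minimal Python sketch, with hypothetical variable names, reproduces the two examples:

```python
# Labels in the order in which they occur in each sequence (from the examples).
frames_in_order = ["frame 1", "frame 3", "frame 2"]
keywords_in_order = ["keyword 1", "keyword 3", "keyword 2", "keyword 4"]

# The ID is the 1-based sequence number, independent of the label.
frame_ids = {label: pos for pos, label in enumerate(frames_in_order, start=1)}
keyword_ids = {label: pos for pos, label in enumerate(keywords_in_order, start=1)}

print(frame_ids["frame 3"])      # prints 2
print(keyword_ids["keyword 2"])  # prints 3
```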

S801, initializing: frame ID n=0; keyword ID m=1; the counter s is set to zero;

S802, i=n++; determining whether the hidden layer feature vector of the i-th frame in the first frame sequence of the first voice matches the keyword template corresponding to the m-th keyword in the voice keyword; if the matching succeeds, step S803 is performed; if the matching fails, step S806 is performed;

S803, determining whether the current keyword is the last keyword in the keyword sequence included in the voice keyword; if yes, executing step S804; if not, executing step S805;

S804, determining that the first voice includes the voice keyword;

S805, setting the counter s to the trigger initial value; n++; returning to step S802;

Optionally, the trigger initial value is the threshold involved in the foregoing step S502. Optionally, the initial value of the trigger is 30.

The above is only an optional value of the trigger initial value provided by the embodiment of the present application. The inventor can set the specific value of the trigger initial value according to his own needs, which is not limited herein.

S806, s--;

Optionally, s-- indicates that the counter count is decremented by one.

S807, determining whether the count s of the counter is greater than 0; if yes, returning to step S802; if no, returning to step S801.

The above is only an optional implementation of the voice keyword recognition method provided by the embodiment of the present application. The inventor can set a specific implementation of the voice keyword recognition method as needed according to the embodiment of the present application, which is not limited herein.
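The flow of steps S801 to S807 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application. The function name `contains_voice_keyword` and the `matches` callback are hypothetical, since the steps do not fix how a hidden layer feature vector is compared with a keyword template. The sketch also assumes that the keyword index advances after a successful match and that a reset to S801 returns to the first keyword without rewinding the frames.

```python
from typing import Callable, Iterable, Sequence, TypeVar

V = TypeVar("V")  # type of a hidden layer feature vector
T = TypeVar("T")  # type of a keyword template


def contains_voice_keyword(
    frames: Iterable[V],
    templates: Sequence[T],
    matches: Callable[[V, T], bool],
    trigger_initial: int = 30,
) -> bool:
    """Return True if the templates are matched in order by the frames."""
    if not templates:
        return False
    m = 0  # S801: index of the current keyword (zero-based here)
    s = 0  # S801: counter set to zero
    for vector in frames:                  # S802: take the next frame
        if matches(vector, templates[m]):
            if m == len(templates) - 1:    # S803: last keyword?
                return True                # S804: keyword recognized
            s = trigger_initial            # S805: re-arm the counter
            m += 1                         # assumption: move to the next keyword
        else:
            s -= 1                         # S806: decrement the counter
            if s <= 0:                     # S807: counter used up
                m, s = 0, 0                # back to S801 (frames not rewound)
    return False
```

With equality as a toy stand-in for matching, `contains_voice_keyword([9, 1, 9, 2, 3], [1, 2, 3], lambda v, t: v == t, trigger_initial=2)` returns `True`, because the single non-matching frame between keywords is tolerated by the counter; with `trigger_initial=1` the same input returns `False`.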

The method is described in detail in the embodiments disclosed above. The method of the present invention can be implemented by apparatuses of various forms; therefore, the present invention also discloses an apparatus, and specific embodiments are described in detail below.

FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.

As shown in Figure 9, the device includes:

a first target frame determining unit 91, configured to select a first target frame from a first frame sequence constituting the first voice;

a target keyword determining unit 92, configured to select a keyword from a keyword sequence and determine it as the target keyword, wherein the keyword sequence belongs to the voice keyword;

a matching unit 93, configured to: if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determine, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches the keyword template, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

an identifying unit 94, configured to determine that the first voice includes the voice keyword if it is determined that hidden layer feature vectors of frames in the first voice are successfully matched, one by one, with the keyword templates corresponding to each keyword in the keyword sequence.

Further, the voice keyword recognition apparatus provided by the embodiment of the present application further includes: a return execution unit, configured to, when the matching fails, return to perform the step of "selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame".

An embodiment of the present invention provides an optional structure of the first target frame determining unit 91.

Optionally, the first target frame determining unit 91 includes:

a first determining unit, configured to determine, from the first frame sequence constituting the first voice, a frame that has never been determined as the first target frame;

and a second determining unit, configured to use the frame as the first target frame determined from the first frame sequence constituting the first voice.

An embodiment of the present invention provides an optional structure of the target keyword determining unit 92.

Optionally, the target keyword determining unit 92 includes:

a third determining unit, configured to determine, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword whose keyword template was most recently matched successfully;

a fourth determining unit, configured to determine the next keyword as the target keyword if the number of times the next keyword is consecutively determined as the target keyword does not reach a preset threshold;

and a fifth determining unit, configured to determine the first keyword in the keyword sequence as the target keyword if the number of times the next keyword is consecutively determined as the target keyword reaches the threshold.
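The cooperation of the third, fourth, and fifth determining units can be illustrated with a small Python sketch. It is only one possible reading of the description: the class `TargetKeywordSelector` and its methods are hypothetical names, and the bookkeeping of the consecutive count is an assumption.

```python
class TargetKeywordSelector:
    """Chooses the target keyword, falling back to the first keyword."""

    def __init__(self, keywords, threshold):
        self.keywords = list(keywords)
        self.threshold = threshold
        self.last_matched = -1  # index of the last matched keyword, -1 if none
        self.consecutive = 0    # times the next keyword was chosen in a row

    def select(self):
        # Third unit: the keyword following the last successfully matched one.
        nxt = self.last_matched + 1
        if self.consecutive < self.threshold:
            # Fourth unit: threshold not reached, keep the next keyword.
            self.consecutive += 1
            return self.keywords[nxt]
        # Fifth unit: threshold reached, restart from the first keyword.
        self.last_matched = -1
        self.consecutive = 1
        return self.keywords[0]

    def report_match(self):
        """Record a successful match; return True if it was the last keyword."""
        self.last_matched += 1
        self.consecutive = 0
        if self.last_matched == len(self.keywords) - 1:
            self.last_matched = -1  # start over for the next utterance
            return True
        return False
```

A caller would invoke `select()` once per frame and `report_match()` whenever the frame matches the template of the selected keyword. With a threshold of 2, the second keyword is tried on two consecutive frames before the selector falls back to the first keyword.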

Further, the voice keyword recognition apparatus provided by the embodiment of the present application further includes: a keyword template generating unit.

An optional structure of the keyword template generating unit provided by the embodiment of the present invention is shown in FIG. 10.

As shown in FIG. 10, the keyword template generating unit includes:

a second voice determining unit 101, configured to determine a second voice that includes the target keyword, where the second voice is composed of a second sequence of frames;

a final layer feature vector determining unit 102, configured to use the second voice as input information of a preset voice model and determine a final layer feature vector corresponding to each frame in the second frame sequence;

a second target frame determining unit 103, configured to determine a second target frame from the second frame sequence according to a final layer feature vector corresponding to each frame respectively;

a keyword template generating sub-unit 104, configured to generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, the hidden layer feature vector being obtained by using the second target frame as input information of the voice model.

In the embodiment of the present application, optionally, the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in a text set preset in the voice model, and the target keyword is a character in the text set. The second target frame determining unit is specifically configured to: select, from the second frame sequence and based on the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.
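The template generation performed by units 101 to 104 can be sketched as a short pipeline. The sketch below is an illustration under assumptions: the voice model is represented by two hypothetical callables, `final_layer` (second voice to one final layer feature vector per frame) and `hidden_layer` (one frame to its hidden layer feature vector), and the second target frame is chosen by the simplest reading, namely the largest similarity to the target keyword.

```python
def build_keyword_template(second_frames, final_layer, hidden_layer, target_index):
    """Return the hidden layer feature vector of the selected second target frame."""
    # One final layer feature vector (similarity to each character) per frame.
    final_vectors = final_layer(second_frames)
    # Simplest reading: pick the frame most similar to the target keyword.
    best = max(range(len(second_frames)),
               key=lambda i: final_vectors[i][target_index])
    # The hidden layer feature vector of that frame serves as the template.
    return hidden_layer(second_frames[best])


# Toy stand-ins for the voice model, for illustration only.
def toy_final_layer(frames):
    return [[1 - f, f] for f in frames]


def toy_hidden_layer(frame):
    return [frame, frame * 2]


template = build_keyword_template(
    [0.25, 0.75, 0.5], toy_final_layer, toy_hidden_layer, target_index=1
)
print(template)  # prints [0.75, 1.5]
```

Passing the model as two callables keeps the sketch independent of any particular network; the rank-based refinement with first and second candidate frames could replace the `max` step without changing the rest.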

An embodiment of the present invention provides an optional structure of the second target frame determining unit, which is shown in FIG. 11.

As shown in FIG. 11, the second target frame determining unit includes:

a first candidate frame determining unit 111, configured to determine at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of the at least one character is less than a preset value;

a second candidate frame determining unit 112, configured to determine at least one second candidate frame from the at least one first candidate frame, wherein the at least one second candidate frame is each first candidate frame, among the at least one first candidate frame, that has the highest similarity to the target keyword;

a second target frame determining sub-unit 113, configured to determine a second target frame from the at least one second candidate frame, wherein, with similarities ordered from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities between the second target frame and each character is higher than, for each second candidate frame other than the second target frame, the rank of the similarity between that second candidate frame and the target keyword among the similarities between that second candidate frame and each character.

In summary:

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

A person skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both. The software module can be placed in random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments are obvious to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but the scope of the invention is to be accorded

Claims

A voice keyword recognition method, comprising:

Selecting a first target frame from a first frame sequence constituting a first voice;

Selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to a voice keyword;

If a hidden layer feature vector of the first target frame is successfully matched with a keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches the keyword template, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

If it is determined that hidden layer feature vectors of frames in the first voice are successfully matched, one by one, with the keyword templates corresponding to each keyword in the keyword sequence, determining that the first voice includes the voice keyword.
The method according to claim 1, wherein in the case that the matching fails, the method further comprises:

Returning to perform the step of selecting a first target frame from the first frame sequence constituting the first voice.
The method according to claim 2, wherein the selecting a first target frame from the first sequence of frames constituting the first speech comprises:

Determining, from the first frame sequence constituting the first voice, a frame that has never been determined as the first target frame;

Using the frame as the first target frame determined from the first frame sequence constituting the first voice.
The method according to any one of claims 1 to 3, wherein the selecting a keyword from the keyword sequence to determine the target keyword comprises:

Determining, from the sequence of keywords included in the voice keyword, a next keyword adjacent to a keyword corresponding to a keyword template that has been successfully matched last time;

If the number of times the next keyword is continuously determined as the target keyword does not reach the preset threshold, the next keyword is determined as the target keyword;

If the number of times the next keyword is continuously determined as the target keyword reaches the threshold, the first keyword in the keyword sequence is determined as the target keyword.
The method according to any one of claims 1 to 4, wherein the process of generating the keyword template comprises:

Determining a second speech comprising the target keyword, the second speech being composed of a second sequence of frames;

Determining, by using the second voice as input information of a preset voice model, a final layer feature vector corresponding to each frame in the second frame sequence;

Determining a second target frame from the second frame sequence according to a final layer feature vector corresponding to each frame respectively;

Generating a keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame obtained by using the second target frame as input information of the voice model.
The method according to claim 5, wherein the final layer feature vector corresponding to a frame comprises the similarity between the frame and each character in a text set preset in the voice model, and the target keyword is a character in the text set;

The determining a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame comprises:

Selecting, from the second frame sequence according to the final layer feature vector corresponding to each frame, a frame with the highest degree of similarity to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.
The method according to claim 6, wherein the selecting, from the second frame sequence according to the final layer feature vector corresponding to each frame, a frame with the highest degree of similarity to the target keyword as the second target frame comprises:

Determining at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of the at least one character is less than a preset value;

Determining at least one second candidate frame from the at least one first candidate frame, wherein the at least one second candidate frame is each first candidate frame, among the at least one first candidate frame, that has the greatest similarity to the target keyword;

Determining a second target frame from the at least one second candidate frame, wherein, with similarities ordered from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities between the second target frame and each character is higher than, for each second candidate frame other than the second target frame, the rank of the similarity between that second candidate frame and the target keyword among the similarities between that second candidate frame and each character.
A voice keyword recognition device, comprising:

a first target frame determining unit, configured to select a first target frame from a first frame sequence constituting the first voice;

a target keyword determining unit, configured to select a keyword from the keyword sequence as the target keyword, wherein the keyword sequence belongs to the voice keyword;

a matching unit, configured to: if a hidden layer feature vector of the first target frame is successfully matched with a keyword template corresponding to the target keyword, determine, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches the keyword template, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

an identifying unit, configured to determine that the first voice includes the voice keyword if it is determined that hidden layer feature vectors of frames in the first voice are successfully matched, one by one, with the keyword templates corresponding to each keyword in the keyword sequence.
The apparatus according to claim 8, further comprising: a return execution unit, configured to, when the matching fails, return to perform the step of selecting a first target frame from the first frame sequence constituting the first voice.
The apparatus according to claim 9, wherein the first target frame determining unit comprises:

a first determining unit, configured to determine, from the first frame sequence constituting the first voice, a frame that has never been determined as the first target frame;

and a second determining unit, configured to use the frame as the first target frame determined from the first frame sequence constituting the first voice.
The device according to any one of claims 8 to 10, wherein the target keyword determining unit comprises:

a third determining unit, configured to determine, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword whose keyword template was most recently matched successfully;

a fourth determining unit, configured to determine the next keyword as the target keyword if the number of times the next keyword is consecutively determined as the target keyword does not reach a preset threshold;

and a fifth determining unit, configured to determine the first keyword in the keyword sequence as the target keyword if the number of times the next keyword is consecutively determined as the target keyword reaches the threshold.
The device according to any one of claims 8 to 11, further comprising a keyword template generating unit, the keyword template generating unit comprising:

a second voice determining unit, configured to determine a second voice that includes the target keyword, where the second voice is composed of a second sequence of frames;

a final layer feature vector determining unit, configured to use the second voice as input information of a preset voice model, and determine a final layer feature vector corresponding to each frame in the second frame sequence;

a second target frame determining unit, configured to determine a second target frame from the second frame sequence according to a final layer feature vector corresponding to each frame respectively;

a keyword template generating subunit, configured to generate a keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, the hidden layer feature vector being obtained by using the second target frame as input information of the voice model.
The apparatus according to claim 12, wherein the final layer feature vector corresponding to a frame comprises the similarity between the frame and each character in a text set preset in the voice model, and the target keyword is a character in the text set;

The second target frame determining unit is configured to: select, from the second frame sequence according to the final layer feature vector corresponding to each frame, a frame with the highest degree of similarity to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.
The apparatus according to claim 13, wherein the second target frame determining unit comprises:

a first candidate frame determining unit, configured to determine at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of the at least one character is less than a preset value;

a second candidate frame determining unit, configured to determine at least one second candidate frame from the at least one first candidate frame, wherein the at least one second candidate frame is each first candidate frame, among the at least one first candidate frame, that has the greatest similarity to the target keyword;

a second target frame determining subunit, configured to determine a second target frame from the at least one second candidate frame, wherein, with similarities ordered from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities between the second target frame and each character is higher than, for each second candidate frame other than the second target frame, the rank of the similarity between that second candidate frame and the target keyword among the similarities between that second candidate frame and each character.
A terminal, comprising a memory and a processor, wherein the memory is configured to store a program, the processor calls the program, and the program is used for:

Selecting a first target frame from a first frame sequence constituting a first voice;

Selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to a voice keyword;

If a hidden layer feature vector of the first target frame is successfully matched with a keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches the keyword template, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

If it is determined that hidden layer feature vectors of frames in the first voice are successfully matched, one by one, with the keyword templates corresponding to each keyword in the keyword sequence, determining that the first voice includes the voice keyword.
A voice keyword recognition server, comprising a memory and a processor, wherein the memory is configured to store a program, the processor calls the program, and the program is used for:

Selecting a first target frame from a first frame sequence constituting a first voice;

Selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to a voice keyword;

If a hidden layer feature vector of the first target frame is successfully matched with a keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches the keyword template, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;

If it is determined that hidden layer feature vectors of frames in the first voice are successfully matched, one by one, with the keyword templates corresponding to each keyword in the keyword sequence, determining that the first voice includes the voice keyword.
A computer readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 7.