WO2018219023A1 - Speech keyword identification method and device, terminal and server - Google Patents

Speech keyword identification method and device, terminal and server

Info

Publication number
WO2018219023A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
frame
target
voice
sequence
Prior art date
Application number
PCT/CN2018/079769
Other languages
French (fr)
Chinese (zh)
Inventor
王珺
黄志恒
于蒙
蒲松柏
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018219023A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • The present invention relates to the field of voice recognition technology, and in particular to a voice keyword recognition method, device, terminal, and server.
  • Voice wake-up technology is increasingly widely used in electronic devices. It greatly simplifies operating such devices, because users can interact with them without manual input.
  • A voice keyword activates the corresponding processing module in the electronic device.
  • For example, an Apple mobile phone uses the keyword "siri" as the voice keyword that activates its voice dialogue assistant function.
  • When the Apple mobile phone detects that the user has input speech containing the keyword "siri", it automatically activates the voice dialogue assistant function in the phone.
  • Providing a voice keyword recognition method, device, terminal, and server that can recognize voice keywords in speech is therefore crucial for the development of voice wake-up technology.
  • In view of this, an embodiment of the present invention provides a voice keyword recognition method, apparatus, terminal, and server to implement recognition of voice keywords in speech.
  • the embodiment of the present invention provides the following technical solutions:
  • a voice keyword recognition method includes:
  • selecting a keyword from the keyword sequence and determining it as the target keyword, wherein the keyword sequence belongs to the voice keyword;
  • if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determining, one by one, whether the keyword template corresponding to each keyword in the keyword sequence has been matched by the hidden layer feature vector of a frame in the first voice, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;
  • if the keyword template corresponding to each keyword in the keyword sequence has been successfully matched by the hidden layer feature vector of a frame in the first voice, determining that the first voice includes the voice keyword.
  • a voice keyword recognition device includes:
  • a first target frame determining unit configured to select a first target frame from a first frame sequence constituting the first voice
  • a target keyword determining unit configured to select a keyword from the keyword sequence as the target keyword, wherein the keyword sequence belongs to the voice keyword
  • a matching unit configured to, if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determine, one by one, whether the keyword template corresponding to each keyword in the keyword sequence has been matched by the hidden layer feature vector of a frame in the first voice, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;
  • an identifying unit configured to determine that the first voice includes the voice keyword if the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice.
  • a terminal includes a memory for storing a program, and a processor calling the program, the program for:
  • selecting a keyword from the keyword sequence and determining it as the target keyword, wherein the keyword sequence belongs to the voice keyword;
  • if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determining, one by one, whether the keyword template corresponding to each keyword in the keyword sequence has been matched by the hidden layer feature vector of a frame in the first voice, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;
  • if the keyword template corresponding to each keyword in the keyword sequence has been successfully matched by the hidden layer feature vector of a frame in the first voice, determining that the first voice includes the voice keyword.
  • a voice keyword recognition server includes a memory and a processor, the memory is used to store a program, and the processor calls the program, the program is used to:
  • selecting a keyword from the keyword sequence and determining it as the target keyword, wherein the keyword sequence belongs to the voice keyword;
  • if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determining, one by one, whether the keyword template corresponding to each keyword in the keyword sequence has been matched by the hidden layer feature vector of a frame in the first voice, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;
  • if the keyword template corresponding to each keyword in the keyword sequence has been successfully matched by the hidden layer feature vector of a frame in the first voice, determining that the first voice includes the voice keyword.
  • a computer readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the first aspect.
  • a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
  • The embodiments of the invention disclose a voice keyword recognition method, device, terminal, and server. A first target frame is determined from a first frame sequence constituting the first voice, and a target keyword is determined from a keyword sequence included in the voice keyword. When the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword (the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword), and the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice, it is determined that the first voice includes the voice keyword. Recognition of the voice keyword in the first voice is thereby effectively implemented. Further, this makes it convenient for an electronic device using voice wake-up technology to automatically activate the processing module corresponding to the voice keyword upon recognizing that the first voice includes the voice keyword.
  • FIG. 1 is a schematic structural diagram of a voice keyword recognition server according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for identifying a voice keyword according to an embodiment of the present application
  • FIG. 3 is a flowchart of another method for identifying a voice keyword according to an embodiment of the present application
  • FIG. 4 is a flowchart of a method for selecting a frame from a first frame sequence constituting a first voice to be determined as a first target frame according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a method for selecting a keyword from a keyword sequence included in a voice keyword to be determined as a target keyword according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart of a method for generating a keyword template corresponding to a target keyword according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart of a method for selecting a frame with the highest degree of similarity with a target keyword as a second target frame from a second frame sequence based on a final layer feature vector corresponding to each frame according to an embodiment of the present application
  • FIG. 8 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a keyword template generating unit according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a second target frame determining unit according to an embodiment of the present disclosure.
  • the embodiment of the present application provides a voice keyword identification method, which is applied to a terminal or a server.
  • the terminal is an electronic device, for example, a mobile terminal, a desktop, or the like.
  • the above is only an optional manner of the terminal provided by the embodiment of the present application.
  • the inventor can arbitrarily set the specific expression of the terminal according to the requirements of the present application, which is not limited herein.
  • the function of the server (referred to herein as a voice keyword recognition server) to which the voice keyword identification method provided by the embodiment of the present application is applied may be implemented by a single server or by a server cluster composed of multiple servers, which is not limited herein.
  • the voice keyword recognition server includes a processor 11 and a memory 12.
  • the processor 11, the memory 12, and the communication interface 13 complete communication with each other via the communication bus 14.
  • the communication interface 13 may be an interface of the communication module, such as an interface of a Global System for Mobile Communication (GSM) module.
  • the processor 11 is configured to execute a program.
  • the processor 11 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • the memory 12 is used to store a program.
  • the program can include program code, the program code including computer operating instructions.
  • the program may include a program corresponding to the user interface editor described above.
  • the memory 12 may include a high-speed random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk memory.
  • the program can be specifically used to:
  • when the matching is successful, if the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice, determine that the first voice includes the voice keyword.
  • the structure of a terminal provided by the embodiment of the present application includes at least the structure of the voice keyword recognition server as shown in FIG. 1 above.
  • for the structure of the terminal, refer to the description of the structure of the voice keyword recognition server; details are not repeated here.
  • the embodiment of the present application provides a flowchart of a voice keyword recognition method, which is shown in FIG. 2 .
  • the method includes:
  • S201: Select a frame from the first frame sequence constituting the first voice as the first target frame;
  • S203: Determine whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; if the match is successful, step S204 is performed.
  • a voice model is preset. The second voice, which includes the target keyword and is composed of the second frame sequence, is input into the voice model to obtain the hidden layer feature vector of the second target frame in the second voice.
  • the keyword template corresponding to the target keyword indicates the hidden layer feature vector so obtained.
  • the speech model is generated based on a Long Short-Term Memory (LSTM) and a Connectionist Temporal Classification (CTC).
  • the above is only an optional manner for generating a voice model provided by the embodiment of the present application.
  • the inventor can arbitrarily set the specific generation process of the voice model according to his own needs, which is not limited herein.
  • the first voice, which comprises the first frame sequence, is input into the voice model to obtain the hidden layer feature vector corresponding to the first target frame in the first voice.
  • the hidden layer feature vector of the first target frame is compared with the keyword template corresponding to the target keyword to determine whether the two match; if the match is successful, step S204 is performed.
  • determining whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword includes: calculating the cosine distance between the hidden layer feature vector of the first target frame and the keyword template corresponding to the target keyword; if the calculated cosine distance satisfies the preset value, determining that the match is successful; if the calculated cosine distance does not satisfy the preset value, determining that the match has failed.
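The cosine-distance comparison described here can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the threshold value of 0.3, and the choice to treat "satisfies the preset value" as "distance at or below a maximum" are all assumptions, since the text does not state the direction or magnitude of the comparison.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def matches_template(hidden_vec, template, max_distance=0.3):
    """Return True when the frame's hidden layer vector matches the template.

    max_distance is a hypothetical preset value; the source only says the
    distance must "satisfy the preset value".
    """
    return cosine_distance(hidden_vec, template) <= max_distance
```

A frame whose hidden layer vector points in nearly the same direction as the template gives a distance near 0 and is accepted; an orthogonal vector gives a distance of 1 and is rejected.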
  • after the match in step S203 succeeds, it is determined whether the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice; if so, it is determined that the first voice includes the voice keyword.
  • FIG. 3 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.
  • the method includes:
  • S303: Determine whether the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; if the match is successful, step S304 is performed; if the match is unsuccessful, the process returns to step S301;
  • S304: Determine whether the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice; if yes, step S305 is performed; otherwise, the process returns to step S301;
  • the condition that the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice includes: the keyword template corresponding to every keyword in the keyword sequence has been successfully matched by the hidden layer feature vector of a frame in the first voice; and sorting the keywords whose templates were matched, in the order in which the matches succeeded, yields the keyword sequence.
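The loop formed by steps S301 to S305 can be sketched as below. This is a simplified illustration under stated assumptions: `contains_voice_keyword` and its parameters are hypothetical names, the match test is passed in as a callable, and the retry threshold that resets the target keyword to the first keyword is deliberately omitted.

```python
def contains_voice_keyword(frame_vectors, templates, is_match):
    """Scan frames in order and match keyword templates one by one.

    frame_vectors: hidden layer feature vectors of the first frame sequence.
    templates: keyword templates, in keyword-sequence order.
    is_match: callable(hidden_vec, template) -> bool (hypothetical hook).
    """
    if not templates:
        return False
    next_kw = 0  # index of the current target keyword
    for vec in frame_vectors:             # S301: take the next unused frame
        template = templates[next_kw]     # S302: template of the target keyword
        if is_match(vec, template):       # S303: compare frame with template
            next_kw += 1
            if next_kw == len(templates):  # S304: every template matched in order
                return True                # S305: voice keyword is present
    return False
```

Because the target keyword only advances after a successful match, the templates are necessarily matched in keyword-sequence order, which is the ordering condition the text states.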
  • a flowchart of a method for determining a frame from the first frame sequence constituting the first voice as the first target frame is provided; refer to FIG. 4.
  • the method includes:
  • the determined frame is used as a first target frame determined from a first frame sequence constituting the first voice.
  • the first speech comprises a first sequence of frames
  • the first sequence of frames is composed of at least one frame arranged in sequence.
  • Determining a frame from the first frame sequence constituting the first voice as the first target frame includes: selecting one frame from the first frame sequence as the first target frame, where the first target frame is the earliest-ordered frame in the first frame sequence that has never been determined as the first target frame.
  • a flowchart of a method for selecting a keyword from the keyword sequence included in the voice keyword as the target keyword is provided; refer to FIG. 5.
  • the method includes:
  • S501 Determine, from a keyword sequence included in the voice keyword, a next keyword adjacent to the keyword corresponding to the keyword template that has been successfully matched last time;
  • the keyword sequence is composed of multiple keywords that are sequentially sorted.
  • for example, if the keyword sequence included in the voice keyword is “Little Red Hello” and the keyword corresponding to the keyword template of the last successful match is “red”, then the next keyword adjacent to that keyword in the keyword sequence is the keyword "you”.
  • S502: Determine whether the number of times the next keyword has been consecutively determined as the target keyword reaches a preset threshold; if it does not reach the preset threshold, step S503 is performed; if it reaches the threshold, step S504 is performed;
  • the preset threshold is 30 times.
  • the foregoing is only an optional manner of the threshold provided by the embodiment of the present application.
  • the inventor may arbitrarily set the specific content of the threshold according to his own needs, which is not limited herein.
  • determining the first keyword in the keyword sequence as the target keyword includes, for example, determining the first keyword "Small" in the keyword sequence as the target keyword.
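The selection logic of steps S501 to S504 can be sketched as a small state machine. This is an illustrative sketch only: the class and method names are hypothetical, and the reading that the next keyword may be chosen exactly `threshold` times in a row before falling back to the first keyword is an assumption about how "reaches the threshold" is counted.

```python
class TargetKeywordSelector:
    """Tracks which keyword is the current target (hypothetical helper)."""

    def __init__(self, keywords, threshold=30):
        self.keywords = list(keywords)
        self.threshold = threshold  # 30 is the example threshold in the text
        self.next_index = 0         # keyword following the last matched one (S501)
        self.consecutive = 0        # times that keyword was chosen in a row

    def select(self):
        """Return the target keyword for the next frame."""
        if self.consecutive >= self.threshold:  # S502: threshold reached
            self.next_index = 0                 # S504: restart from first keyword
            self.consecutive = 0
        self.consecutive += 1
        return self.keywords[self.next_index]   # S503 (or S504 after a reset)

    def report_match(self):
        """Record a successful match; return True once all keywords matched."""
        self.next_index += 1
        self.consecutive = 0
        if self.next_index == len(self.keywords):
            self.next_index = 0
            return True
        return False
```

The fallback keeps a stale partial match from persisting: if the next keyword is not heard within the threshold number of frames, detection restarts from the first keyword.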
  • a flowchart of a method for generating a keyword template corresponding to a target keyword is provided; refer to FIG. 6.
  • the method includes:
  • the process of generating a keyword template corresponding to the target keyword includes: determining a second voice that includes the target keyword, where the second voice is composed of a second frame sequence, and the second frame sequence is composed of at least one frame arranged in sequence.
  • the second voice is used as the input information of the preset voice model, and the final layer feature vector corresponding to each frame in the second frame sequence is determined respectively.
  • a voice model is pre-set, and the input information of the voice model is a voice (eg, a second voice)/frame, and the output information may include a hidden layer feature vector and a final layer feature vector respectively corresponding to each frame input.
  • the second voice is used as the input information of the voice model, and the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice is obtained.
  • one frame is selected as the second target frame from the second voice according to the end layer feature vector corresponding to each frame in the second frame sequence included in the second voice.
  • the process of using the second target frame as the input information of the voice model to obtain the hidden layer feature vector corresponding to the second target frame may be implemented within step S602: the second voice is used as the input information of the preset voice model to determine both the final layer feature vector and the hidden layer feature vector corresponding to each frame in the second frame sequence. Then, when step S604 is performed, the hidden layer feature vector corresponding to the second target frame is obtained directly from the per-frame hidden layer feature vectors produced in step S602.
  • implementing this process within step S602 is only an optional manner, which is not limited herein.
  • the number of second voices is at least one. Generating the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame includes: determining the hidden layer feature vector corresponding to the second target frame of each second voice, averaging the determined hidden layer feature vectors, and using the result as the keyword template corresponding to the target keyword.
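The averaging step can be sketched as an element-wise mean over the hidden layer feature vectors taken from the second target frame of each second voice. The function name is hypothetical, and plain Python lists stand in for whatever vector type the voice model actually produces.

```python
def build_keyword_template(hidden_vectors):
    """Element-wise mean of hidden layer vectors, one per second voice."""
    if not hidden_vectors:
        raise ValueError("at least one second voice is required")
    dim = len(hidden_vectors[0])
    if any(len(vec) != dim for vec in hidden_vectors):
        raise ValueError("all hidden layer vectors must have the same length")
    count = len(hidden_vectors)
    return [sum(vec[i] for vec in hidden_vectors) / count for i in range(dim)]
```

With a single second voice the template is simply that voice's hidden layer vector; additional recordings of the same keyword smooth out speaker-specific variation.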
  • a method for determining a second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame is described below.
  • the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in the text set preset in the voice model, and the target keyword is one character in the text set.
  • for example, the final layer feature vector corresponding to a frame includes the similarity between the frame and each of the 5200 Chinese characters.
  • Determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame includes: selecting, from the second frame sequence, the frame with the highest degree of similarity to the target keyword as the second target frame, according to the final layer feature vector corresponding to each frame; the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.
  • the method includes:
  • S701: Determine at least one first candidate frame from the second frame sequence, where the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of such characters is less than a preset value;
  • S702: Determine at least one second candidate frame from the at least one first candidate frame, where the second candidate frames are those first candidate frames that have the greatest similarity with the target keyword among the at least one first candidate frame;
  • S703: Determine a second target frame from the at least one second candidate frame.
  • when the similarities between a frame and each character are sorted from high to low, the rank of the similarity between the second target frame and the target keyword among the second target frame's similarities is higher than the corresponding rank, for every other second candidate frame, of the similarity between that second candidate frame and the target keyword among that frame's similarities.
  • to aid understanding, the method of selecting the frame with the highest degree of similarity to the target keyword from the second frame sequence as the second target frame is now illustrated with an example:
  • the text set preset in the voice model includes four characters, namely text 1, text 2, text 3, and text 4, where text 3 is the target keyword.
  • the final layer feature vector 1 includes the similarity 11 between frame 1 and text 1, the similarity 12 between frame 1 and text 2, the similarity 13 between frame 1 and text 3, and the similarity 14 between frame 1 and text 4, where similarity 11 is 20%, similarity 12 is 30%, similarity 13 is 15%, and similarity 14 is 50%;
  • the final layer feature vector 2 includes the similarity 21 between frame 2 and text 1, the similarity 22 between frame 2 and text 2, the similarity 23 between frame 2 and text 3, and the similarity 24 between frame 2 and text 4, where similarity 21 is 15%, similarity 22 is 5%, similarity 23 is 65%, and similarity 24 is 95%;
  • the final layer feature vector 3 includes the similarity 31 between frame 3 and text 1, the similarity 32 between frame 3 and text 2, the similarity 33 between frame 3 and text 3, and the similarity 34 between frame 3 and text 4, where similarity 31 is 10%, similarity 32 is 20%, similarity 33 is 65%, and similarity 34 is 30%;
  • the final layer feature vector 4 includes the similarity 41 between frame 4 and text 1, the similarity 42 between frame 4 and text 2, the similarity 43 between frame 4 and text 3, and the similarity 44 between frame 4 and text 4, where similarity 41 is 10%, similarity 42 is 20%, similarity 43 is 55%, and similarity 44 is 30%.
  • in step S701, at least one first candidate frame is determined from the second frame sequence such that the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of such characters is less than the preset value. If the preset value is 3, this means that when the similarities between a first candidate frame and each character in the text set are arranged in descending order, the similarity between the first candidate frame and the target keyword falls within the first three positions of the sequence (first, second, or third place).
  • at least one first candidate frame determined from the second frame sequence includes three, which are frame 2, frame 3, and frame 4.
  • At least one second candidate frame includes two, frame 2 and frame 3, respectively.
  • the similarity 33 corresponding to frame 3 ranks first among the similarities corresponding to frame 3, whereas the similarity 23 corresponding to frame 2 ranks second among the similarities corresponding to frame 2; frame 3, whose target-keyword similarity ranks first, is therefore selected as the second target frame.
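The three-stage selection (S701 to S703) can be checked against the four-frame example with a short sketch. The function names and the list-of-lists representation are assumptions made for illustration; the similarity values are the ones given in the example, with text 3 (index 2) as the target keyword and a preset value of 3.

```python
def rank_of_target(similarities, target_index):
    """1-based rank of the target keyword's similarity within one frame's vector."""
    target = similarities[target_index]
    return 1 + sum(1 for s in similarities if s > target)


def select_second_target_frame(final_layer, target_index, preset=3):
    """Return the index of the second target frame, or None if no frame qualifies."""
    # S701: keep frames where fewer than `preset` characters beat the target,
    # i.e. the target's similarity ranks within the first `preset` positions.
    first = [i for i, sims in enumerate(final_layer)
             if rank_of_target(sims, target_index) <= preset]
    if not first:
        return None
    # S702: keep the candidates with the greatest similarity to the target.
    best = max(final_layer[i][target_index] for i in first)
    second = [i for i in first if final_layer[i][target_index] == best]
    # S703: choose the candidate whose target similarity ranks highest
    # within its own final layer feature vector.
    return min(second, key=lambda i: rank_of_target(final_layer[i], target_index))


# Final layer feature vectors of frames 1-4 from the example (texts 1-4).
example = [
    [0.20, 0.30, 0.15, 0.50],
    [0.15, 0.05, 0.65, 0.95],
    [0.10, 0.20, 0.65, 0.30],
    [0.10, 0.20, 0.55, 0.30],
]
chosen = select_second_target_frame(example, target_index=2, preset=3)
```

Here frame 1 is dropped in S701 (its target similarity ranks fourth), frames 2 and 3 tie at 65% in S702, and S703 picks frame 3, so `chosen` is index 2.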
  • to make the voice keyword recognition method provided by the embodiment of the present application clearer, more complete, and easier for those skilled in the art to understand, a more detailed voice keyword recognition method is described below.
  • the method includes:
  • each frame in the first frame sequence included in the corresponding first voice in the method is provided with a unique frame ID, wherein the sequence number of the frame in the first frame sequence is the frame ID of the frame.
  • the first frame sequence includes three frames that are sequentially sorted, frame 1, frame 3, and frame 2, respectively. Then, the sequence number of frame 1 is 1, the frame ID is 1, the sequence number of frame 3 is 2, the frame ID is 2, the sequence number of frame 2 is 3, and the frame ID is 3.
  • each keyword in the keyword sequence included in the voice keyword is set with a unique keyword ID, wherein the sequence number of the keyword in the keyword sequence is the keyword ID of the keyword.
  • for example, the keyword sequence includes four keywords sorted in order, namely keyword 1, keyword 3, keyword 2, and keyword 4. Then the sequence number of keyword 1 is 1 and its keyword ID is 1; the sequence number of keyword 3 is 2 and its keyword ID is 2; the sequence number of keyword 2 is 3 and its keyword ID is 3; the sequence number of keyword 4 is 4 and its keyword ID is 4.
  • S805: Set the counter s to the trigger initial value; n++; return to step S802;
  • the trigger initial value is the threshold involved in the foregoing step S502.
  • the initial value of the trigger is 30.
  • s-- indicates that the counter count is decremented by one.
  • FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.
  • the device includes:
  • a first target frame determining unit 91 configured to select a first target frame from a first frame sequence constituting the first voice
  • the target keyword determining unit 92 is configured to select a keyword from the keyword sequence and determine the target keyword, wherein the keyword sequence belongs to the voice keyword;
  • the matching unit 93 is configured to, if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, determine, one by one, whether the keyword template corresponding to each keyword in the keyword sequence has been matched by the hidden layer feature vector of a frame in the first voice, wherein the keyword template indicates a hidden layer feature vector of a second target frame in the second voice including the target keyword;
  • the identifying unit 94 is configured to determine that the first voice includes the voice keyword if the keyword template corresponding to each keyword in the keyword sequence has, one by one, been successfully matched by the hidden layer feature vector of a frame in the first voice. Further, the voice keyword recognition apparatus provided by the embodiment of the present application further includes a return execution unit, configured to, when the matching fails, return to the step of "selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame".
  • An embodiment of the present invention provides an optional structure of the first target frame determining unit 91.
  • the first target frame determining unit 91 includes:
  • a first determining unit configured to determine, in the first frame sequence constituting the first voice, the first frame that has never been determined as the first target frame;
  • a second determining unit configured to use the determined frame as the first target frame determined from the first frame sequence constituting the first voice.
  • An embodiment of the present invention provides an optional structure of the target keyword determining unit 92.
  • the target keyword determining unit 92 includes:
  • a third determining unit configured to determine, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword whose keyword template was most recently matched successfully;
  • a fourth determining unit configured to determine the next keyword as the target keyword if the number of consecutive times the next keyword has been determined as the target keyword has not reached a preset threshold;
  • a fifth determining unit configured to determine the first keyword in the keyword sequence as the target keyword if the number of consecutive times the next keyword has been determined as the target keyword has reached the threshold.
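The behaviour of the third, fourth and fifth determining units can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the class name `TargetKeywordSelector`, its fields and the `report_match` callback are hypothetical; the default threshold of 30 mirrors the trigger initial value mentioned earlier; and the caller is assumed to stop calling `select` once every keyword has been matched.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetKeywordSelector:
    keyword_sequence: List[str]
    threshold: int = 30       # assumed to equal the trigger initial value
    last_matched: int = -1    # index of the most recently matched keyword, -1 if none
    times_selected: int = 0   # consecutive selections of the current "next" keyword

    def select(self) -> str:
        """Return the target keyword to test against the current frame."""
        if self.times_selected >= self.threshold:
            # The next keyword was tried too many times in a row without a
            # match: abandon the partial match and restart from the first keyword.
            self.last_matched = -1
            self.times_selected = 0
        next_index = self.last_matched + 1
        self.times_selected += 1
        return self.keyword_sequence[next_index]

    def report_match(self) -> None:
        """Record that the keyword returned by the last select() matched."""
        self.last_matched += 1
        self.times_selected = 0

    @property
    def complete(self) -> bool:
        """True once every keyword in the sequence has been matched in order."""
        return self.last_matched == len(self.keyword_sequence) - 1
```

With a threshold of 2, for example, the selector returns the second keyword twice after the first keyword has matched, and then falls back to the first keyword on the third attempt.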
  • the voice keyword recognition apparatus provided by the embodiment of the present application further includes: a keyword template generating unit.
  • An optional structure of the keyword template generating unit provided by the embodiment of the present invention is shown in FIG. 10.
  • the keyword template generating unit includes:
  • a second voice determining unit 101 configured to determine a second voice that includes the target keyword, where the second voice is composed of a second sequence of frames;
  • a final layer feature vector determining unit 102 configured to use the second voice as input information of a preset voice model to determine the final layer feature vector corresponding to each frame in the second frame sequence;
  • a second target frame determining unit 103 configured to determine a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame;
  • a keyword template generating sub-unit 104 configured to use the second target frame as input information of the voice model to obtain the hidden layer feature vector corresponding to the second target frame, and to generate the keyword template corresponding to the target keyword from that hidden layer feature vector.
  • the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in a text set preset in the voice model;
  • the target keyword is a character in the text set;
  • the second target frame determining unit is specifically configured to: select, from the second frame sequence and based on the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame, where the degree of similarity between a frame and the target keyword is determined according to the similarity between the frame and each character in the text set.
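As a rough illustration of this selection step, the sketch below picks the frame whose final layer vector gives the target keyword the highest score. It makes a simplifying assumption that is not stated in the application: the degree of similarity is taken to be the target keyword's own entry in the final layer vector, and the rank-based refinement performed by units 111 to 113 is ignored. The function and parameter names are hypothetical.

```python
def pick_second_target_frame(final_layer_vectors, text_set, target_keyword):
    """Return the index of the frame most similar to the target keyword.

    final_layer_vectors[i][j] is assumed to hold the similarity between
    frame i and character text_set[j].
    """
    if not final_layer_vectors:
        raise ValueError("the second frame sequence is empty")
    column = text_set.index(target_keyword)  # raises ValueError if the keyword is not in the set
    return max(range(len(final_layer_vectors)),
               key=lambda i: final_layer_vectors[i][column])


# Three frames scored against a two-character text set.
vectors = [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]]
best = pick_second_target_frame(vectors, ["a", "b"], "a")
```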
  • An embodiment of the present invention provides an optional structure of the second target frame determining unit, which is shown in FIG. 11.
  • the second target frame determining unit includes:
  • a first candidate frame determining unit 111 configured to determine at least one first candidate frame from the second frame sequence, where the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, the number of the at least one character being less than a preset value;
  • a second candidate frame determining unit 112 configured to determine at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being each first candidate frame, among the at least one first candidate frame, that has the highest similarity to the target keyword;
  • a second target frame determining sub-unit 113 configured to determine a second target frame from the at least one second candidate frame, where, with similarities ordered from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities between the second target frame and each character is higher than the rank, for each second candidate frame other than the second target frame, of the similarity between that second candidate frame and the target keyword among the similarities between that second candidate frame and each character.
  • the embodiments of the present invention disclose a voice keyword recognition method, device, terminal and server. A first target frame is determined from a first frame sequence constituting a first voice, and a target keyword is determined from the keyword sequence included in the voice keyword. When the hidden layer feature vector of the target frame is determined to successfully match the keyword template corresponding to the target keyword (the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword), and if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, the first voice is determined to include the voice keyword. In this way, the voice keyword in the first voice is effectively recognized. Further, this makes it convenient for an electronic device that uses voice wake-up technology to automatically activate the processing module corresponding to the voice keyword when it recognizes that the first voice includes the voice keyword.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both.
  • the software module can be placed in random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Abstract

A speech keyword identification method and device, a terminal and a server, the method comprising: selecting a frame from a first frame sequence which forms a first speech and determining the same to be a first target frame (S201); selecting a keyword from a keyword sequence comprised in speech keywords and determining the same to be a target keyword (S202); determining whether an implicit feature vector of the target frame is matched successfully with a keyword template corresponding to the target keyword (S203); and if determined that the implicit feature vector of the frame in the first speech is matched successfully with the keyword template corresponding to each keyword in the keyword sequence one by one, determining that the first speech comprises the speech keyword therein (S204). The described method effectively carries out identification of the speech keywords in the first speech, and furthermore, facilitates an electronic device which uses speech awakening technology to automatically activate a processing module corresponding to the speech keyword when identifying that the first speech comprises the speech keyword therein.

Description

Voice keyword recognition method, device, terminal and server
This application claims priority to Chinese patent application No. 201710391388.6, filed with the Chinese Patent Office on May 27, 2017 and entitled "Voice keyword recognition method, device, terminal and server", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of voice recognition technology, and in particular to a voice keyword recognition method, device, terminal, and server.
Background
With the development of technology, voice wake-up technology is increasingly widely used in electronic devices. It greatly facilitates the user's operation of an electronic device, allowing the user to activate the corresponding processing module in the device through a voice keyword, without any manual interaction with the device.
For example, the Apple mobile phone uses the keyword "siri" as the voice keyword that activates its voice dialogue intelligent assistant function. When the Apple mobile phone detects that the user has input a voice that includes the keyword "siri", it automatically activates the voice dialogue intelligent assistant function.
In view of this, providing a voice keyword recognition method, device, terminal and server that recognize the voice keyword in a voice is crucial to the development of voice wake-up technology.
Summary of the invention
In view of this, embodiments of the present invention provide a voice keyword recognition method, device, terminal, and server to recognize the voice keyword in a voice.
To achieve the above objective, the embodiments of the present invention provide the following technical solutions:
A voice keyword recognition method, including:
selecting a first target frame from a first frame sequence constituting a first voice;
selecting a keyword from a keyword sequence and determining it as the target keyword, where the keyword sequence belongs to the voice keyword;
if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
A voice keyword recognition device, including:
a first target frame determining unit, configured to select a first target frame from a first frame sequence constituting a first voice;
a target keyword determining unit, configured to select a keyword from a keyword sequence and determine it as the target keyword, where the keyword sequence belongs to the voice keyword;
a matching unit, configured to: if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determine, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
an identifying unit, configured to: if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determine that the first voice includes the voice keyword.
A terminal, including a memory and a processor, where the memory is used to store a program, the processor calls the program, and the program is used for:
selecting a first target frame from a first frame sequence constituting a first voice;
selecting a keyword from a keyword sequence and determining it as the target keyword, where the keyword sequence belongs to the voice keyword;
if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
A voice keyword recognition server, including a memory and a processor, where the memory is used to store a program, the processor calls the program, and the program is used for:
selecting a first target frame from a first frame sequence constituting a first voice;
selecting a keyword from a keyword sequence and determining it as the target keyword, where the keyword sequence belongs to the voice keyword;
if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
A computer readable storage medium including instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
A computer program product including instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
The embodiments of the present invention disclose a voice keyword recognition method, device, terminal and server. A first target frame is determined from a first frame sequence constituting a first voice, and a target keyword is determined from the keyword sequence included in the voice keyword. When the hidden layer feature vector of the target frame is determined to successfully match the keyword template corresponding to the target keyword (the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword), and if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, the first voice is determined to include the voice keyword. In this way, the voice keyword in the first voice is effectively recognized. Further, this makes it convenient for an electronic device that uses voice wake-up technology to automatically activate the processing module corresponding to the voice keyword when it recognizes that the first voice includes the voice keyword.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic structural diagram of a voice keyword recognition server according to an embodiment of the present application;
FIG. 2 is a flowchart of a voice keyword recognition method according to an embodiment of the present application;
FIG. 3 is a flowchart of another voice keyword recognition method according to an embodiment of the present application;
FIG. 4 is a flowchart of a method, according to an embodiment of the present application, for selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame;
FIG. 5 is a flowchart of a method, according to an embodiment of the present application, for selecting a keyword from the keyword sequence included in the voice keyword and determining it as the target keyword;
FIG. 6 is a flowchart of a method, according to an embodiment of the present application, for generating the keyword template corresponding to the target keyword;
FIG. 7 is a flowchart of a method, according to an embodiment of the present application, for selecting from the second frame sequence, based on the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame;
FIG. 8 is a flowchart of another voice keyword recognition method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application;
FIG. 10 is a detailed schematic structural diagram of a keyword template generating unit according to an embodiment of the present application;
FIG. 11 is a detailed schematic structural diagram of a second target frame determining unit according to an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment:
An embodiment of the present application provides a voice keyword recognition method, which is applied to a terminal or a server.
In the embodiment of the present application, optionally, the terminal is an electronic device, for example a mobile terminal or a desktop computer. The above is only an optional form of the terminal provided by the embodiment of the present application; the inventor may set the specific form of the terminal according to their own needs, which is not limited herein.
Optionally, the function of the server to which the voice keyword recognition method provided by the embodiment of the present application is applied (referred to here as the voice keyword recognition server) may be implemented by a single server or by a server cluster composed of multiple servers, which is not limited herein.
Taking a server as an example, FIG. 1 is a schematic structural diagram of a voice keyword recognition server provided by an embodiment of the present application. The voice keyword recognition server includes a processor 11 and a memory 12.
The processor 11, the memory 12, and a communication interface 13 communicate with one another via a communication bus 14.
Optionally, the communication interface 13 may be an interface of a communication module, such as an interface of a Global System for Mobile Communication (GSM) module. The processor 11 is configured to execute a program.
The processor 11 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 12 is used to store the program.
The program may include program code, and the program code includes computer operating instructions. In the embodiment of the present invention, the program may include a program corresponding to the user interface editor described above.
The memory 12 may include high-speed random access memory (RAM), and may also include non-volatile memory (NVM), for example at least one disk memory.
The program can be specifically used for:
selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame;
selecting a keyword from the keyword sequence included in the voice keyword and determining it as the target keyword;
determining whether the hidden layer feature vector of the target frame successfully matches the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
in the case of a successful match, if, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
Correspondingly, the structure of a terminal provided by an embodiment of the present application includes at least the structure of the voice keyword recognition server shown in FIG. 1 above. For the structure of the terminal, refer to the above description of the structure of the voice keyword recognition server; details are not repeated here.
Correspondingly, an embodiment of the present application provides a flowchart of a voice keyword recognition method; see FIG. 2.
As shown in FIG. 2, the method includes:
S201: Select a frame from the first frame sequence constituting the first voice and determine it as the first target frame;
S202: Select a keyword from the keyword sequence included in the voice keyword and determine it as the target keyword;
S203: Determine whether the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword; if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, perform step S204.
Optionally, a voice model is preset. After the second voice that includes the target keyword (the second voice includes a second frame sequence) is input into the voice model, the hidden layer feature vector of the second target frame in the second voice is obtained, and the keyword template corresponding to the target keyword indicates the obtained hidden layer feature vector.
Optionally, the voice model is generated based on a time-recurrent neural network (Long Short-Term Memory, LSTM) and a target criterion (Connectionist Temporal Classification, CTC).
The above is only an optional way of generating the voice model provided by the embodiment of the present application; the inventor may set the specific generation process of the voice model according to their own needs, which is not limited herein.
Optionally, inputting the first voice, which includes the first frame sequence, into the voice model yields the hidden layer feature vector corresponding to the first target frame in the first voice.
Correspondingly, the hidden layer feature vector of the first target frame is matched against the keyword template corresponding to the target keyword to determine whether the match succeeds; if it succeeds, step S204 is performed.
In the embodiment of the present application, optionally, determining whether the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword includes: calculating the cosine distance between the hidden layer feature vector of the first target frame and the keyword template corresponding to the target keyword; if the calculated cosine distance satisfies a preset value, determining that the match succeeds; if the calculated cosine distance does not satisfy the preset value, determining that the match fails.
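The cosine-distance test can be sketched as follows. The application does not say how the cosine distance is defined or what it means for the distance to satisfy the preset value, so this sketch assumes the common definition of one minus the cosine similarity and treats a distance no greater than a threshold as a match. The names `cosine_distance` and `matches_template` and the value 0.2 are illustrative assumptions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); a zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def matches_template(hidden_vector, keyword_template, max_distance=0.2):
    """Assumed matching rule: succeed when the distance is within the preset value."""
    return cosine_distance(hidden_vector, keyword_template) <= max_distance
```

Under these assumptions, two vectors pointing in the same direction match regardless of their lengths, while orthogonal vectors do not.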
S204: If, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determine that the first voice includes the voice keyword.
Optionally, when step S203 determines that the match succeeds, it is judged whether, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a frame in the first voice has already been determined whose hidden layer feature vector successfully matches it; if so, it is determined that the first voice includes the voice keyword.
图3为本申请实施例提供的另一种语音关键词识别方法的流程图。FIG. 3 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.
如图3所示,该方法包括:As shown in FIG. 3, the method includes:
S301. Select a first target frame from the first frame sequence constituting the first voice;
S302. Select one keyword from the keyword sequence included in the voice keyword and determine it as the target keyword;
S303. Determine whether the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword; if the match is successful, perform step S304; if the match is unsuccessful, return to step S301;
S304. Judge whether, for the keyword template corresponding to each keyword in the keyword sequence, one by one, a hidden layer feature vector of a frame in the first voice has been determined to match that template successfully; if yes, perform step S305; if no, return to step S301;
Optionally, this condition is satisfied when: for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been determined to match that template successfully; and the keywords whose templates were matched, arranged in the order in which the matches occurred, form the keyword sequence.
S305. Determine that the first voice includes the voice keyword.
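The flow of steps S301 to S305 can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the application: the function name `contains_keyword` and the `matches` callback are hypothetical, the matching criterion itself is left abstract, and the target keyword is simply assumed to be the next keyword not yet matched.

```python
from typing import Callable, Sequence, TypeVar

V = TypeVar("V")  # type of a hidden layer feature vector / keyword template


def contains_keyword(
    frame_vectors: Sequence[V],
    templates: Sequence[V],
    matches: Callable[[V, V], bool],
) -> bool:
    """Return True if every keyword template is matched, in order, by frames of the voice.

    frame_vectors: hidden layer feature vectors of the first frame sequence.
    templates: keyword templates, in keyword-sequence order.
    matches: hypothetical predicate deciding whether a vector matches a template.
    """
    if not templates:
        raise ValueError("the keyword sequence must contain at least one keyword")
    next_kw = 0  # index of the current target keyword (S302, assumed: next unmatched)
    for vec in frame_vectors:                  # S301: take the next unused frame
        if matches(vec, templates[next_kw]):   # S303: compare with the target template
            next_kw += 1
            if next_kw == len(templates):      # S304: every template matched in order
                return True                    # S305: the voice keyword is present
    return False  # frames exhausted before all templates were matched
```

Because the target keyword only advances after a successful match, the matched keywords necessarily appear in keyword-sequence order, which is the ordering condition stated for step S304.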
To facilitate understanding of the voice keyword recognition method provided by the embodiments of the present application, a flowchart of a method for selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame is provided; see FIG. 4.
As shown in FIG. 4, the method includes:
S401. Determine the first frame in the first frame sequence constituting the first voice that has never been determined as the first target frame;
S402. Use the determined frame as the first target frame determined from the first frame sequence constituting the first voice.
Optionally, the first voice includes a first frame sequence composed of at least one frame arranged in order. Selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame includes: selecting, from the first frame sequence, the frame that has never been used as the first target frame and that is ranked earliest in the first frame sequence.
To facilitate understanding of the voice keyword recognition method provided by the embodiments of the present application, a flowchart of a method for selecting a keyword from the keyword sequence included in the voice keyword and determining it as the target keyword is provided; see FIG. 5.
As shown in FIG. 5, the method includes:
S501. From the keyword sequence included in the voice keyword, determine the next keyword adjacent to the keyword corresponding to the most recently matched keyword template;
Optionally, the keyword sequence is composed of a plurality of keywords arranged in order.
For example, if the keyword sequence included in the voice keyword is “小红你好” and the keyword corresponding to the most recently matched keyword template is “红”, then the next keyword adjacent to that keyword in the keyword sequence is “你”.
S502. Judge whether the number of times the next keyword has been consecutively determined as the target keyword reaches a preset threshold; if that number has not reached the preset threshold, perform step S503; if it has reached the threshold, perform step S504;
Optionally, the preset threshold is 30. This is only one optional value of the threshold provided by the embodiments of the present application; the inventor may set the specific value of the threshold as needed, which is not limited here.
S503. Determine the next keyword as the target keyword;
S504. Determine the first keyword in the keyword sequence as the target keyword.
For example, if the keyword sequence included in the voice keyword is “小红你好”, determining the first keyword in the keyword sequence as the target keyword means determining the first keyword, “小”, as the target keyword.
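The selection rule of steps S501 to S504 can be sketched as a small function. This is only an illustrative sketch: the name `select_target_keyword`, its parameters, and the use of 0-based indices are assumptions made for the example; the default threshold of 30 is the optional value mentioned above.

```python
def select_target_keyword(last_matched: int, consecutive_count: int, threshold: int = 30) -> int:
    """Return the 0-based index of the target keyword in the keyword sequence.

    last_matched: 0-based index of the keyword whose template matched most recently.
    consecutive_count: how many times in a row the next keyword has already been
        determined as the target keyword.
    threshold: preset threshold (30 is the optional value given in the text).
    """
    next_kw = last_matched + 1          # S501: keyword adjacent to the last match
    if consecutive_count < threshold:   # S502: threshold not yet reached
        return next_kw                  # S503: keep trying the next keyword
    return 0                            # S504: fall back to the first keyword


keywords = ["小", "红", "你", "好"]
# After "红" (index 1) matched, the target is "你" until the threshold is reached.
target_before = keywords[select_target_keyword(last_matched=1, consecutive_count=0)]
target_after = keywords[select_target_keyword(last_matched=1, consecutive_count=30)]
```

Here `target_before` is “你” and `target_after` is “小”, mirroring the two examples in the text.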
To facilitate understanding of the voice keyword recognition method provided by the embodiments of the present application, a flowchart of a method for generating the keyword template corresponding to the target keyword is provided; see FIG. 6.
As shown in FIG. 6, the method includes:
S601. Determine a second voice that includes the target keyword, the second voice being composed of a second frame sequence;
Optionally, the process of generating the keyword template corresponding to the target keyword includes: determining a second voice that includes the target keyword, where the second voice is composed of a second frame sequence and the second frame sequence is composed of at least one frame arranged in order.
S602. Use the second voice as the input of a preset voice model, and determine the final layer feature vector corresponding to each frame in the second frame sequence;
Optionally, a voice model is preset. The input of the voice model is a voice (such as the second voice) or a frame, and its output may include a hidden layer feature vector and a final layer feature vector for each input frame.
In the embodiments of the present application, optionally, the second voice is used as the input of the voice model to obtain the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice.
S603. Determine a second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame;
Optionally, one frame is selected from the second voice as the second target frame according to the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice.
S604. Generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, obtained by using the second target frame as the input of the voice model.
Optionally, the process of obtaining the hidden layer feature vector corresponding to the second target frame by using the second target frame as the input of the voice model may be carried out in step S602: the second voice is used as the input of the preset voice model, and both the final layer feature vector and the hidden layer feature vector corresponding to each frame in the second frame sequence are determined. Then, during step S604, the hidden layer feature vector corresponding to the second target frame is obtained directly from the per-frame hidden layer feature vectors produced in step S602.
The above is only an optional manner of the embodiments of the present application. The inventor may choose how to obtain the hidden layer feature vector corresponding to the second target frame as needed, for example by carrying out that process independently of step S602, which is not limited here.
Optionally, there is at least one second voice. Generating the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame includes: determining the hidden layer feature vector corresponding to the second target frame of each second voice, averaging the determined hidden layer feature vectors, and using the result as the keyword template corresponding to the target keyword.
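The averaging step can be illustrated with a short sketch. It assumes, purely for the example, that each hidden layer feature vector is a list of floats of equal length and that the average is an element-wise arithmetic mean; the function name `build_keyword_template` is hypothetical.

```python
from typing import List, Sequence


def build_keyword_template(hidden_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the hidden layer feature vectors of the second target frames.

    hidden_vectors: one vector per second voice, each taken from that voice's
        second target frame. Element-wise mean is an assumption of this sketch.
    """
    if not hidden_vectors:
        raise ValueError("at least one second voice is required")
    dim = len(hidden_vectors[0])
    if any(len(v) != dim for v in hidden_vectors):
        raise ValueError("all hidden layer feature vectors must have the same length")
    count = len(hidden_vectors)
    # Element-wise mean over all second voices.
    return [sum(v[i] for v in hidden_vectors) / count for i in range(dim)]


# Two second voices containing the same target keyword (toy two-dimensional vectors).
template = build_keyword_template([[1.0, 2.0], [3.0, 4.0]])
```

With a single second voice, the template is simply that voice's hidden layer feature vector.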
To facilitate understanding of the voice keyword recognition method provided by the embodiments of the present application, the method for determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame is described in detail below.
In the embodiments of the present application, optionally, the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in the text set preset in the voice model, and the target keyword is one character in the text set.
For example, if the text set consists of 5200 Chinese characters, the final layer feature vector corresponding to a frame includes the similarity between the frame and each of the 5200 Chinese characters.
Determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame includes: according to the final layer feature vector corresponding to each frame, selecting from the second frame sequence the frame most similar to the target keyword as the second target frame, where the degree of similarity between a frame and the target keyword is determined according to the similarities between the frame and each character in the text set.
For ease of understanding, a flowchart of a method for selecting, based on the final layer feature vector corresponding to each frame, the frame most similar to the target keyword from the second frame sequence as the second target frame is provided; see FIG. 7.
As shown in FIG. 7, the method includes:
S701. Determine at least one first candidate frame from the second frame sequence, where the similarity between a first candidate frame and the target keyword is less than the similarity between that first candidate frame and at least one character in the text set, and the number of such characters is less than a preset value;
S702. Determine at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being those first candidate frames whose similarity to the target keyword is the greatest;
S703. Determine the second target frame from the at least one second candidate frame. With similarities ranked from high to low, the rank of the second target frame's similarity to the target keyword among that frame's similarities to all characters is higher than the corresponding rank for every other second candidate frame.
Further, to facilitate understanding of the method shown in FIG. 7 for selecting, based on the final layer feature vector corresponding to each frame, the frame most similar to the target keyword from the second frame sequence as the second target frame, an example is given below:
Suppose the second frame sequence included in the second voice includes four frames, namely frame 1, frame 2, frame 3, and frame 4, and the text set preset in the voice model includes four characters, namely character 1, character 2, character 3, and character 4, where character 3 is the target keyword.
The second voice is input to the voice model, yielding final layer feature vector 1 corresponding to frame 1, final layer feature vector 2 corresponding to frame 2, final layer feature vector 3 corresponding to frame 3, and final layer feature vector 4 corresponding to frame 4.
Final layer feature vector 1 includes similarity 11 between frame 1 and character 1, similarity 12 between frame 1 and character 2, similarity 13 between frame 1 and character 3, and similarity 14 between frame 1 and character 4, where similarity 11 is 20%, similarity 12 is 30%, similarity 13 is 15%, and similarity 14 is 50%;
Final layer feature vector 2 includes similarity 21 between frame 2 and character 1, similarity 22 between frame 2 and character 2, similarity 23 between frame 2 and character 3, and similarity 24 between frame 2 and character 4, where similarity 21 is 15%, similarity 22 is 5%, similarity 23 is 65%, and similarity 24 is 95%;
Final layer feature vector 3 includes similarity 31 between frame 3 and character 1, similarity 32 between frame 3 and character 2, similarity 33 between frame 3 and character 3, and similarity 34 between frame 3 and character 4, where similarity 31 is 10%, similarity 32 is 20%, similarity 33 is 65%, and similarity 34 is 30%;
Final layer feature vector 4 includes similarity 41 between frame 4 and character 1, similarity 42 between frame 4 and character 2, similarity 43 between frame 4 and character 3, and similarity 44 between frame 4 and character 4, where similarity 41 is 10%, similarity 42 is 20%, similarity 43 is 55%, and similarity 44 is 30%.
First, at least one first candidate frame is determined from the second frame sequence. The similarity between a first candidate frame and the target keyword is less than the similarity between that first candidate frame and at least one character in the text set, and the number of such characters is less than the preset value. If the preset value is 3, this means: when the similarities between a frame and each character in the text set are arranged in descending order, the similarity between a first candidate frame and the target keyword is within the top 3 of that sequence (1st, 2nd, or 3rd). Frame 1 is excluded because similarity 13 (15%) ranks 4th among its similarities. In this case, three first candidate frames are determined from the second frame sequence: frame 2, frame 3, and frame 4.
Next, at least one second candidate frame is determined from the at least one first candidate frame: similarity 23 and similarity 33 are equal, both 65%, while similarity 43 is 55%; therefore two second candidate frames are determined, namely frame 2 and frame 3.
Finally, the second target frame is determined from the at least one second candidate frame: similarity 33 ranks 1st among the similarities corresponding to frame 3, whereas similarity 23 ranks 2nd among the similarities corresponding to frame 2 (behind similarity 24, 95%); therefore frame 3, whose target similarity ranks 1st, is selected as the second target frame.
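The worked example can be reproduced with a short sketch of steps S701 to S703. The function names, the dictionary layout, and the use of integer percentages are assumptions made for illustration; the similarity values are those of the example above, with character 3 (0-based index 2) as the target keyword and a preset value of 3.

```python
from typing import Dict, List


def rank_of_target(scores: List[int], target: int) -> int:
    """1-based rank of the target character among one frame's similarities (descending)."""
    return 1 + sum(1 for s in scores if s > scores[target])


def select_second_target_frame(final_vectors: Dict[str, List[int]], target: int, preset: int) -> str:
    """Pick the second target frame from per-frame final layer feature vectors."""
    # S701: keep frames where fewer than `preset` characters beat the target keyword.
    first = [f for f, s in final_vectors.items() if rank_of_target(s, target) - 1 < preset]
    if not first:
        raise ValueError("no first candidate frame satisfies the preset value")
    # S702: among those, keep the frames with the greatest similarity to the target keyword.
    best = max(final_vectors[f][target] for f in first)
    second = [f for f in first if final_vectors[f][target] == best]
    # S703: choose the frame in which the target keyword ranks highest.
    return min(second, key=lambda f: rank_of_target(final_vectors[f], target))


# Similarities (in percent) to characters 1..4, taken from the example.
EXAMPLE = {
    "frame 1": [20, 30, 15, 50],
    "frame 2": [15, 5, 65, 95],
    "frame 3": [10, 20, 65, 30],
    "frame 4": [10, 20, 55, 30],
}
chosen = select_second_target_frame(EXAMPLE, target=2, preset=3)  # "frame 3"
```

The sketch treats a frame in which no character beats the target keyword as a valid first candidate, which is how frame 3 and frame 4 are handled in the example.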
The above detailed description makes the voice keyword recognition method provided by the embodiments of the present application clearer and more complete, and easier for those skilled in the art to understand.
Further, to facilitate understanding of the voice keyword recognition method provided by the foregoing embodiments, the method is described in more detail below; see FIG. 8.
As shown in FIG. 8, the method includes:
Note that each frame in the first frame sequence included in the first voice has a unique frame ID, which is the frame's ordinal position in the first frame sequence. For example, if the first frame sequence includes three frames arranged in the order frame 1, frame 3, frame 2, then frame 1 has position 1 and frame ID 1; frame 3 has position 2 and frame ID 2; frame 2 has position 3 and frame ID 3.
Optionally, each keyword in the keyword sequence included in the voice keyword has a unique keyword ID, which is the keyword's ordinal position in the keyword sequence. For example, if the keyword sequence includes four keywords arranged in the order keyword 1, keyword 3, keyword 2, keyword 4, then keyword 1 has position 1 and keyword ID 1; keyword 3 has position 2 and keyword ID 2; keyword 2 has position 3 and keyword ID 3; keyword 4 has position 4 and keyword ID 4.
S801. Initialize the frame ID n = 0 and the keyword ID m = 1; reset the counter to zero;
S802. i = n++; judge whether the hidden layer feature vector of the i-th frame in the first frame sequence included in the first voice successfully matches the keyword template corresponding to the m-th keyword in the voice keyword; if the match succeeds, perform step S803; if the match fails, perform step S806;
S803. Judge whether the current keyword is the last keyword in the keyword sequence included in the voice keyword; if yes, perform step S804; if no, perform step S805;
S804. Determine that the first voice includes the voice keyword;
S805. Set the count s of the counter to the trigger initial value; n++; return to step S802;
Optionally, the trigger initial value is the threshold referred to in step S502 above. Optionally, the trigger initial value is 30.
The above is only one optional value of the trigger initial value provided by the embodiments of the present application; the inventor may set the specific trigger initial value as needed, which is not limited here.
S806. s--;
Optionally, s-- means that the count of the counter is decremented by one.
S807. Judge whether the count s of the counter is greater than 0; if yes, return to step S802; if no, return to step S801.
The above is only one optional manner of the voice keyword recognition method provided by the embodiments of the present application. The inventor may choose the specific implementation of the voice keyword recognition method as needed, which is not limited here.
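The counter-driven loop of FIG. 8 can be sketched as follows. This is one possible reading rather than the implementation of the application, and it rests on several labeled assumptions: the function name `detect_keyword` and the `matches` callback are hypothetical; indices are 0-based; the sketch advances the keyword index after a non-final match; and when the counter runs out it resets only the keyword index and the counter while continuing from the current frame position, in line with step S504.

```python
from typing import Callable, Sequence, TypeVar

V = TypeVar("V")  # type of a hidden layer feature vector / keyword template


def detect_keyword(
    frame_vectors: Sequence[V],
    templates: Sequence[V],
    matches: Callable[[V, V], bool],
    trigger_init: int = 30,
) -> bool:
    """Return True if the voice keyword is detected in the first voice."""
    if not templates:
        raise ValueError("the keyword sequence must contain at least one keyword")
    m = 0  # 0-based keyword index (the text uses 1-based keyword IDs)
    s = 0  # counter, reset to zero as in S801
    for vec in frame_vectors:              # S802: move on to the next frame
        if matches(vec, templates[m]):
            if m == len(templates) - 1:    # S803: last keyword matched
                return True                # S804
            s = trigger_init               # S805: re-arm the counter
            m += 1                         # assumption: advance to the next keyword
        else:
            s -= 1                         # S806
            if s <= 0:                     # S807: counter exhausted
                m = 0                      # assumption: restart from the first keyword
                s = 0                      # assumption: frame position is kept
    return False
```

With `trigger_init=2`, the frames `["a", "x", "b"]` are accepted for the templates `["a", "b"]` under an equality matcher, whereas `["a", "x", "x", "b"]` is rejected because the counter expires before "b" arrives.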
The above detailed description makes the voice keyword recognition method provided by the embodiments of the present application clearer and more complete, and easier for those skilled in the art to understand.
The method has been described in detail in the embodiments disclosed above. The method of the present invention can be implemented by apparatuses of various forms; therefore the present invention also discloses an apparatus, and specific embodiments are described in detail below.
FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.
As shown in FIG. 9, the apparatus includes:
a first target frame determining unit 91, configured to select a first target frame from the first frame sequence constituting the first voice;
a target keyword determining unit 92, configured to select a keyword from the keyword sequence and determine it as the target keyword, where the keyword sequence belongs to the voice keyword;
a matching unit 93, configured to, if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determine, for the keyword template corresponding to each keyword in the keyword sequence one by one, whether a hidden layer feature vector of a frame in the first voice matches that template, where the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
a recognition unit 94, configured to determine that the first voice includes the voice keyword if, for the keyword template corresponding to each keyword in the keyword sequence one by one, a hidden layer feature vector of a frame in the first voice has been determined to match that template successfully.
Further, the voice keyword recognition apparatus provided by the embodiments of the present application also includes a return execution unit, configured to, when the match fails, return to the step of "selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame".
An embodiment of the present invention provides an optional structure of the first target frame determining unit 91.
Optionally, the first target frame determining unit 91 includes:
a first determining unit, configured to determine, from the first frame sequence constituting the first voice, the first frame that has never been determined as the first target frame;
a second determining unit, configured to use that frame as the first target frame determined from the first frame sequence constituting the first voice.
An embodiment of the present invention provides an optional structure of the target keyword determining unit 92.
Optionally, the target keyword determining unit 92 includes:
a third determining unit, configured to determine, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword corresponding to the most recently matched keyword template;
a fourth determining unit, configured to determine the next keyword as the target keyword if the number of times the next keyword has been consecutively determined as the target keyword has not reached a preset threshold;
a fifth determining unit, configured to determine the first keyword in the keyword sequence as the target keyword if the number of times the next keyword has been consecutively determined as the target keyword reaches the threshold.
Further, the voice keyword recognition apparatus provided by the embodiments of the present application also includes a keyword template generating unit.
For an optional structure of the keyword template generating unit provided by an embodiment of the present invention, see FIG. 10.
As shown in FIG. 10, the keyword template generating unit includes:
a second voice determining unit 101, configured to determine a second voice that includes the target keyword, the second voice being composed of a second frame sequence;
a final layer feature vector determining unit 102, configured to use the second voice as the input of a preset voice model and determine the final layer feature vector corresponding to each frame in the second frame sequence;
a second target frame determining unit 103, configured to determine a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame;
a keyword template generating subunit 104, configured to generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, obtained by using the second target frame as the input of the voice model.
In the embodiments of the present application, optionally, the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in the text set preset in the voice model, and the target keyword is one character in the text set. The second target frame determining unit is specifically configured to: based on the final layer feature vector corresponding to each frame, select from the second frame sequence the frame most similar to the target keyword as the second target frame, where the degree of similarity between a frame and the target keyword is determined according to the similarities between the frame and each character in the text set.
For an optional structure of the second target frame determining unit provided by an embodiment of the present invention, see FIG. 11.
As shown in FIG. 11, the second target frame determining unit includes:
a first candidate frame determining unit 111, configured to determine at least one first candidate frame from the second frame sequence, where the similarity between a first candidate frame and the target keyword is less than the similarity between that first candidate frame and at least one character in the text set, and the number of such characters is less than a preset value;
a second candidate frame determining unit 112, configured to determine at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being those first candidate frames whose similarity to the target keyword is the greatest;
a second target frame determining subunit 113, configured to determine the second target frame from the at least one second candidate frame, where, with similarities ranked from high to low, the rank of the second target frame's similarity to the target keyword among that frame's similarities to all characters is higher than the corresponding rank for every other second candidate frame.
综上:In summary:
本发明实施例公开了一种语音关键词识别方法、装置、终端及服务器,通过从构成第一语音的第一帧序列中确定第一目标帧;从语音关键词包括的关键字序列中确定目标关键字;在确定目标帧的隐层特征向量与目标关键字对应的关键字模板匹配成功时(关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量),若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,确定第一语音中包括语音关键词的方式,有效实现了对第一语音中的语音关键词的识别。进一步的,便于使用语音唤醒技术的电子设备在识别出第一语音中包括语音关键词时,自动激活与所述语音关键词相应的处理模块。The embodiment of the invention discloses a voice keyword recognition method, device, terminal and server, which determine a first target frame from a first frame sequence constituting the first voice; and determine a target from a keyword sequence included in the voice keyword a keyword; when it is determined that the hidden layer feature vector of the target frame is successfully matched with the keyword template corresponding to the target keyword (the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword), If the keyword templates corresponding to each keyword in the keyword sequence are determined one by one, it is determined that the hidden layer feature vector of the frame located in the first voice is successfully matched, and the manner in which the voice keyword is included in the first voice is determined. The recognition of the speech keywords in the first speech is effectively implemented. Further, the electronic device that facilitates using the voice wake-up technology automatically activates a processing module corresponding to the voice keyword when identifying that the voice keyword is included in the first voice.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
A person skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

  1. A voice keyword recognition method, comprising:
    selecting a first target frame from a first frame sequence constituting a first voice;
    selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to the voice keyword;
    if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, wherein the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
    if, for the keyword template corresponding to each keyword in the keyword sequence taken one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
  2. The method according to claim 1, wherein, if the matching fails, the method further comprises:
    returning to the step of selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame.
  3. The method according to claim 2, wherein selecting a first target frame from the first frame sequence constituting the first voice comprises:
    determining, from the first frame sequence constituting the first voice, the first frame that has never been determined as the first target frame;
    taking that frame as the first target frame determined from the first frame sequence constituting the first voice.
  4. The method according to any one of claims 1 to 3, wherein selecting a keyword from the keyword sequence and determining it as the target keyword comprises:
    determining, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword corresponding to the most recently matched keyword template;
    if the number of times the next keyword has been consecutively determined as the target keyword has not reached a preset threshold, determining the next keyword as the target keyword;
    if the number of times the next keyword has been consecutively determined as the target keyword has reached the threshold, determining the first keyword in the keyword sequence as the target keyword.
  5. The method according to any one of claims 1 to 4, wherein the keyword template is generated by:
    determining a second voice that includes the target keyword, the second voice being composed of a second frame sequence;
    using the second voice as input to a preset voice model to determine the final layer feature vector corresponding to each frame in the second frame sequence;
    determining a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame;
    generating the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, obtained by using the second target frame as input to the voice model.
  6. The method according to claim 5, wherein the final layer feature vector corresponding to a frame comprises the similarity between the frame and each character in a preset character set of the voice model, the target keyword being one character in the character set;
    determining the second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame comprises:
    selecting, according to the final layer feature vector corresponding to each frame, the frame in the second frame sequence that is most similar to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined from the similarities between the frame and each character in the character set.
  7. The method according to claim 6, wherein selecting, according to the final layer feature vector corresponding to each frame, the frame in the second frame sequence that is most similar to the target keyword as the second target frame comprises:
    determining at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is lower than the similarity between the first candidate frame and at least one character in the character set, and the number of such characters is less than a preset value;
    determining at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being each first candidate frame whose similarity to the target keyword is the greatest among the at least one first candidate frame;
    determining the second target frame from the at least one second candidate frame, wherein, with similarities ranked from high to low, the rank of the target keyword's similarity among the second target frame's similarities to all characters is higher than, for every other second candidate frame, the rank of the target keyword's similarity among that frame's similarities to all characters.
  8. A voice keyword recognition apparatus, comprising:
    a first target frame determining unit, configured to select a first target frame from a first frame sequence constituting a first voice;
    a target keyword determining unit, configured to select a keyword from a keyword sequence and determine it as a target keyword, wherein the keyword sequence belongs to the voice keyword;
    a matching unit, configured to, if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determine, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, wherein the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
    a recognition unit, configured to determine that the first voice includes the voice keyword if, for the keyword template corresponding to each keyword in the keyword sequence taken one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it.
  9. The apparatus according to claim 8, further comprising a return execution unit, configured to, if the matching fails, return to the step of selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame.
  10. The apparatus according to claim 9, wherein the first target frame determining unit comprises:
    a first determining unit, configured to determine, from the first frame sequence constituting the first voice, the first frame that has never been determined as the first target frame;
    a second determining unit, configured to take that frame as the first target frame determined from the first frame sequence constituting the first voice.
  11. The apparatus according to any one of claims 8 to 10, wherein the target keyword determining unit comprises:
    a third determining unit, configured to determine, from the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword corresponding to the most recently matched keyword template;
    a fourth determining unit, configured to determine the next keyword as the target keyword if the number of times the next keyword has been consecutively determined as the target keyword has not reached a preset threshold;
    a fifth determining unit, configured to determine the first keyword in the keyword sequence as the target keyword if the number of times the next keyword has been consecutively determined as the target keyword has reached the threshold.
  12. The apparatus according to any one of claims 8 to 11, further comprising a keyword template generating unit, the keyword template generating unit comprising:
    a second voice determining unit, configured to determine a second voice that includes the target keyword, the second voice being composed of a second frame sequence;
    a final layer feature vector determining unit, configured to use the second voice as input to a preset voice model to determine the final layer feature vector corresponding to each frame in the second frame sequence;
    a second target frame determining unit, configured to determine a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame;
    a keyword template generating sub-unit, configured to generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, obtained by using the second target frame as input to the voice model.
  13. The apparatus according to claim 12, wherein the final layer feature vector corresponding to a frame comprises the similarity between the frame and each character in a preset character set of the voice model, the target keyword being one character in the character set;
    the second target frame determining unit is specifically configured to select, according to the final layer feature vector corresponding to each frame, the frame in the second frame sequence that is most similar to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined from the similarities between the frame and each character in the character set.
  14. The apparatus according to claim 13, wherein the second target frame determining unit comprises:
    a first candidate frame determining unit, configured to determine at least one first candidate frame from the second frame sequence, wherein the similarity between the first candidate frame and the target keyword is lower than the similarity between the first candidate frame and at least one character in the character set, and the number of such characters is less than a preset value;
    a second candidate frame determining unit, configured to determine at least one second candidate frame from the at least one first candidate frame, the at least one second candidate frame being each first candidate frame whose similarity to the target keyword is the greatest among the at least one first candidate frame;
    a second target frame determining sub-unit, configured to determine the second target frame from the at least one second candidate frame, wherein, with similarities ranked from high to low, the rank of the target keyword's similarity among the second target frame's similarities to all characters is higher than, for every other second candidate frame, the rank of the target keyword's similarity among that frame's similarities to all characters.
  15. A terminal, comprising a memory and a processor, the memory being configured to store a program and the processor being configured to invoke the program, the program being used for:
    selecting a first target frame from a first frame sequence constituting a first voice;
    selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to the voice keyword;
    if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, wherein the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
    if, for the keyword template corresponding to each keyword in the keyword sequence taken one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
  16. A voice keyword recognition server, comprising a memory and a processor, the memory being configured to store a program and the processor being configured to invoke the program, the program being used for:
    selecting a first target frame from a first frame sequence constituting a first voice;
    selecting a keyword from a keyword sequence and determining it as a target keyword, wherein the keyword sequence belongs to the voice keyword;
    if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determining, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether the hidden layer feature vector of a frame in the first voice matches it, wherein the keyword template indicates the hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
    if, for the keyword template corresponding to each keyword in the keyword sequence taken one by one, a frame in the first voice has been determined whose hidden layer feature vector successfully matches it, determining that the first voice includes the voice keyword.
  17. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
  18. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
PCT/CN2018/079769 2017-05-27 2018-03-21 Speech keyword identification method and device, terminal and server WO2018219023A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710391388.6A CN107230475B (en) 2017-05-27 2017-05-27 Voice keyword recognition method and device, terminal and server
CN201710391388.6 2017-05-27

Publications (1)

Publication Number Publication Date
WO2018219023A1 (en) 2018-12-06

Family

ID=59934556

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/079769 WO2018219023A1 (en) 2017-05-27 2018-03-21 Speech keyword identification method and device, terminal and server

Country Status (3)

Country Link
CN (3) CN110349572B (en)
TW (1) TWI690919B (en)
WO (1) WO2018219023A1 (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349572B (en) * 2017-05-27 2021-10-22 腾讯科技(深圳)有限公司 Voice keyword recognition method and device, terminal and server
CN107564517A (en) 2017-07-05 2018-01-09 百度在线网络技术(北京)有限公司 Voice awakening method, equipment and system, cloud server and computer-readable recording medium
CN110444193B (en) * 2018-01-31 2021-12-14 腾讯科技(深圳)有限公司 Method and device for recognizing voice keywords
CN108564941B (en) * 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
CN108492827B (en) * 2018-04-02 2019-07-30 百度在线网络技术(北京)有限公司 Wake-up processing method, device and the storage medium of application program
CN108665900B (en) * 2018-04-23 2020-03-03 百度在线网络技术(北京)有限公司 Cloud wake-up method and system, terminal and computer readable storage medium
CN108615526B (en) 2018-05-08 2020-07-07 腾讯科技(深圳)有限公司 Method, device, terminal and storage medium for detecting keywords in voice signal
CN109192224B (en) * 2018-09-14 2021-08-17 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium
CN109215632B (en) * 2018-09-30 2021-10-08 科大讯飞股份有限公司 Voice evaluation method, device and equipment and readable storage medium
CN110503970B (en) * 2018-11-23 2021-11-23 腾讯科技(深圳)有限公司 Audio data processing method and device and storage medium
CN110322871A (en) * 2019-05-30 2019-10-11 清华大学 A kind of sample keyword retrieval method based on acoustics characterization vector
CN110648668A (en) * 2019-09-24 2020-01-03 上海依图信息技术有限公司 Keyword detection device and method
CN110706703A (en) * 2019-10-16 2020-01-17 珠海格力电器股份有限公司 Voice wake-up method, device, medium and equipment
CN110827806B (en) * 2019-10-17 2022-01-28 清华大学深圳国际研究生院 Voice keyword detection method and system
CN112837680A (en) * 2019-11-25 2021-05-25 马上消费金融股份有限公司 Audio keyword retrieval method, intelligent outbound method and related device
CN111292753A (en) * 2020-02-28 2020-06-16 广州国音智能科技有限公司 Offline voice recognition method, device and equipment
CN111128138A (en) * 2020-03-30 2020-05-08 深圳市友杰智新科技有限公司 Voice wake-up method and device, computer equipment and storage medium
CN111723204B (en) * 2020-06-15 2021-04-02 龙马智芯(珠海横琴)科技有限公司 Method and device for correcting voice quality inspection area, correction equipment and storage medium
CN111798840B (en) * 2020-07-16 2023-08-08 中移在线服务有限公司 Voice keyword recognition method and device
CN112259101B (en) * 2020-10-19 2022-09-23 腾讯科技(深圳)有限公司 Voice keyword recognition method and device, computer equipment and storage medium
CN112259077B (en) * 2020-10-20 2024-04-09 网易(杭州)网络有限公司 Speech recognition method, device, terminal and storage medium
CN116523970B (en) * 2023-07-05 2023-10-20 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593519A (en) * 2008-05-29 2009-12-02 夏普株式会社 Detect method and apparatus and the search method and the system of voice keyword
CN104766608A (en) * 2014-01-07 2015-07-08 深圳市中兴微电子技术有限公司 Voice control method and voice control device
CN105117384A (en) * 2015-08-19 2015-12-02 小米科技有限责任公司 Classifier training method, and type identification method and apparatus
CN105740686A (en) * 2016-01-28 2016-07-06 百度在线网络技术(北京)有限公司 Application control method and device
CN107230475A (en) * 2017-05-27 2017-10-03 腾讯科技(深圳)有限公司 A kind of voice keyword recognition method, device, terminal and server

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4224250B2 (en) * 2002-04-17 2009-02-12 パイオニア株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
CN101188110B (en) * 2006-11-17 2011-01-26 陈健全 Method for improving text and voice matching efficiency
CN102053993B (en) * 2009-11-10 2014-04-09 阿里巴巴集团控股有限公司 Text filtering method and text filtering system
CN102081638A (en) * 2010-01-29 2011-06-01 蓝盾信息安全技术股份有限公司 Method and device for matching keywords
CN102915729B (en) * 2011-08-01 2014-11-26 佳能株式会社 Speech keyword spotting system and system and method of creating dictionary for the speech keyword spotting system
JP5810946B2 (en) * 2012-01-31 2015-11-11 富士通株式会社 Specific call detection device, specific call detection method, and computer program for specific call detection
KR101493006B1 (en) * 2013-03-21 2015-02-13 디노플러스 (주) Apparatus for editing of multimedia contents and method thereof
US20140337030A1 (en) * 2013-05-07 2014-11-13 Qualcomm Incorporated Adaptive audio frame processing for keyword detection
US9786296B2 (en) * 2013-07-08 2017-10-10 Qualcomm Incorporated Method and apparatus for assigning keyword model to voice operated function
CN104143328B (en) * 2013-08-15 2015-11-25 腾讯科技(深圳)有限公司 A kind of keyword spotting method and apparatus
CN104143329B (en) * 2013-08-19 2015-10-21 腾讯科技(深圳)有限公司 Carry out method and the device of voice keyword retrieval
CN103577548B (en) * 2013-10-12 2017-02-08 优视科技有限公司 Method and device for matching characters with close pronunciation
US10032449B2 (en) * 2014-09-03 2018-07-24 Mediatek Inc. Keyword spotting system for achieving low-latency keyword recognition by using multiple dynamic programming tables reset at different frames of acoustic data input and related keyword spotting method
US10045140B2 (en) * 2015-01-07 2018-08-07 Knowles Electronics, Llc Utilizing digital microphones for low power keyword detection and noise suppression
US20160284349A1 (en) * 2015-03-26 2016-09-29 Binuraj Ravindran Method and system of environment sensitive automatic speech recognition
US9990917B2 (en) * 2015-04-13 2018-06-05 Intel Corporation Method and system of random access compression of transducer data for automatic speech recognition decoding
CN106161755A (en) * 2015-04-20 2016-11-23 钰太芯微电子科技(上海)有限公司 A kind of key word voice wakes up system and awakening method and mobile terminal up
CN106297776B (en) * 2015-05-22 2019-07-09 中国科学院声学研究所 A kind of voice keyword retrieval method based on audio template
US20170061959A1 (en) * 2015-09-01 2017-03-02 Disney Enterprises, Inc. Systems and Methods For Detecting Keywords in Multi-Speaker Environments
TWI639153B (en) * 2015-11-03 2018-10-21 絡達科技股份有限公司 Electronic apparatus and voice trigger method therefor
CN105575386B (en) * 2015-12-18 2019-07-30 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
US9805714B2 (en) * 2016-03-22 2017-10-31 Asustek Computer Inc. Directional keyword verification method applicable to electronic device and electronic device using the same
CN105930413A (en) * 2016-04-18 2016-09-07 北京百度网讯科技有限公司 Training method for similarity model parameters, search processing method and corresponding apparatuses

Also Published As

Publication number Publication date
CN107230475A (en) 2017-10-03
CN110349572B (en) 2021-10-22
CN110349572A (en) 2019-10-18
CN110444199A (en) 2019-11-12
TWI690919B (en) 2020-04-11
CN110444199B (en) 2022-01-07
CN107230475B (en) 2022-04-05
TW201832221A (en) 2018-09-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18809091; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18809091; Country of ref document: EP; Kind code of ref document: A1)