TW201928740A - Keyword confirmation method and apparatus - Google Patents

Keyword confirmation method and apparatus

Info

Publication number
TW201928740A
Authority
TW
Taiwan
Prior art keywords
keyword
probability
audio
audio data
mute
Prior art date
Application number
TW107135162A
Other languages
Chinese (zh)
Inventor
劉勇
姚海濤
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW201928740A


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L2015/088 — Word spotting
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/93 — Discriminating between voiced and unvoiced parts of speech signals

Abstract

A keyword confirmation method and apparatus are provided. The method includes: obtaining first audio data that has been recognized as a keyword; obtaining, for segments of the first audio data and/or of second audio data temporally contiguous with the first audio data, the pronunciation-similarity probability of the most similar pronunciation unit; obtaining the silence probability of segments of the second audio data; evaluating, from the silence probability together with the pronunciation-similarity probabilities, whether the second audio data is silence, for example by determining that multiple consecutive silence segments exist in it; and, when the second audio data is judged to be silence, confirming the first audio data as a valid keyword.

Description

Keyword confirmation method and apparatus

The present application relates to the field of computers, and in particular to a keyword confirmation method and apparatus.

In the field of human-computer interaction, many electronic devices, such as in-vehicle terminals, use keywords to wake the device, enable interactive functions, or execute a particular function. For example, when a user speaks a specific keyword such as "start the system" or "zoom in on the map", the electronic device detects the keyword and executes the corresponding function of starting the system or zooming in on the map. Voice control of this kind greatly improves user convenience.

However, how to identify the keywords uttered by the user without misjudgment, for example recognizing ordinary conversation that is not a keyword as a keyword, or wrongly identifying a keyword as a non-keyword and refusing to act, remains a problem to be solved in this field.

To solve this problem, one prior-art solution compares the speech uttered by the user with the keywords in a keyword database. If a match is found, the user's voice input is regarded as a keyword and the electronic device executes the corresponding instruction; if no match is found, it is not a keyword and the electronic device executes no instruction.

This solution, however, easily misidentifies non-keyword content in the user's conversation as a keyword. For example, utterances such as "the system does not need to be started yet" or "I just want to know whether the map-zoom function works" contain the keywords "start the system" or "zoom in on the map", so the electronic device recognizes them as keywords and executes the instruction incorrectly.

In view of the above problems, embodiments of the present invention provide a keyword confirmation method and apparatus to solve the problems existing in the prior art.

To solve the above problems, one embodiment of the present application discloses a keyword confirmation method, including: obtaining first audio data, the first audio data being recognized as a keyword; determining that multiple consecutive silence segments exist in second audio data temporally contiguous with the first audio data; and confirming the first audio data as a valid keyword.

A second embodiment of the present application discloses a keyword confirmation method, including: obtaining first audio data, the first audio data being recognized as a keyword; determining a cumulative silence probability of multiple segments of second audio data temporally contiguous with the first audio data; determining a cumulative keyword probability of multiple segments of the first audio data; and, when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition, confirming the first audio data as a valid keyword.

Another embodiment discloses a keyword confirmation method for an in-vehicle terminal, including: obtaining first audio data through a vehicle-mounted audio acquisition device, the first audio data being recognized as a keyword; determining that multiple consecutive silence segments exist in second audio data temporally contiguous with the first audio data; and confirming the first audio data as a valid keyword, the valid keyword being used to wake the in-vehicle terminal to execute the instruction corresponding to the keyword.

Another embodiment discloses a keyword confirmation method for an in-vehicle terminal, including: obtaining first audio data through a vehicle-mounted audio acquisition device, the first audio data being recognized as a keyword; determining a cumulative silence probability of multiple segments of second audio data temporally contiguous with the first audio data; determining a cumulative keyword probability of multiple segments of the first audio data; and, when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition, confirming the first audio data as a valid keyword, the valid keyword being used to wake the in-vehicle terminal to execute the instruction corresponding to the keyword.

One embodiment discloses a keyword confirmation apparatus, including: an audio data acquisition module for obtaining first audio data, the first audio data being recognized as a keyword; a silence segment determination module for determining that multiple consecutive silence segments exist in second audio data temporally contiguous with the first audio data; and a valid keyword determination module for confirming the first audio data as a valid keyword.

Another embodiment discloses a keyword confirmation apparatus, including: an audio data acquisition module for obtaining first audio data, the first audio data being recognized as a keyword; a cumulative silence probability determination module for determining a cumulative silence probability of multiple segments of second audio data temporally contiguous with the first audio data; a cumulative keyword probability determination module for determining a cumulative keyword probability of multiple segments of the first audio data; and a valid keyword determination module for confirming the first audio data as a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition.

An embodiment of the present application further discloses a terminal device, including: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the terminal device to perform the above methods. An embodiment of the present application further discloses one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause a terminal device to perform the above methods.

As can be seen from the above, the keyword confirmation methods of the embodiments of the present application have at least the following advantages: they exploit the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, they check whether the keyword is a valid keyword, which improves detection accuracy and avoids recognition errors. At the same time, detecting silence at the level of individual audio segments improves the accuracy of the silence judgment and further avoids misjudging keywords as non-keywords.
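As a concrete sketch of the first method summarized above: assuming per-segment silence probabilities are already available from a sound-unit matching model, the consecutive-silence check on the second audio data might look as follows. The 0.9 threshold, the run length of 3, and all function names are illustrative assumptions, not the patent's prescribed implementation.

```python
def is_silence_segment(silence_prob, threshold=0.9):
    # A segment is flagged silent when its similarity to the silence
    # unit meets the requirement (0.9 is an assumed example value).
    return silence_prob >= threshold

def has_consecutive_silence(silence_probs, min_run=3, threshold=0.9):
    """Return True if at least `min_run` consecutive segments of the
    second audio data are flagged as silence."""
    run = 0
    for p in silence_probs:
        run = run + 1 if is_silence_segment(p, threshold) else 0
        if run >= min_run:
            return True
    return False

def confirm_keyword(second_audio_silence_probs):
    # The keyword is confirmed as valid only if the neighbouring
    # second audio data is judged to be silence.
    return has_consecutive_silence(second_audio_silence_probs)
```

For example, silence probabilities of [0.95, 0.92, 0.97, 0.3] contain a run of three silent segments, so the keyword would be confirmed, while [0.95, 0.3, 0.97, 0.92] would not.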

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. The described embodiments are only a part of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application fall within the protection scope of this application.

One of the core ideas of this application is a keyword confirmation method that uses the silence before and after a keyword to judge whether the keyword is indeed a valid keyword. In the detection of silence itself, consecutive silence segments of the audio data, or the cumulative silence probability of multiple segments, are used for the judgment, which improves its accuracy.

First Embodiment

The first embodiment of the present invention provides a keyword confirmation method. FIG. 1 is a schematic diagram of a normal keyword and the audio data before and after it. As shown in FIG. 1, users generally pause before or after uttering a keyword, producing silence, so silence can be captured before and after a normal keyword. This can be used to judge whether the speech uttered by the user is a keyword. FIG. 1 shows three possible situations for a normal keyword: silence before the keyword, silence after the keyword, and silence both before and after the keyword. Below, the audio data corresponding to the keyword is called the first audio data, and the audio data corresponding to the silent portions before and after the keyword is called the second audio data.

FIG. 2 is a flowchart of the steps of the keyword confirmation method of the first embodiment. As shown in FIG. 2, the method may include the following steps.

S101: Obtain first audio data, the first audio data being recognized as a keyword.

In this step, the executing entity, for example an electronic device such as an in-vehicle terminal, mobile phone, or tablet computer, obtains audio data comprising at least the first audio data and preceding or following second audio data. The second audio data is temporally contiguous with the first audio data. The first audio data detected at this point has already been recognized as a keyword, that is, it has been confirmed to match a pre-stored keyword. As shown in FIG. 1, the electronic device can obtain and examine the first audio data corresponding to the keyword as well as the second audio data before, after, or before and after the keyword. In practice, the device's sound-collection component, for example a microphone, captures audio continuously; audio data is obtained in units of "frames", one frame being, for example, 10 ms. After the first audio data is detected as a keyword, several frames of second audio data before or after it, for example 10 frames, are obtained for subsequent analysis. In one embodiment, it must be further judged whether the first audio data is a "valid keyword"; only after it is confirmed as valid can the corresponding instruction be executed.

S102: Determine that multiple consecutive silence segments exist in the second audio data temporally contiguous with the first audio data.

In this step, after a segment of the second audio data is fed into the sound-unit matching model of the electronic device, its similarity to the silence unit in the sound-unit library is obtained and used as the silence probability of the segment. For example, if the computed similarity between a segment of the second audio data and the silence unit is 90%, then 90% is taken as the silence probability of that segment; when this probability satisfies a certain requirement, the segment is regarded as a silence segment. In one embodiment, multiple segments can be fed into the sound-unit matching model, the silence probability of each segment obtained, and each segment judged to be a silence segment or not accordingly. After the individual segments have been judged, it can be determined whether the second audio data contains multiple consecutive silence segments. For example, each segment carries a flag f indicating whether it is a silence segment; when three temporally consecutive segments all carry the silence flag f, the second audio data is deemed to contain multiple consecutive silence segments.

S103: Confirm the first audio data as a valid keyword.

In this step, when the second audio data contains multiple consecutive silence segments, the second audio data is judged to be silence; the keyword can thereby be confirmed as a valid keyword, and the corresponding instruction can subsequently be executed. For example, when the second audio data includes multiple (for example three or more) consecutive silence segments, the second audio data is considered silence, and the first audio data is accordingly judged to be a valid keyword.

It is worth noting that the keywords described above and below may include various content, for example a wake word for the operating system of the electronic device, a voice command from the user, or key parameters within a command. During voice operation of the electronic device, inputs such as "start the system", "tune to 87.6", or "87.6" all fall within the scope of "keyword" as used in the embodiments of the present invention, which is not particularly limited in this respect.

As can be seen from the above, the keyword confirmation method of the first embodiment has at least the following technical effects: it exploits the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, it checks whether the keyword is valid, which improves detection accuracy and avoids recognition errors. At the same time, segment-level silence detection improves the accuracy of the silence judgment and further avoids misjudging keywords as non-keywords.

Second Embodiment
The second embodiment of the present invention provides a keyword confirmation method. FIG. 3 is a flowchart of the steps of this method. As shown in FIG. 3, the method comprises the following steps.

S201: Obtain first audio data, the first audio data being recognized as a keyword.

S202: Determine that multiple consecutive silence segments exist in the second audio data temporally contiguous with the first audio data.

S203: Confirm the first audio data as a valid keyword.

Steps S201 to S203 are the same as or similar to steps S101 to S103 of the previous embodiment and are not described again here; this embodiment focuses on the differences from the previous embodiment.

In one embodiment of the keyword confirmation method of the present invention, step S202, that is, determining that multiple consecutive silence segments exist in the second audio data temporally contiguous with the first audio data, may include the following sub-steps.

S2021: Determine the pronunciation-similarity probability of the segment, the pronunciation-similarity probability being the maximum similarity probability between the segment and multiple pronunciation units.

In this step, a segment of the second audio data may be an audio frame as described above, or a segment in some other unit; this is not limited here. Any segment obtained by dividing the audio data according to a particular rule, for example by time or storage format, falls within the protection scope of the present invention. For example, a segment may be an audio frame of 10 ms or 20 ms, or an audio passage of 1 s. A pronunciation unit may be a phoneme, syllable, character, or word obtained by dividing the user's pronunciation. For example, when the user utters "斑馬" ("zebra"): if pronunciation units are divided by phonemes (a phoneme being a smaller unit than a syllable), the phonemes corresponding to "斑馬" in an existing phoneme set are "b a_h nn_h m a_l a_l"; if divided by syllables, the corresponding syllables are "b an m a"; if divided by the pronunciation of characters, the corresponding division is "ban ma"; if divided by words, the corresponding division is "banma". For each division scheme, a corresponding sound-unit library can be built; besides the pronunciation units described above, the library may also include a silence unit and the like. When a segment of the second audio data has the highest similarity probability with one of the pronunciation units pre-stored in the sound-unit library, the segment is considered to match that pronunciation unit; that pronunciation unit is taken as the similar pronunciation unit, and the similarity between the segment and the similar pronunciation unit is taken as the pronunciation-similarity probability. For example, a segment of the second audio data is fed into the sound-unit matching model of the electronic device; if its similarity to the fifth pronunciation unit in the sound-unit library is the highest, at 80%, and its similarity to the sixth pronunciation unit is next, at 70%, then the fifth pronunciation unit (similarity 80%) is recorded as the similar pronunciation unit of the segment, and the pronunciation-similarity probability of 80% is recorded for subsequent processing.

S2022: Determine the silence probability of the segment, the silence probability being the similarity probability between the segment and the silence unit.

In this step, after the segment of the second audio data is fed into the sound-unit matching model of the electronic device, its similarity to the silence unit in the sound-unit library is obtained as the silence probability of the segment. For example, if the computed similarity between a segment of the second audio data and the silence unit is 90%, then 90% is taken as the silence probability of that segment. It is worth noting that the silence unit may be pre-stored in the sound-unit library and may be obtained by iteratively training a model on a large amount of data, taking into account, for example, sound energy and environmental noise (including wind, music, car horns, and the like); it is not limited to absolute silence. The length and attributes of the silence unit may correspond to those of the pronunciation units: when pronunciation units are divided by phonemes, the silence unit may be a silence phoneme; when divided by syllables, a silence syllable; this is not limited here.

S2023: When the relationship between the pronunciation-similarity probability and the silence probability satisfies a preset condition, judge the segment to be a silence segment.

The preset condition includes, for example: the absolute value of the difference between the pronunciation-similarity probability of the segment and its silence probability is smaller than a first threshold. In this step, the previously obtained pronunciation-similarity probability and silence probability of the segment of the second audio data can be used to judge whether that segment is silence. As can be seen from the above, in the solution of this embodiment, the silence judgment does not compare a segment of the audio data against absolute silence; instead it compares the pronunciation-similarity probability with the corresponding silence probability, thereby taking factors such as environmental noise into account. The solution provided by the present invention can therefore avoid rejecting correct keywords because of an inaccurate silence judgment. There are many ways to use the pronunciation-similarity probability and the silence probability to judge whether a segment of the audio data is silence; one example is introduced here: if a segment satisfies the requirement that "the absolute value of the difference between the pronunciation-similarity probability pmax(indexframe) and the silence probability psil(indexframe) is smaller than 15%", the segment is deemed a silence segment.

In one embodiment of the keyword confirmation method of the present invention, in sub-step S2022 the silence probability may instead be judged using the similarity probability between the pronunciation unit corresponding to the maximum similarity probability and the silence unit. That is, sub-step S2022 may be replaced by the following sub-step.

S2024: Determine the silence probability of the segment, the silence probability being the similarity probability between the pronunciation unit corresponding to the maximum similarity probability and the silence unit.
The pronunciation-similarity probability of the segment was already determined in sub-step S2021. In the earlier example, a segment of the second audio data was judged by the sound-unit matching model of the electronic device to have its highest similarity, 80%, with the fifth pronunciation unit in the sound-unit library; the fifth pronunciation unit corresponding to the maximum similarity probability of 80% is therefore the similar pronunciation unit. In this sub-step, the similarity probability between the fifth pronunciation unit and the silence unit can be computed and used as the silence probability of the segment. Based on the approaches listed above and the skill of those in the art, many judgment schemes can be devised using these pronunciation-similarity probabilities and silence probabilities to judge whether a segment of the second audio data is silence; the present invention is not particularly limited in this respect.

After step S2023 or step S2024, that is, after a segment has been judged to be a silence segment, step S202, determining that multiple consecutive silence segments exist in the second audio data temporally contiguous with the first audio data, may further include the following sub-step.

S2025: Based on the judged silence segments, determine that multiple consecutive silence segments exist in the second audio data.

In this step, it can be judged whether the second audio data includes multiple consecutive silence segments. For example, for the multiple segments of the second audio data, a flag f marking a silence segment is set for each segment in sub-step S2023 or S2024; when three temporally consecutive segments all carry the silence flag f, the second audio data is deemed to contain multiple consecutive silence segments. In one embodiment of the keyword confirmation method of the present invention, "multiple" in sub-step S2025 may mean three or more; that is, sub-step S2025 may be: determine that three or more consecutive silence segments exist in the second audio data.

In one embodiment of the keyword confirmation method of the present invention, before step S201, the step of obtaining the audio data, the method may further include the following step.

S200: Detect whether the collected audio data includes a keyword.

In this step, multiple keywords may be pre-stored in the keyword library of the electronic device, such as "你好斑馬" ("hello zebra"), "start the system", "zoom in on the map", "zoom out the map", and "exit navigation". The keyword in the first audio data may be any of these. Using the keyword library, the probability that the input first audio data is similar to each of these keywords can be computed, and the word with the highest probability that also exceeds a set threshold is selected as the detected keyword. Specifically, the sound-unit matching approach of the present invention can be used: the audio data is divided into multiple segments, and when a segment has the highest similarity probability with one of the pronunciation units pre-stored in the sound-unit library, the segment is considered to match that pronunciation unit; that unit is taken as the similar pronunciation unit, and the similarity between the segment and the similar pronunciation unit is taken as the pronunciation-similarity probability. For a stretch of speech such as the first audio data, the pronunciation-similarity probabilities of multiple segments are processed, for example multiplied, to obtain the path with the highest probability, and the word corresponding to that path is taken as the matched keyword.

In one embodiment of the keyword confirmation method of the present invention, the keyword has attribute information, and step S203, confirming the first audio data as a valid keyword, may include: when the attribute information of the keyword indicates a main keyword and the second audio data before the keyword is silence, confirming the keyword as a valid main keyword. In the embodiments of the present invention, each keyword may correspond to attribute information recording whether the keyword is a main keyword or a secondary keyword. The keywords pre-stored in the keyword library of the electronic device may be classified into main keywords and secondary keywords: for example, "你好斑馬" and "start the system" may be set as main keywords, while "zoom in on the map", "zoom out the map", and "exit navigation" may be set as secondary keywords. For a main keyword, considering that it may be followed by nothing, or directly by further speech to be recognized, such as "你好斑馬, please look up the route to Zhongguancun for me", the condition can be set that the audio data before the keyword is silence and the attribute information of the keyword indicates a main keyword; the keyword is then confirmed as a valid main keyword, without checking whether silence follows the keyword.

In one embodiment of the keyword confirmation method of the present invention, the keyword has attribute information, and step S203, confirming the first audio data as a valid keyword, may include: when the attribute information of the keyword indicates a secondary keyword and the second audio data both before and after the keyword is silence, confirming the keyword as a valid secondary keyword. In this step, a secondary keyword may be a command that the user wants the electronic device to execute directly, such as "zoom in on the map". The condition can be set that the content both before and after the keyword is silence and the attribute of the keyword is secondary before the keyword is confirmed as a valid secondary keyword. When the user says something like "I just want to try whether map zooming works", "I wonder whether the map can be zoomed in", or "zooming in on the map will do", the keyword may be detected, but the condition of silence before and after is not satisfied, so it is still not judged to be a valid keyword.

In summary, the keyword confirmation method of this embodiment has at least the following advantages: it exploits the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, it checks whether the keyword is valid, which improves detection accuracy and avoids recognition errors; and segment-level silence detection improves the accuracy of the silence judgment, further avoiding misjudging keywords as non-keywords.
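Sub-steps S2021 to S2023 and the attribute-based confirmation above can be sketched as follows. The 15% threshold comes from the example in the text, while the data shapes and function names are illustrative assumptions.

```python
def pronunciation_similarity(unit_scores):
    """S2021: the pronunciation-similarity probability is the maximum
    similarity between the segment and the pronunciation units."""
    return max(unit_scores)

def is_silence_segment_by_difference(unit_scores, silence_score,
                                     first_threshold=0.15):
    """S2022/S2023: a segment is a silence segment when
    |p_max - p_sil| is below the first threshold (15% in the text's
    example), rather than when the segment matches absolute silence."""
    p_max = pronunciation_similarity(unit_scores)
    return abs(p_max - silence_score) < first_threshold

def confirm_with_attributes(attribute, silence_before, silence_after):
    """S203 with attribute information: a main keyword only needs
    silence before it; a secondary keyword needs silence on both
    sides."""
    if attribute == "main":
        return silence_before
    if attribute == "secondary":
        return silence_before and silence_after
    return False
```

For example, a segment whose best pronunciation-unit score is 0.80 and whose silence score is 0.90 has |0.80 - 0.90| = 0.10 < 0.15, so it counts as a silence segment even though neither probability alone is decisive.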
In addition, the keyword confirmation method of this embodiment has at least the following further advantage: in an optional embodiment, a preferred judgment scheme is proposed, in which the ratio of the cumulative silence probability of the second audio data to the cumulative keyword probability of the first audio data, and/or the difference between the pronunciation-similarity probability and the silence probability of the second audio data, is used to judge whether the second audio data is silence, making the judgment more accurate. Furthermore, different further confirmation schemes are set for the different keyword types, main keywords and secondary keywords, making the judgment more reliable.

Third Embodiment

The third embodiment of the present invention provides a keyword confirmation method. FIG. 5 is a flowchart of the steps of this method. As shown in FIG. 5, the method comprises the following steps.

S301: Obtain first audio data, the first audio data being recognized as a keyword.

In this step, the executing entity, for example an electronic device such as an in-vehicle terminal, mobile phone, or tablet computer, obtains audio data comprising at least the first audio data and preceding or following second audio data. The second audio data is temporally contiguous with the first audio data, and the first audio data detected at this point has already been recognized as a keyword, that is, confirmed to match a pre-stored keyword. The electronic device can obtain and examine the first audio data corresponding to the keyword as well as the second audio data before, after, or before and after the keyword. In practice, the device's sound-collection component, for example a microphone, captures audio continuously; audio data is obtained in units of "frames", one frame being, for example, 10 ms. After the first audio data is detected as a keyword, several frames of second audio data before or after it, for example 10 frames, are obtained for subsequent analysis.

S302: Determine the cumulative silence probability of multiple segments of the second audio data temporally contiguous with the first audio data.

The cumulative silence probability p(sil) of the multiple segments of the second audio data can be obtained as the product or the sum of the silence probability of each segment of the second audio data. The silence probability of each segment can be computed in the manner provided in the first and second embodiments; in step S302, these silence probabilities are added or multiplied to obtain the cumulative silence probability.

S303: Determine the cumulative keyword probability of multiple segments of the first audio data.

In this step, the cumulative keyword probability p(kws) of the first audio data may be the product of the pronunciation-similarity probabilities of the multiple segments of the first audio data. For example, for the first and second segments of the first audio data, both segments are fed into the sound-unit matching model of the electronic device. If the first segment's similarity to the first pronunciation unit in the sound-unit library is the highest, at 90%, and its similarity to the second pronunciation unit is next, at 80%, then the first pronunciation unit is taken as the similar pronunciation unit of the first segment and 90% as its pronunciation-similarity probability. If the pronunciation-similarity probability of the second segment is, for example, 70%, the cumulative keyword probability of the first audio data is 90% × 70%. It is worth noting that p(kws) may be obtained from earlier processing and is not limited to being computed on the spot; the present invention does not limit how it is obtained.

S304: When the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition, confirm the first audio data as a valid keyword.

In one embodiment of the keyword confirmation method of the present invention, the second preset condition includes: the ratio of the cumulative silence probability to the cumulative keyword probability is greater than a second threshold. In this embodiment, for example, the larger the ratio p(sil)/p(kws), the more reliable the judgment that the second audio data is silence. The second preset condition may therefore be that the ratio of the cumulative silence probability to the cumulative keyword probability exceeds a second threshold; when p(sil)/p(kws) is greater than the second threshold (for example 1.5), the second audio data is deemed to be silence.

In an optional embodiment of the present invention, before step S301, the step of obtaining the audio data, the method further includes the following step.

S300: Detect whether the collected audio data includes a keyword.

In this step, multiple keywords may be pre-stored in the keyword library of the electronic device, such as "你好斑馬", "start the system", "zoom in on the map", "zoom out the map", and "exit navigation". The keyword in the first audio data may be any of these. Using the keyword library, the probability that the input first audio data is similar to each of these keywords can be computed, and the word with the highest probability that also exceeds a set threshold is selected as the detected keyword. Specifically, the sound-unit matching approach of the present invention can be used: the audio data is divided into multiple segments; when a segment has the highest similarity probability with one of the pronunciation units pre-stored in the sound-unit library, the segment is considered to match that unit, which is taken as the similar pronunciation unit, and the similarity is taken as the pronunciation-similarity probability. For a stretch of speech such as the first audio data, the pronunciation-similarity probabilities of multiple segments are processed, for example multiplied, to obtain the path with the highest probability, and the word corresponding to that path is taken as the matched keyword.

In an optional embodiment of the present invention, the keyword has attribute information, and the step of confirming the first audio data as a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies the second preset condition includes: when the attribute information of the keyword indicates a main keyword and the second audio data before the keyword is silence, confirming the keyword as a valid main keyword.
In the embodiments of the present invention, each keyword may correspond to attribute information recording whether the keyword is a main keyword or a secondary keyword. The keywords pre-stored in the keyword library of the electronic device may be classified into main keywords and secondary keywords: for example, "你好斑馬" and "start the system" may be set as main keywords, while "zoom in on the map", "zoom out the map", and "exit navigation" may be set as secondary keywords. For a main keyword, considering that it may be followed by nothing, or directly by further speech to be recognized, such as "你好斑馬, please look up the route to Zhongguancun for me", the condition can be set that the audio data before the keyword is silence and the attribute information of the keyword indicates a main keyword; the keyword is then confirmed as a valid main keyword, without checking whether silence follows the keyword.

In an optional embodiment of the present invention, the keyword has attribute information, and the step of confirming the first audio data as a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies the second preset condition includes: when the attribute information of the keyword indicates a secondary keyword and the second audio data both before and after the keyword is silence, confirming the keyword as a valid secondary keyword. In this step, a secondary keyword may be a command that the user wants the electronic device to execute directly, such as "zoom in on the map". The condition can be set that the content both before and after the keyword is silence and the attribute of the keyword is secondary before the keyword is confirmed as a valid secondary keyword. When the user says something like "I just want to try whether map zooming works", "I wonder whether the map can be zoomed in", or "zooming in on the map will do", the keyword may be detected, but the condition of silence before and after is not satisfied, so it is still not judged to be a valid keyword.

In summary, the keyword confirmation method of this embodiment has at least the following advantages: it exploits the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, it checks whether the keyword is valid, which improves detection accuracy and avoids recognition errors; and segment-level silence detection improves the accuracy of the silence judgment, further avoiding misjudging keywords as non-keywords. In addition, in an optional embodiment a preferred judgment scheme is proposed, in which the ratio of the cumulative silence probability of the second audio data to the cumulative keyword probability of the first audio data, and/or the difference between the pronunciation-similarity probability and the silence probability of the second audio data, is used to judge whether the second audio data is silence, making the judgment more accurate; furthermore, different further confirmation schemes are set for main keywords and secondary keywords, making the judgment more reliable.

Fourth Embodiment

The fourth embodiment of the present invention provides a keyword confirmation method for an in-vehicle terminal. FIG. 6 is a schematic diagram of an in-vehicle terminal in a vehicle environment, and FIG. 7 is a flowchart of the keyword confirmation method for an in-vehicle terminal according to this embodiment. As shown in FIG. 6, the vehicle includes an in-vehicle terminal 200 arranged inside the vehicle; the in-vehicle terminal 200 includes a speaker 400 and a microphone 700, and may further include a screen, buttons, and the like (not shown). Besides being integrated into the in-vehicle terminal, the speaker 400 may be arranged at other positions inside the vehicle for the occupant 600 to listen to information. The in-vehicle terminal 200 has computing and processing capability; it can run an operating system and applications, and can also connect remotely to a server 300 over the Internet 500 for data exchange.

As shown in FIG. 7, the keyword confirmation method for the in-vehicle terminal comprises the following steps.

S401: Obtain first audio data through a vehicle-mounted audio acquisition device, the first audio data being recognized as a keyword.

In this step, the in-vehicle terminal 200 obtains audio data comprising at least the first audio data and preceding or following second audio data. The second audio data is temporally contiguous with the first audio data, and the first audio data detected at this point has already been recognized as a keyword, that is, the audio data detected by the microphone 700 has been confirmed to match a pre-stored keyword. As shown in FIG. 1, the in-vehicle terminal 200 can obtain and examine the first audio data corresponding to the keyword as well as the second audio data before, after, or before and after the keyword. In practice, the terminal's sound-collection component, for example a microphone, captures audio continuously; audio data is obtained in units of "frames", one frame being, for example, 10 ms. After the first audio data is detected as a keyword, several frames of second audio data before or after it, for example 10 frames, are obtained for subsequent analysis.

S402: Determine that multiple consecutive silence segments exist in the second audio data temporally contiguous with the first audio data.

In this step, after a segment of the second audio data is fed into the sound-unit matching model of the in-vehicle terminal, its similarity to the silence unit in the sound-unit library is obtained and used as the silence probability of the segment. For example, if the computed similarity between a segment of the second audio data and the silence unit is 90%, then 90% is taken as the silence probability of that segment; when this probability satisfies a certain requirement, the segment is regarded as a silence segment. In one embodiment, multiple segments can be fed into the sound-unit matching model of the in-vehicle terminal, the silence probability of each segment obtained, and each segment judged to be a silence segment or not accordingly.

S403: Confirm the first audio data as a valid keyword, the valid keyword being used to wake the in-vehicle terminal to execute the instruction corresponding to the keyword.
In this step, when the second audio data contains multiple consecutive silence segments, the second audio data is judged to be silence; the keyword can then be confirmed as a valid keyword, and the corresponding instruction can subsequently be executed. For example, for the multiple segments of the second audio data, whether each segment is a silence segment has already been determined as described above. In this step it can be checked whether these silence segments are consecutive; when multiple (for example three or more) consecutive silence segments are included, the second audio data is considered silence, and the first audio data is accordingly judged to be a valid keyword, the valid keyword being used to wake the in-vehicle terminal to execute the instruction corresponding to the keyword.

In summary, the keyword confirmation method for an in-vehicle terminal of this embodiment has at least the following advantages: it exploits the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, it checks whether the keyword is valid, which improves detection accuracy and avoids recognition errors; and segment-level silence detection improves the accuracy of the silence judgment, further avoiding misjudging keywords as non-keywords.

Fifth Embodiment

The fifth embodiment of the present invention provides a keyword confirmation method for an in-vehicle terminal. FIG. 8 is a flowchart of the keyword confirmation method for an in-vehicle terminal according to this embodiment. As shown in FIG. 8, the method comprises the following steps.

S501: Obtain first audio data through a vehicle-mounted audio acquisition device, the first audio data being recognized as a keyword.

In this step, the in-vehicle terminal obtains audio data comprising at least the first audio data and preceding or following second audio data. The second audio data is temporally contiguous with the first audio data, and the first audio data detected at this point has already been recognized as a keyword, that is, confirmed to match a pre-stored keyword. The in-vehicle terminal can obtain and examine the first audio data corresponding to the keyword as well as the second audio data before, after, or before and after the keyword. In practice, the terminal's sound-collection component, for example a microphone, captures audio continuously; audio data is obtained in units of "frames", one frame being, for example, 10 ms. After the first audio data is detected as a keyword, several frames of second audio data before or after it, for example 10 frames, are obtained for subsequent analysis.

S502: Determine the cumulative silence probability of multiple segments of the second audio data temporally contiguous with the first audio data.

The cumulative silence probability p(sil) of the multiple segments of the second audio data can be obtained as the product or the sum of the silence probability of each segment. The silence probability of each segment can be computed in the manner provided in the first and second embodiments; in step S502, these silence probabilities are added or multiplied to obtain the cumulative silence probability.

S503: Determine the cumulative keyword probability of multiple segments of the first audio data.

In this step, the cumulative keyword probability p(kws) of the first audio data may be the product of the pronunciation-similarity probabilities of the multiple segments of the first audio data. For example, for the first and second segments of the first audio data, both segments are fed into the sound-unit matching model of the in-vehicle terminal. If the first segment's similarity to the first pronunciation unit in the sound-unit library is the highest, at 90%, and its similarity to the second pronunciation unit is next, at 80%, then the first pronunciation unit is taken as the similar pronunciation unit of the first segment and 90% as its pronunciation-similarity probability; if the pronunciation-similarity probability of the second segment is, for example, 70%, the cumulative keyword probability of the first audio data is 90% × 70%.

S504: When the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition, confirm the first audio data as a valid keyword, the valid keyword being used to wake the in-vehicle terminal to execute the instruction corresponding to the keyword.

In this embodiment, for example, the larger the ratio p(sil)/p(kws), the more reliable the judgment that the second audio data is silence. The second preset condition may therefore include: the ratio of the cumulative silence probability to the cumulative keyword probability is greater than a second threshold; when p(sil)/p(kws) is greater than the second threshold (for example 1.5), the second audio data is deemed to be silence.

In summary, the keyword confirmation method for an in-vehicle terminal of this embodiment has at least the advantages described for the previous embodiments: it exploits the user's habitual pause before or after uttering a keyword and uses segment-level silence detection, which improves detection accuracy, avoids recognition errors, and further avoids misjudging keywords as non-keywords.

It is worth noting that although the fourth and fifth embodiments above present keyword confirmation methods for an in-vehicle terminal, it will be clear to those skilled in the art that the keyword confirmation method of the present invention is not limited to in-vehicle terminals and can also be applied to various other smart devices with computing and processing capability, such as mobile phones, servers, and smart-home hardware. Smart-home hardware includes, for example, microwave ovens, ovens, washing machines, dishwashers, air conditioners, routers, smart speakers, televisions, refrigerators, and vacuum cleaners.
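The cumulative-probability test of steps S302 to S304 (and S502 to S504) above can be sketched as follows. The product form of accumulation and the example threshold of 1.5 follow the text; everything else, including the function names, is an illustrative assumption.

```python
from math import prod

def cumulative_silence_probability(silence_probs):
    # p(sil): product (or sum) of the per-segment silence
    # probabilities of the second audio data; the product form is
    # used in this sketch.
    return prod(silence_probs)

def cumulative_keyword_probability(pron_sim_probs):
    # p(kws): product of the per-segment pronunciation-similarity
    # probabilities of the first audio data, e.g. 0.9 * 0.7.
    return prod(pron_sim_probs)

def is_valid_keyword(silence_probs, pron_sim_probs,
                     second_threshold=1.5):
    # The keyword is valid when p(sil) / p(kws) exceeds the second
    # threshold (1.5 in the text's example).
    p_sil = cumulative_silence_probability(silence_probs)
    p_kws = cumulative_keyword_probability(pron_sim_probs)
    return p_sil / p_kws > second_threshold
```

For instance, with second-audio silence probabilities [0.99, 0.99] (p(sil) ≈ 0.98) and keyword-segment probabilities [0.9, 0.7] (p(kws) = 0.63), the ratio is about 1.56 and the keyword is confirmed; with keyword-segment probabilities [0.9, 0.9] the ratio drops to about 1.21 and it is not.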
In one embodiment, when the keyword confirmation method described above is applied to a smart speaker, the keywords may include, for example, "play music" or "next track". When the smart speaker receives the keyword "play music" (the first audio data) through its sound-receiving device, the smart speaker judges whether the second audio data is silence and thereby confirms whether the keyword is a valid keyword. In one embodiment, after the keyword is confirmed as valid, the smart speaker can start executing the music-playing instruction corresponding to the valid keyword. The above has been described taking a smart speaker as an example, but it will be clear to those skilled in the art that the keyword confirmation method provided by the present invention can be applied to all kinds of smart devices and is not limited in this respect.

Sixth Embodiment

The sixth embodiment of the present invention provides a keyword confirmation apparatus. FIG. 9 is a block diagram of the keyword confirmation apparatus according to this embodiment. As shown in FIG. 9, the apparatus includes: an audio data acquisition module 601 for obtaining first audio data, the first audio data being recognized as a keyword; a silence segment determination module 602 for determining that multiple consecutive silence segments exist in second audio data temporally contiguous with the first audio data; and a valid keyword determination module 603 for confirming the first audio data as a valid keyword.

In summary, the keyword confirmation apparatus of this embodiment exploits the general habit of users to pause before or after uttering a keyword, producing silence; by detecting whether silence exists before and after the keyword, it checks whether the keyword is valid, which improves detection accuracy and avoids recognition errors, while segment-level silence detection improves the accuracy of the silence judgment and further avoids misjudging keywords as non-keywords.

Seventh Embodiment

The seventh embodiment of the present invention provides a keyword confirmation apparatus. FIG. 10 is a block diagram of the keyword confirmation apparatus according to this embodiment. As shown in FIG. 10, the apparatus includes: an audio data acquisition module 701 for obtaining first audio data, the first audio data being recognized as a keyword; a cumulative silence probability determination module 702 for determining the cumulative silence probability of multiple segments of second audio data temporally contiguous with the first audio data; a cumulative keyword probability determination module 703 for determining the cumulative keyword probability of multiple segments of the first audio data; and a valid keyword determination module 704 for confirming the first audio data as a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition. This apparatus has the same advantages as described for the sixth embodiment.

Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments.

FIG. 11 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application. As shown in FIG. 11, the terminal device may include an input device 90, a processor 91, an output device 92, a memory 93, and at least one communication bus 94. The communication bus 94 implements the communication connections between the components. The memory 93 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one magnetic-disk memory; various programs may be stored in the memory 93 to carry out various processing functions and implement the method steps of this embodiment.

Optionally, the processor 91 may be implemented as, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components; the processor 91 is coupled to the input device 90 and the output device 92 through wired or wireless connections.

Optionally, the input device 90 may include several kinds of input devices, for example at least one of a user-facing user interface, a device-facing device interface, a programmable software interface, a camera, and sensors. Optionally, the device-facing device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface for data transmission between devices (for example a USB interface or a serial port). Optionally, the user-facing user interface may be, for example, user-facing control keys, a voice input device for receiving voice input, and a touch-sensing device for receiving the user's touch input (for example a touch screen or touchpad with touch-sensing capability). Optionally, the programmable software interface may be, for example, an entry for the user to edit or modify a program, such as an input-pin interface or input interface of a chip. Optionally, a transceiver may include a radio-frequency transceiver chip with communication capability, a baseband processing chip, a transceiver antenna, and the like. A sound input device such as a microphone can receive voice data. The output device 92 may include output devices such as a display and a speaker.
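The module decomposition of the keyword confirmation apparatus described above (modules 601 to 603) might be organized as a simple class; the class and method names below are illustrative assumptions, and the silence judgment reuses the consecutive-segment rule of the method embodiments.

```python
class KeywordConfirmationApparatus:
    """Sketch of the apparatus of the sixth embodiment: an audio data
    acquisition module, a silence segment determination module, and a
    valid keyword determination module (names are illustrative)."""

    def __init__(self, silence_threshold=0.9, min_run=3):
        self.silence_threshold = silence_threshold
        self.min_run = min_run

    def acquire(self, audio):
        # Audio data acquisition module (601): split the incoming
        # audio into (first_audio, second_audio); stubbed here as a
        # dict of per-segment probabilities.
        return audio["first"], audio["second"]

    def judge_silence(self, second_audio):
        # Silence segment determination module (602): look for
        # min_run consecutive silent segments in the second audio.
        run = 0
        for p in second_audio:
            run = run + 1 if p >= self.silence_threshold else 0
            if run >= self.min_run:
                return True
        return False

    def confirm(self, audio):
        # Valid keyword determination module (603): the first audio
        # data is a valid keyword when the second audio data is
        # judged to be silence.
        first, second = self.acquire(audio)
        return self.judge_silence(second)
```

The cumulative-probability apparatus of the seventh embodiment would replace `judge_silence` with the p(sil)/p(kws) ratio test.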
In this embodiment, the processor of the terminal device includes the functions for executing the modules of the data-processing apparatus in each device; for the specific functions and technical effects, refer to the embodiments above, which are not repeated here.

FIG. 12 is a schematic diagram of the hardware structure of a terminal device provided by another embodiment of the present application; FIG. 12 is a specific embodiment of an implementation of FIG. 11. As shown in FIG. 12, the terminal device of this embodiment includes a processor 101 and a memory 102. The processor 101 executes the computer program code stored in the memory 102 to implement the methods of FIG. 1 to FIG. 7 in the embodiments above. The memory 102 is configured to store various types of data to support operation on the terminal device; examples of such data include instructions for any application or method operated on the terminal device, such as messages, pictures, and videos. The memory 102 may include random access memory (RAM) and may also include non-volatile memory, for example at least one magnetic-disk memory.

Optionally, the processor 101 is provided in a processing component 100. The terminal device may further include a communication component 103, a power component 104, a multimedia component 105, an audio component 106, an input/output interface 107, and/or a sensor component 108. The components specifically included in the terminal device are set according to actual requirements, which this embodiment does not limit.

The processing component 100 generally controls the overall operation of the terminal device. It may include one or more processors 101 to execute instructions to complete all or part of the steps of the methods of FIG. 1 to FIG. 7 above. In addition, the processing component 100 may include one or more modules to facilitate interaction between the processing component 100 and other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 105 and the processing component 100.

The power component 104 supplies power to the various components of the terminal device. It may include a power-management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal device.

The multimedia component 105 includes a display screen providing an output interface between the terminal device and the user. In some embodiments, the display screen may include a liquid-crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel; the touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it.

The audio component 106 is configured to output and/or input audio signals. For example, the audio component 106 includes a microphone (MIC) configured to receive external audio signals when the terminal device is in an operating mode such as a speech-recognition mode. The received audio signals may be further stored in the memory 102 or sent via the communication component 103. In some embodiments, the audio component 106 also includes a speaker for outputting audio signals.

The input/output interface 107 provides an interface between the processing component 100 and peripheral interface modules, which may be click wheels, buttons, and the like; the buttons may include, but are not limited to, a volume button, a start button, and a lock button.

The sensor component 108 includes one or more sensors for providing state assessments of various aspects of the terminal device. For example, the sensor component 108 may detect the open/closed state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. It may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 108 may also include a camera and the like.

The communication component 103 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log in to a GPRS network and establish communication with a server over the Internet.

As can be seen from the above, the communication component 103, the audio component 106, the input/output interface 107, and the sensor component 108 involved in the embodiment of FIG. 12 can all serve as implementations of the input device in the embodiment of FIG. 11.

An embodiment of the present application provides a terminal device, including: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the terminal device to perform the keyword confirmation method described in one or more of the embodiments of the present application.

In one embodiment, the terminal device may include all kinds of smart terminal devices with computing and processing capability, such as an in-vehicle terminal, a mobile terminal (for example a mobile phone, tablet computer, or personal digital assistant), a server, an Internet-of-Things device, or smart-home hardware. Smart-home hardware includes, for example, microwave ovens, ovens, washing machines, dishwashers, air conditioners, routers, smart speakers, televisions, refrigerators, and vacuum cleaners. Such smart terminal devices can install applications, provide an operation interface for human-computer interaction, and perform the keyword confirmation methods of the foregoing embodiments.
例如,這些智慧終端機設備可以通過自身或者外接的音頻接收部件接收音頻資料,在確認該第一音頻資料前後的第二音頻資料為靜音後,確認第一音頻資料為有效關鍵詞。例如,針對手機,通過這一方式可以判斷使用者發出的語音指令是否為指示手機中安裝的應用程式執行對應操作的指令——例如開啟音樂、導航等;針對物聯網設備或智慧家居硬體,通過這一方式可以判斷使用者發出的語音指令是否為指示其中安裝的軟體或者系統執行對應的操作的指令——例如連接其他設備、調高空調溫度、開啟烤箱的高溫烘烤模式等。在此並不特別限制。因此,通過上述舉例說明可知,本發明可以應用於各類終端設備。   本說明書中的各個實施例均採用遞進的方式描述,每個實施例重點說明的都是與其他實施例的不同之處,各個實施例之間相同相似的部分互相參見即可。   儘管已描述了本申請實施例的優選實施例,但本領域內的技術人員一旦得知了基本創造性概念,則可對這些實施例做出另外的變更和修改。所以,所附申請專利範圍意欲解釋為包括優選實施例以及落入本申請實施例範圍的所有變更和修改。   最後,還需要說明的是,在本文中,諸如第一和第二等之類的關係術語僅僅用來將一個實體或者操作與另一個實體或操作區分開來,而不一定要求或者暗示這些實體或操作之間存在任何這種實際的關係或者順序。而且,術語「包括」、「包含」或者其任何其他變體意在涵蓋非排他性的包含,從而使得包括一系列要素的過程、方法、物品或者終端設備不僅包括那些要素,而且還包括沒有明確列出的其他要素,或者是還包括為這種過程、方法、物品或者終端設備所固有的要素。在沒有更多限制的情況下,由語句「包括一個……」限定的要素,並不排除在包括所述要素的過程、方法、物品或者終端設備中還存在另外的相同要素。   以上對本申請所提供的一種關鍵詞確認方法和裝置,進行了詳細介紹,本文中應用了具體個例對本申請的原理及實施方式進行了闡述,以上實施例的說明只是用於幫助理解本申請的方法及其核心思想;同時,對於本領域的一般技術人員,依據本申請的思想,在具體實施方式及應用範圍上均會有改變之處,綜上所述,本說明書內容不應理解為對本申請的限制。In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art fall within the protection scope of this application. One of the core ideas of this application is to propose a keyword confirmation method that uses the silence before and after the keyword to determine whether the keyword is indeed a valid keyword. At the same time, in the detection of silence, continuous silence segments of audio data are used. Or the cumulative silence probability of multiple segments is used for judgment, and the accuracy of the judgment is improved. First Embodiment 第一 A first embodiment of the present invention proposes a keyword confirmation method. Fig. 
1 is a schematic diagram of normal keywords and audio data before and after an embodiment of the present invention. As shown in Fig. 1, according to the general habits of users, there will be pauses before or after the keywords are issued, resulting in silence, so it is normal. Keywords before and after can be muted. In this way, it is possible to determine whether the voice uttered by the user is a keyword. In Figure 1, there are three possible situations for normal keywords, including: mute before keywords, mute after keywords, and mute before and after keywords. In the following, the audio material corresponding to the keyword is used as the first audio material, and the audio material corresponding to the mute part before and after the keyword is used as the second audio material. FIG. 2 is a flowchart showing steps of a keyword confirmation method according to a first embodiment of the present invention. As shown in FIG. 2, the keyword confirmation method according to the embodiment of the present invention may include, for example, the following steps: S101, obtaining first audio data, where the first audio data is identified as a keyword; In this step, an execution subject, For example, an electronic device such as a car terminal, a mobile phone, or a tablet computer can obtain audio data, which includes at least a first audio data and a front / back second audio data. The second audio material and the first audio material are continuous in time. The first audio data detected at this time has been identified as a keyword, that is, the audio data detected at this time has been confirmed to match the pre-stored keywords. As shown in FIG. 1, the electronic device can acquire and detect the first audio data corresponding to the keywords shown in FIG. 1 and the second audio data before, after, or before and after the keywords. In actual use, the electronic device A sound collection device such as a microphone can continuously capture audio. 
For example, audio data is acquired in units of "frames", where a frame is, for example, 10 ms. After the first audio data is detected as a keyword, several frames before/after the first audio data, for example 10 frames of second audio data, are acquired for subsequent analysis. In an embodiment, it is necessary to further determine whether the first audio data is a "valid keyword", and only when it is subsequently confirmed to be a valid keyword can a corresponding instruction be executed according to the valid keyword.

S102, determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data;

In this step, after a segment of the second audio data is input into the sound unit matching model of the electronic device, its similarity with the mute unit in the sound unit library can be obtained as the mute probability of the segment. For example, for a segment of the second audio data, after it is input into the sound unit matching model and its similarity with the mute unit is calculated to be 90%, 90% is used as the mute probability corresponding to the segment. When this mute probability satisfies certain requirements, the segment of the second audio data is considered to be a mute segment. In one embodiment, multiple segments may be input into the sound unit matching model of the electronic device to obtain the mute probability corresponding to each segment, and the mute probability is used to determine whether the segment is a mute segment. After determining which segments are mute segments, it may be determined whether the second audio data includes multiple consecutive mute segments. For example, for the multiple segments of the second audio data, after learning whether each segment is a mute segment, it can be detected whether the mute segments are consecutive.
For example, each segment carries an identifier f indicating whether it is a mute segment. When it is detected that three temporally consecutive segments all carry the mute flag f, it is considered that there are multiple consecutive mute segments in the second audio data.

S103, confirming that the first audio data is a valid keyword.

In this step, when there are multiple consecutive mute segments in the second audio data, the second audio data is determined to be mute, thereby confirming that the keyword is a valid keyword, and a corresponding instruction can subsequently be executed based on the valid keyword. For example, when the second audio data includes multiple (for example, three or more) consecutive mute segments, the second audio data is considered to be mute, and the first audio data is then determined to be a valid keyword.

It is worth noting that the keywords mentioned above and below can cover several kinds of content, for example, the wake-up word used to wake up the operating system of the electronic device, the user's voice command, and key parameters in a command. For example, during a voice operation performed by the user on the electronic device, inputs such as "turn on the system", "tune the frequency to 87.6", and "87.6" all belong to the category of "keywords" proposed in the embodiments of the present invention, which is not particularly limited here.

It can be seen from the above that the keyword confirmation method provided by the first embodiment of the present invention has at least the following technical effects: the method makes use of the general habit of users, namely that there will be a pause before or after a keyword is uttered, producing silence. By detecting whether there is silence before and after the keyword to determine whether the keyword is a valid keyword, the detection accuracy is improved and recognition errors are avoided.
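The logic of steps S102 and S103 above can be sketched as follows. This is a minimal illustration only; the mute-probability values, the 0.85 flag threshold, and the run length of three are assumptions for the example, not values fixed by this embodiment:

```python
def has_consecutive_mute(mute_probs, threshold=0.85, run_length=3):
    """Return True when `run_length` temporally consecutive segments of the
    second audio data are flagged as mute (flag f), i.e. their mute
    probability reaches `threshold` (steps S102-S103)."""
    consecutive = 0
    for p in mute_probs:
        consecutive = consecutive + 1 if p >= threshold else 0
        if consecutive >= run_length:
            return True
    return False

# Three consecutive high mute probabilities: the keyword is confirmed valid.
frame_mute_probs = [0.91, 0.95, 0.93, 0.12, 0.40]
print(has_consecutive_mute(frame_mute_probs))  # prints True
```

A single low-probability frame breaks the run, so isolated noise in the middle of speech does not count as silence.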
At the same time, the detection of silence uses the detection of mute segments of the audio data, which improves the accuracy of judging whether the audio is mute, further avoiding keywords being misjudged as non-keywords.

Second Embodiment

A second embodiment of the present invention proposes a keyword confirmation method. FIG. 3 is a flowchart of the steps of a keyword confirmation method according to the second embodiment of the present invention. As shown in FIG. 3, the keyword confirmation method according to the embodiment of the present invention includes the following steps:

S201, obtaining first audio data, where the first audio data is identified as a keyword;

S202, determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data;

S203, confirming that the first audio data is a valid keyword.

The above steps S201 to S203 are the same as or similar to steps S101 to S103 of the previous embodiment and are not repeated here; this embodiment focuses on the differences from the previous embodiment.

In an embodiment of the keyword confirmation method of the present invention, step S202, that is, the step of determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data, may include the following sub-steps:

S2021, determining the pronunciation similarity probability of a segment, the pronunciation similarity probability being the maximum similarity probability between the segment and multiple pronunciation units;

In this sub-step, a segment of the second audio data may be, for example, the audio frame mentioned above, or a fragment in other units, which is not limited here; any fragment obtained by dividing audio data according to a specific rule, such as time or storage method, falls within the protection scope of the present invention.
For example, a segment can be an audio frame of 10 ms or 20 ms, or an audio passage of 1 s. A pronunciation unit can be a phoneme, syllable, character, word, or similar unit derived from the user's pronunciation. For example, when the user utters "zebra": when pronunciation units are divided by phonemes, which are smaller than syllables, it can be known from an existing phoneme set that the phonemes corresponding to "zebra" are "b a_h nn_h m a_l a_l"; when pronunciation units are divided by syllables, the syllables corresponding to "zebra" are "b an ma"; when pronunciation units are divided by character pronunciation, the corresponding division of "zebra" is "ban ma"; and when pronunciation units are divided by word, the corresponding division of "zebra" is "banma". For each kind of division, a corresponding sound unit library can be constructed. In addition to the pronunciation units mentioned above, the sound unit library may also include a mute unit and the like.

When a segment of the second audio data has the highest similarity probability with one of the pronunciation units pre-stored in the sound unit library, the segment is considered to match that pronunciation unit; that pronunciation unit is then regarded as the similar pronunciation unit, and the similarity between the segment and the similar pronunciation unit is used as the pronunciation similarity probability. For a fragment of the second audio data, the fragment is input into the sound unit matching model of the electronic device for judgment. If the fragment's similarity is highest with the fifth pronunciation unit in the sound unit library, at 80%, and second highest with the sixth pronunciation unit, at 70%, then the fifth pronunciation unit with 80% similarity is recorded as the similar pronunciation unit corresponding to the segment, and 80% is taken as the pronunciation similarity probability for subsequent processing.
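Sub-step S2021 amounts to taking the maximum over the similarity scores returned by the sound unit matching model. A minimal sketch, in which the unit names and scores are hypothetical illustrations rather than part of the patent's model:

```python
def pronunciation_similarity(unit_scores):
    """unit_scores: mapping from pronunciation-unit name to the similarity
    probability the matching model assigned to this segment.
    Returns the similar pronunciation unit and the pronunciation
    similarity probability (the maximum similarity), per sub-step S2021."""
    best_unit = max(unit_scores, key=unit_scores.get)
    return best_unit, unit_scores[best_unit]

# The text's example: 80% with the fifth unit, 70% with the sixth.
unit, p_max = pronunciation_similarity({"unit_5": 0.80, "unit_6": 0.70})
print(unit, p_max)  # prints: unit_5 0.8
```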
S2022, determining the mute probability of the segment, where the mute probability is the similarity probability between the segment and the mute unit;

In this sub-step, after a segment of the second audio data is input into the sound unit matching model of the electronic device, its similarity with the mute unit in the sound unit library can be obtained and used as the mute probability of the segment. For example, for a segment of the second audio data, after it is input into the sound unit matching model and its similarity with the mute unit is calculated to be 90%, 90% is used as the mute probability corresponding to the segment.

It is worth noting that the above-mentioned mute unit can be pre-stored in the sound unit library and can be obtained through repeated training on a large amount of data, for example by comprehensively considering sound energy and environmental noise (including wind, music, car horns, and the like); the mute unit is therefore not limited to absolute silence. The length, attributes, and so on of the mute unit may correspond to those of the pronunciation unit. For example, when pronunciation units are divided by phonemes, the mute unit may be a mute phoneme; when pronunciation units are divided by syllables, the mute unit may be a mute syllable, which is not limited here.

S2023, when the relationship between the pronunciation similarity probability and the mute probability satisfies a preset condition, determining that the segment is a mute segment. The preset condition includes, for example: the absolute value of the difference between the pronunciation similarity probability and the mute probability of the segment is less than a first threshold.
In this step, the pronunciation similarity probability and the corresponding mute probability obtained previously for each segment of the second audio data may be used to determine whether the segment of the second audio data is mute. As can be seen from the above, in the solution proposed by the embodiment of the present invention, the judgment of silence does not compare the segments of the audio data with absolute silence, but compares the pronunciation similarity probability with the corresponding mute probability, comprehensively taking environmental noise into account. Therefore, the solution provided by the present invention can avoid rejecting correct keywords due to inaccurate silence judgment.

There are many ways to judge whether a segment of audio data is mute using the pronunciation similarity probability and the mute probability, which is described here by way of example. For example, when the segment satisfies the requirement that the absolute value of the difference between the pronunciation similarity probability pmax(indexframe) and the mute probability psil(indexframe) is less than 15%, namely:

|pmax(indexframe) − psil(indexframe)| < 15%,

the segment is considered to be a mute segment.

In an embodiment of the keyword confirmation method of the present invention, in the above sub-step S2022, the mute probability may also be judged using the similarity probability between the mute unit and the pronunciation unit corresponding to the maximum similarity probability. That is, sub-step S2022 may be replaced with the following sub-step:

S2024, determining the mute probability of the segment, where the mute probability is the similarity probability between the mute unit and the pronunciation unit corresponding to the maximum similarity probability.

In sub-step S2021, the pronunciation similarity probability of the segment has already been determined.
For example, in the foregoing example, a segment of the second audio data is judged through the sound unit matching model of the electronic device, and it is determined that the segment has the highest similarity, 80%, with the fifth pronunciation unit in the sound unit library; the fifth pronunciation unit corresponding to this 80% maximum similarity probability is the similar pronunciation unit. In this sub-step, the similarity probability between the fifth pronunciation unit and the mute unit can be calculated and used as the mute probability of the segment. Based on the methods listed above and their own technical capabilities, those skilled in the art can use these pronunciation similarity probabilities and mute probabilities to devise many judgment methods for determining whether a segment of the second audio data is mute, which is not particularly limited in the present invention.

After step S2023 or step S2024, that is, after determining that a segment is a mute segment, step S202, namely determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data, may further include the following sub-step:

S2025, determining, according to the determined mute segments, that there are multiple consecutive mute segments in the second audio data.

In this sub-step, it can be determined whether the second audio data includes multiple consecutive mute segments. For example, for the multiple segments of the second audio data, in sub-step S2023 or S2024 an identifier f is set for each segment that is a mute segment; when it is detected that three temporally consecutive segments all carry the mute flag f, it is considered that there are multiple consecutive mute segments in the second audio data.
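Sub-steps S2023 and S2025 can be combined into a short sketch: each segment is flagged mute when the absolute difference between its pronunciation similarity probability and its mute probability falls below the first threshold (15% in the example above), and the second audio data passes when three consecutive flags appear. The per-segment probability pairs below are illustrative assumptions:

```python
def mute_flags(segment_probs, first_threshold=0.15):
    """segment_probs: list of (p_max, p_sil) pairs per segment of the
    second audio data. A segment gets the mute flag f when
    |p_max - p_sil| < first_threshold (sub-step S2023)."""
    return [abs(p_max - p_sil) < first_threshold for p_max, p_sil in segment_probs]

def has_three_consecutive(flags):
    """Sub-step S2025: three temporally consecutive mute flags."""
    run = 0
    for f in flags:
        run = run + 1 if f else 0
        if run >= 3:
            return True
    return False

flags = mute_flags([(0.80, 0.90), (0.75, 0.85), (0.70, 0.78), (0.90, 0.20)])
print(flags, has_three_consecutive(flags))
```

The last segment, where the pronunciation similarity probability is far above the mute probability, is clearly speech and is not flagged.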
In an embodiment of the keyword confirmation method of the present invention, in the above sub-step S2025, that is, the step of determining that there are multiple consecutive mute segments in the second audio data, "multiple" may mean three or more. That is, sub-step S2025 may be: determining that there are three or more consecutive mute segments in the second audio data.

In an embodiment of the keyword confirmation method of the present invention, before step S201, that is, the step of obtaining audio data, the method may further include:

S200, detecting whether the collected audio data includes a keyword.

In this step, multiple keywords can be pre-stored in the keyword library of the electronic device, such as "hello zebra", "open system", "zoom in map", "zoom out map", "exit navigation", and so on, and the keyword in the first audio data may be any of them. The keyword library can be used to calculate the probability that the input first audio data is similar to each of these keywords, and the word with the highest probability, provided it is higher than a set threshold, is selected as the detected keyword. Specifically, for example, the sound unit matching method of the present invention can be used to divide the audio data into multiple segments. When a segment has the highest similarity probability with one of the pronunciation units pre-stored in the sound unit library, the segment is considered to match that pronunciation unit; that pronunciation unit is regarded as the similar pronunciation unit, and the similarity between the segment and the similar pronunciation unit is used as the pronunciation similarity probability. For a piece of speech such as the first audio data, the pronunciation similarity probabilities of its multiple segments are processed, for example multiplied together, to obtain the path with the highest probability, and the word corresponding to that path is used as the matching keyword.
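The path-probability selection in step S200 can be sketched as follows. The candidate keywords come from the keyword library named in the text, while the per-segment probabilities and the 0.5 acceptance threshold are hypothetical values for illustration:

```python
import math

def detect_keyword(path_probs, threshold=0.5):
    """path_probs: mapping from candidate keyword to the list of pronunciation
    similarity probabilities of its segments. The path probability is the
    product of the per-segment probabilities; the keyword with the highest
    path probability is returned if it clears `threshold`, else None."""
    best_kw = max(path_probs, key=lambda kw: math.prod(path_probs[kw]))
    if math.prod(path_probs[best_kw]) >= threshold:
        return best_kw
    return None

candidates = {
    "hello zebra": [0.9, 0.9, 0.95],   # product ~ 0.77
    "open system": [0.4, 0.5, 0.6],    # product 0.12
}
print(detect_keyword(candidates))  # prints: hello zebra
```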
In an embodiment of the keyword confirmation method of the present invention, the keyword has attribute information, and step S203, that is, the step of confirming that the first audio data is a valid keyword, may include: when the attribute information of the keyword is the main keyword and the second audio data before the keyword is mute, confirming that the keyword is a valid main keyword.

In the embodiment of the present invention, each keyword may correspond to a piece of attribute information, and this information records whether the keyword is a main keyword or a subsidiary keyword. The multiple keywords pre-stored in the keyword library of the electronic device can be categorized as, for example, main and subsidiary keywords: keywords such as "hello zebra" and "open system" can be set as main keywords, and keywords such as "zoom in map", "zoom out map", and "exit navigation" can be set as subsidiary keywords. For a main keyword, considering that there may be no content after the main keyword, or that speech recognition may directly follow it, as in "Hello zebra, please help me find the way to Zhongguancun", the rule can be set such that if the audio before the keyword is mute and the attribute information of the keyword is the main keyword, the keyword is confirmed as a valid main keyword, without checking whether the audio after the keyword is mute.

In an embodiment of the keyword confirmation method of the present invention, the keyword has attribute information, and step S203, that is, the step of confirming that the first audio data is a valid keyword, may include: when the attribute information of the keyword is a subsidiary keyword, and the second audio data both before and after the keyword is mute, confirming that the keyword is a valid subsidiary keyword.

In this step, the subsidiary keyword may be a command that the user asks the electronic device to execute directly, such as "zoom in map".
The rule can be set such that when the content before and after the keyword is mute and the attribute of the keyword is a subsidiary keyword, the keyword is confirmed as a valid subsidiary keyword. When the user says "I just want to try to enlarge the map", "I don't know if I can enlarge the map", or "Enlarge the map", the keyword can still be detected, but because the silence condition is not satisfied, it will not be judged as a valid keyword.

In summary, the keyword confirmation method provided in this embodiment has at least the following advantages: the keyword confirmation method according to the embodiment of the present invention makes use of the general habit of users, namely that there will be pauses before and after a keyword, producing silence. By detecting whether there is silence before and after the keyword to determine whether the keyword is a valid keyword, the detection accuracy is improved and recognition errors are avoided. At the same time, the detection of silence uses the detection of mute segments of the audio data, which improves the accuracy of judging whether the audio is mute and further avoids keywords being misjudged as non-keywords.

In addition, the keyword confirmation method proposed in this embodiment has at least the following advantages: in an optional embodiment of the keyword confirmation method provided in the embodiment of the present invention, a preferred judgment method is proposed, which uses the ratio of the cumulative mute probability of the second audio data to the cumulative keyword probability of the first audio data, and/or the difference between the pronunciation similarity probability and the mute probability of the second audio data, to determine whether the second audio data is mute, making the judgment result more accurate; for different types of keywords, namely main keywords and subsidiary keywords, different ways of further confirmation are provided, making the judgment results more reliable.
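The attribute-based confirmation described above for main and subsidiary keywords reduces to a small decision rule. A sketch under the assumption that the attribute information is a simple string tag:

```python
def confirm_by_attribute(attribute, mute_before, mute_after):
    """Main keywords ("hello zebra", "open system") only require the second
    audio data before the keyword to be mute; subsidiary keywords
    ("zoom in map", ...) require silence both before and after."""
    if attribute == "main":
        return mute_before
    if attribute == "sub":
        return mute_before and mute_after
    return False

# "Hello zebra, please help me find the way ...": no silence after, still valid.
print(confirm_by_attribute("main", True, False))
# "I just want to try to enlarge the map": no silence before, rejected.
print(confirm_by_attribute("sub", False, True))
```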
Third Embodiment

A third embodiment of the present invention proposes a keyword confirmation method. FIG. 5 is a flowchart of the steps of a keyword confirmation method according to the third embodiment of the present invention. As shown in FIG. 5, the keyword confirmation method according to the embodiment of the present invention includes the following steps:

S301, acquiring first audio data, where the first audio data is identified as a keyword;

In this step, an execution subject, for example, an electronic device such as a vehicle-mounted terminal, a mobile phone, or a tablet computer, can obtain audio data, which includes at least the first audio data and the preceding/following second audio data. The second audio data and the first audio data are continuous in time. The first audio data detected at this point has already been identified as a keyword, that is, the detected audio data has been confirmed to match a pre-stored keyword. The electronic device can acquire and detect the first audio data corresponding to the keyword and the second audio data before, after, or before and after the keyword. In actual use, a sound collection device of the electronic device, such as a microphone, can continuously collect audio; the audio data is acquired in units of "frames", where a frame is, for example, 10 ms, and after the first audio data is detected as a keyword, several frames before/after the first audio data, such as 10 frames of second audio data, are acquired for subsequent analysis.

S302, determining the cumulative mute probability of multiple segments of the second audio data that is temporally continuous with the first audio data;

The cumulative mute probability p(sil) of the multiple segments of the second audio data may be obtained as the product or sum of the mute probabilities of the individual segments.
For each segment, the mute probability can be calculated in the manner provided in the foregoing first and second embodiments; in step S302, the cumulative mute probability can then be obtained by adding or multiplying these mute probabilities.

S303, determining the cumulative keyword probability of multiple segments of the first audio data;

In this step, the cumulative keyword probability p(kws) of the first audio data may be the product of the pronunciation similarity probabilities corresponding to the multiple segments of the first audio data. For example, for the first segment and the second segment of the first audio data, these two segments are input into the sound unit matching model of the electronic device for judgment. Suppose the first segment's similarity is highest with the first pronunciation unit, at 90%, and its similarity with the second pronunciation unit is 80%; then the first pronunciation unit is used as the similar pronunciation unit of the first segment, and 90% is used as the pronunciation similarity probability of the first segment. If the pronunciation similarity probability of the second segment is, for example, 70%, the cumulative keyword probability of the first audio data is 90% × 70%. It is worth noting that the cumulative keyword probability p(kws) of the first audio data can be obtained from pre-processing and is not limited to on-the-spot calculation; the invention does not limit the way it is obtained.

S304, when the relationship between the cumulative mute probability and the cumulative keyword probability satisfies a second preset condition, confirming that the first audio data is a valid keyword.
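Steps S302 to S304 can be sketched as follows, taking the cumulative values as products (the text also allows sums) and using the example second threshold of 1.5 given below; the probability values themselves are illustrative assumptions:

```python
import math

def is_valid_keyword(second_mute_probs, first_pron_probs, second_threshold=1.5):
    """S302: cumulative mute probability p(sil) of the second audio data.
    S303: cumulative keyword probability p(kws) of the first audio data.
    S304: the keyword is valid when p(sil)/p(kws) exceeds the second threshold."""
    p_sil = math.prod(second_mute_probs)
    p_kws = math.prod(first_pron_probs)
    return p_sil / p_kws > second_threshold

# Keyword segments at 90% and 70% give p(kws) = 0.63; two near-certain
# mute frames give p(sil) = 0.9801, so the ratio exceeds 1.5.
print(is_valid_keyword([0.99, 0.99], [0.9, 0.7]))  # prints True
```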
In an embodiment of the keyword confirmation method of the present invention, the second preset condition includes: the ratio of the cumulative mute probability to the cumulative keyword probability is greater than a second threshold. In this embodiment, the larger the ratio p(sil)/p(kws), the more confidently the second audio data can be judged as mute. Therefore, the second preset condition may be set to include that the ratio of the cumulative mute probability to the cumulative keyword probability is greater than the second threshold; when p(sil)/p(kws) is greater than the second threshold (for example, 1.5), the second audio data is considered to be mute.

In an optional embodiment of the present invention, before step S301, that is, the step of obtaining audio data, the method further includes:

S300, detecting whether the collected audio data includes a keyword.

In this step, multiple keywords can be pre-stored in the keyword library of the electronic device, such as "hello zebra", "open system", "zoom in map", "zoom out map", "exit navigation", and so on, and the keyword in the first audio data may be any of them. The keyword library can be used to calculate the probability that the input first audio data is similar to each of these keywords, and the word with the highest probability, provided it is higher than a set threshold, is selected as the detected keyword. Specifically, for example, the sound unit matching method of the present invention can be used to divide the audio data into multiple segments.
When a segment has the highest similarity probability with one of the pronunciation units pre-stored in the sound unit library, the segment is considered to match that pronunciation unit; that pronunciation unit is regarded as the similar pronunciation unit, and the similarity between the segment and the similar pronunciation unit is used as the pronunciation similarity probability. For a piece of speech such as the first audio data, the pronunciation similarity probabilities of its multiple segments are processed, for example multiplied together, to obtain the path with the highest probability, and the word corresponding to that path is used as the matching keyword.

In an optional embodiment of the present invention, the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword when the relationship between the cumulative mute probability and the cumulative keyword probability satisfies the second preset condition includes: when the attribute information of the keyword is the main keyword and the second audio data before the keyword is mute, confirming that the keyword is a valid main keyword.

In the embodiment of the present invention, each keyword may correspond to a piece of attribute information, and this information records whether the keyword is a main keyword or a subsidiary keyword. The multiple keywords pre-stored in the keyword library of the electronic device can be categorized as, for example, main and subsidiary keywords: keywords such as "hello zebra" and "open system" can be set as main keywords, and keywords such as "zoom in map", "zoom out map", and "exit navigation" can be set as subsidiary keywords.
For a main keyword, considering that there may be no content after the main keyword, or that speech recognition may directly follow it, as in "Hello zebra, please help me find the way to Zhongguancun", the rule can be set such that if the audio before the keyword is mute and the attribute information of the keyword is the main keyword, the keyword is confirmed as a valid main keyword, without checking whether the audio after the keyword is mute.

In an optional embodiment of the present invention, the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword when the relationship between the cumulative mute probability and the cumulative keyword probability satisfies the second preset condition includes: when the attribute information of the keyword is a subsidiary keyword, and the second audio data both before and after the keyword is mute, confirming that the keyword is a valid subsidiary keyword.

In this step, the subsidiary keyword may be a command that the user asks the electronic device to execute directly, such as "zoom in map". The rule can be set such that when the content before and after the keyword is mute and the attribute of the keyword is a subsidiary keyword, the keyword is confirmed as a valid subsidiary keyword. When the user says "I just want to try to enlarge the map", "I don't know if I can enlarge the map", or "Enlarge the map", the keyword can still be detected, but because the silence condition is not satisfied, it will not be judged as a valid keyword.

In summary, the keyword confirmation method provided in this embodiment has at least the following advantages: the keyword confirmation method provided in the embodiment of the present invention makes use of the general habit of users, namely that there will be a pause before or after a keyword is uttered, producing silence. By detecting whether there is silence before and after the keyword to determine whether the keyword is a valid keyword, the detection accuracy is improved and recognition errors are avoided.
At the same time, the detection of silence uses the detection of mute segments of the audio data, which improves the accuracy of judging whether the audio is mute and further avoids keywords being misjudged as non-keywords.

In addition, the keyword confirmation method proposed in this embodiment has at least the following advantages: in an optional embodiment of the keyword confirmation method provided in the embodiment of the present invention, a preferred judgment method is proposed, which uses the ratio of the cumulative mute probability of the second audio data to the cumulative keyword probability of the first audio data, and/or the difference between the pronunciation similarity probability and the mute probability of the second audio data, to determine whether the second audio data is mute, making the judgment result more accurate; for different types of keywords, namely main keywords and subsidiary keywords, different ways of further confirmation are provided, making the judgment results more reliable.

Fourth Embodiment

A fourth embodiment of the present invention proposes a keyword confirmation method for a vehicle-mounted terminal. FIG. 6 is a schematic diagram of a vehicle environment including a vehicle-mounted terminal, and FIG. 7 is a flowchart of a keyword confirmation method for a vehicle-mounted terminal according to the fourth embodiment of the present application. As shown in FIG. 6, the vehicle includes an in-vehicle terminal 200 provided in the vehicle. The in-vehicle terminal 200 includes a speaker 400 and a microphone 700, and may further include a screen, buttons, and the like (not shown). In addition to being integrated into the vehicle-mounted terminal, the speaker 400 may also be provided in other positions inside the vehicle for the occupant 600 to listen to information.
The in-vehicle terminal 200 has computing and processing functions, can have an operating system and application programs installed, and can also be remotely connected to the server 300 through the Internet 500 for data interaction. As shown in FIG. 7, the keyword confirmation method for a vehicle-mounted terminal according to the embodiment of the present invention is as follows:

S401, acquiring first audio data through a vehicle-mounted audio acquisition device, where the first audio data is identified as a keyword;

In this step, the vehicle-mounted terminal 200 can acquire audio data, which includes at least the first audio data and the preceding/following second audio data. The second audio data and the first audio data are continuous in time. The first audio data detected at this point has already been identified as a keyword, that is, the audio data detected by the microphone 700 has been confirmed to match a pre-stored keyword. As shown in FIG. 1, the in-vehicle terminal 200 can acquire and detect the first audio data corresponding to the keyword shown in FIG. 1 and the second audio data before, after, or before and after the keyword. In actual use, a sound collection device of the vehicle-mounted terminal, such as a microphone, can continuously collect audio. For example, audio data is acquired in units of "frames", where a frame is, for example, 10 ms; after the first audio data is detected as a keyword, several frames before/after the first audio data, such as 10 frames of second audio data, are acquired for subsequent analysis.

S402, determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data;

In this step, after a segment of the second audio data is input into the sound unit matching model of the vehicle-mounted terminal, its similarity with the mute unit in the sound unit library can be obtained as the mute probability of the segment.
For example, for a segment of the second audio data, after it is input into the sound unit matching model and its similarity with the mute unit is calculated to be 90%, 90% is used as the mute probability corresponding to the segment. When this mute probability satisfies certain requirements, the segment of the second audio data is considered to be a mute segment. In one embodiment, multiple segments can be input into the sound unit matching model of the vehicle-mounted terminal to obtain the mute probability corresponding to each segment, and the mute probability is used to determine whether the segment is a mute segment.

S403, confirming that the first audio data is a valid keyword, where the valid keyword is used to wake up the vehicle-mounted terminal to execute an instruction corresponding to the keyword.

In this step, when there are multiple consecutive mute segments in the second audio data and the second audio data is therefore determined to be mute, the keyword can be confirmed as a valid keyword, and a corresponding instruction can subsequently be executed according to the valid keyword. For example, for the multiple segments of the second audio data, it is known whether each segment is a mute segment; in this step, it can be detected whether the mute segments are consecutive. When multiple (for example, three or more) consecutive mute segments are included, the second audio data is considered to be mute, and the first audio data is determined to be a valid keyword, where the valid keyword is used to wake up the vehicle-mounted terminal to execute an instruction corresponding to the keyword.
In summary, the keyword confirmation method for an in-vehicle terminal provided in this embodiment has at least the following advantages: the method exploits the user's general habit of pausing, and thereby producing silence, before or after uttering a keyword. By detecting whether there is silence before and after the keyword, it detects whether the keyword is a valid keyword, which improves detection accuracy and avoids recognition errors. By using mute-segment detection on the audio data, the accuracy of the silence decision is improved, further avoiding the misjudgment of keywords as non-keywords. Fifth Embodiment. A fifth embodiment of the present invention provides a keyword confirmation method for a vehicle-mounted terminal. FIG. 8 is a flowchart of a keyword confirmation method for a vehicle-mounted terminal according to the sixth embodiment of the present application. As shown in FIG. 8, the method is as follows: S501. Acquire first audio data through an in-vehicle audio acquisition device, where the first audio data has been identified as a keyword. In this step, the vehicle-mounted terminal can acquire audio data that includes at least the first audio data and second audio data preceding and/or following it; the second audio data and the first audio data are continuous in time. The first audio data detected at this point has already been identified as a keyword, that is, the detected audio data has been confirmed to match a pre-stored keyword. The vehicle-mounted terminal can acquire and detect the first audio data corresponding to the keyword, and the second audio data before, after, or both before and after the keyword.
In actual use, the sound collection device of the vehicle terminal, such as a microphone, can collect audio continuously. Audio data is acquired in units of "frames", where a frame is, for example, 10 ms. After the first audio data is detected as a keyword, several frames before/after the first audio data, for example 10 frames of second audio data, are acquired for subsequent analysis. S502. Determine a cumulative mute probability of multiple segments of the second audio data that is temporally continuous with the first audio data. The cumulative mute probability p(sil) of the multiple segments of the second audio data may be obtained as the product or the sum of the mute probabilities of the individual segments. For each segment, the mute probability can be calculated in the manner provided in the foregoing first and second embodiments; in step S502 the cumulative value is then obtained by adding or multiplying these per-segment mute probabilities. S503. Determine the cumulative keyword probability of multiple segments of the first audio data. In this step, the cumulative keyword probability p(kws) of the first audio data may be the product of the pronunciation similarity probabilities corresponding to the multiple segments of the first audio data. For example, the first segment and the second segment of the first audio data are input into the sound unit matching model of the vehicle terminal for judgment. Suppose the first segment's similarity to the first pronunciation unit is the highest, at 90%, while its similarity to the second pronunciation unit is 80%; then the first pronunciation unit is taken as the similar pronunciation unit of the first segment, and 90% is taken as the pronunciation similarity probability of the first segment. If the pronunciation similarity probability of the second segment is, for example, 70%, the cumulative keyword probability of the first audio data is 90% × 70%. S504.
When the relationship between the cumulative mute probability and the cumulative keyword probability satisfies a second preset condition, confirm that the first audio data is a valid keyword, where the valid keyword is used to wake the vehicle terminal to execute an instruction corresponding to the keyword. In this embodiment, the larger the ratio p(sil)/p(kws), the more reliable the judgment that the second audio data is mute. Therefore, a second preset condition may be set such that the ratio of the cumulative mute probability to the cumulative keyword probability is greater than a second threshold; when p(sil)/p(kws) is greater than the second threshold (for example, 1.5), the second audio data is considered mute. In an embodiment of the keyword confirmation method of the present invention, the second preset condition includes: the ratio of the cumulative mute probability to the cumulative keyword probability is greater than a second threshold. In summary, the keyword confirmation method for an in-vehicle terminal provided in this embodiment has at least the following advantages: the method exploits the user's general habit of pausing, and thereby producing silence, before or after uttering a keyword. By detecting whether there is silence before and after the keyword, it detects whether the keyword is a valid keyword, which improves detection accuracy and avoids recognition errors; meanwhile, in the silence detection, using mute-segment detection on the audio data improves the accuracy of the silence decision, further avoiding the misjudgment of keywords as non-keywords.
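Steps S502–S504 can be sketched as the following ratio test. This is an illustrative sketch: the product/sum aggregation and the example threshold of 1.5 follow the description above, while the per-segment probability values passed in are made-up numbers (the keyword probabilities 0.9 and 0.7 echo the worked example).

```python
# Hypothetical sketch of the S502-S504 ratio test: aggregate per-segment mute
# probabilities into p(sil), per-segment pronunciation similarities into p(kws),
# and confirm the keyword when p(sil)/p(kws) exceeds the second threshold.
import math

SECOND_THRESHOLD = 1.5  # example second threshold from the description

def cumulative_prob(probs: list[float], mode: str = "product") -> float:
    """Cumulative probability of per-segment probabilities; the description
    allows either the product or the sum."""
    return math.prod(probs) if mode == "product" else sum(probs)

def is_valid_keyword(mute_probs: list[float],
                     keyword_probs: list[float],
                     threshold: float = SECOND_THRESHOLD) -> bool:
    """S504: the first audio data is a valid keyword when p(sil)/p(kws) > threshold."""
    p_sil = cumulative_prob(mute_probs)      # S502: cumulative mute probability
    p_kws = cumulative_prob(keyword_probs)   # S503: cumulative keyword probability
    return p_sil / p_kws > threshold

# Example: two very quiet second-audio segments, and the two keyword segments
# from the text (90% and 70% pronunciation similarity, so p(kws) = 0.63).
print(is_valid_keyword([0.98, 0.99], [0.9, 0.7]))  # 0.9702 / 0.63 ≈ 1.54 > 1.5 -> True
```

The ratio form normalizes the silence evidence by how confidently the keyword itself was recognized, so a weakly matched keyword needs stronger surrounding silence to be confirmed.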
It is worth noting that although the above fourth and fifth embodiments propose a keyword confirmation method for a vehicle terminal, those skilled in the art will clearly understand that the keyword confirmation method proposed by the present invention need not be limited to a vehicle terminal and can be applied to various other smart devices, such as mobile phones, servers, smart home hardware, and other smart devices with computing and processing functions. Smart home hardware includes, for example, microwave ovens, ovens, washing machines, dishwashers, air conditioners, routers, smart speakers, TVs, refrigerators, vacuum cleaners, and the like. In one embodiment, when the above keyword confirmation method is applied to a smart speaker, the keywords may include, for example, "play music" or "next song". When the smart speaker receives the keyword "play music" (first audio data) through its sound receiving device, the smart speaker judges whether the second audio data is mute and then confirms whether the keyword is a valid keyword. In an embodiment, after the keyword is confirmed as valid, the smart speaker may start executing the music-playing instruction corresponding to the valid keyword. The above embodiment is described taking a smart speaker as an example, but those skilled in the art will clearly understand that the keyword confirmation method provided by the present invention can be applied to various smart devices, which is not limited herein. Sixth Embodiment. A sixth embodiment of the present invention provides a keyword confirmation device. FIG. 9 is a block diagram of a keyword confirmation device according to the seventh embodiment of the present application. As shown in FIG.
9, a keyword confirmation device according to an embodiment of the present invention includes: an audio data acquisition module 601, configured to acquire first audio data, where the first audio data is identified as a keyword; a mute segment determination module 602, configured to determine that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data; and a valid keyword determination module 603, configured to confirm that the first audio data is a valid keyword. In summary, the keyword confirmation device provided in this embodiment has at least the following advantages: the device exploits the user's general habit of pausing, and thereby producing silence, before or after uttering a keyword. By detecting whether there is silence before and after the keyword, it detects whether the keyword is a valid keyword, which improves detection accuracy and avoids recognition errors. At the same time, using mute-segment detection on the audio data in the silence detection improves the accuracy of the silence decision, further avoiding the misjudgment of keywords as non-keywords. Seventh Embodiment. A seventh embodiment of the present invention provides a keyword confirmation device. FIG. 10 is a block diagram of a keyword confirmation apparatus according to the eighth embodiment of the present application. As shown in FIG.
10, a keyword confirmation device according to an embodiment of the present invention includes: an audio data acquisition module 701, configured to acquire first audio data, where the first audio data is identified as a keyword; a cumulative mute probability determination module 702, configured to determine a cumulative mute probability of multiple segments of the second audio data that is temporally continuous with the first audio data; a cumulative keyword probability determination module 703, configured to determine the cumulative keyword probability of multiple segments of the first audio data; and a valid keyword determination module 704, configured to confirm that the first audio data is a valid keyword when the relationship between the cumulative mute probability and the cumulative keyword probability meets a second preset condition. In summary, the keyword confirmation device provided in this embodiment has at least the following advantages: the device exploits the user's general habit of pausing, and thereby producing silence, before or after uttering a keyword. By detecting whether there is silence before and after the keyword, it detects whether the keyword is a valid keyword, which improves detection accuracy and avoids recognition errors. At the same time, using mute-segment detection on the audio data in the silence detection improves the accuracy of the silence decision, further avoiding the misjudgment of keywords as non-keywords. As for the device embodiments, since they are basically similar to the method embodiments, they are described relatively simply; for relevant parts, refer to the description of the method embodiments. FIG. 11 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown in FIG.
11, the terminal device may include an input device 90, a processor 91, an output device 92, a memory 93, and at least one communication bus 94. The communication bus 94 is used to implement communication connections between the components. The memory 93 may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one magnetic disk memory; the memory 93 may store various programs for performing various processing functions and implementing the method steps of this embodiment. Optionally, the processor 91 may be, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component, and the processor 91 is coupled to the input device 90 and the output device 92 through a wired or wireless connection. Optionally, the input device 90 may include multiple input devices, for example at least one of a user-oriented user interface, a device-oriented device interface, a software-programmable interface, a camera, and a sensor. Optionally, the device-oriented device interface may be a wired interface for data transmission between devices, or a hardware insertion interface (for example, a USB interface or a serial port) for data transmission between devices.
Optionally, the user-oriented user interface may be, for example, user-oriented control buttons, a voice input device for receiving voice input, and a touch-sensing device (such as a touch screen or touchpad); optionally, the above software-programmable interface may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or input interface of a chip; optionally, the above-mentioned transceiver may be a radio-frequency transceiver chip with a communication function, a baseband processing chip, and a transceiver antenna. A voice input device such as a microphone can receive voice data. The output device 92 may include output devices such as a display and audio equipment. In this embodiment, the processor of the terminal device includes the functions for executing each module of the data processing device in each device; for specific functions and technical effects, reference may be made to the foregoing embodiments, and details are not described here again. FIG. 12 is a schematic diagram of a hardware structure of a terminal device according to another embodiment of the present application; FIG. 12 is a specific embodiment of the implementation process of FIG. 11. As shown in FIG. 12, the terminal device of this embodiment includes a processor 101 and a memory 102. The processor 101 executes the computer program code stored in the memory 102 to implement the methods of FIG. 1 to FIG. 7 of the foregoing embodiments. The memory 102 is configured to store various types of data to support operation at the terminal device. Examples of such data include instructions for any application or method operated on the terminal device, such as messages, pictures, and videos. The memory 102 may include random-access memory (RAM) and may also include non-volatile memory, such as at least one magnetic disk memory. Optionally, the processor 101 is provided in a processing component 100.
The terminal device may further include a communication component 103, a power supply component 104, a multimedia component 105, an audio component 106, an input/output interface 107, and/or a sensor component 108. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment. The processing component 100 generally controls the overall operation of the terminal device. The processing component 100 may include one or more processors 101 to execute instructions to complete all or part of the steps of the methods of FIG. 1 to FIG. 7 described above. In addition, the processing component 100 may include one or more modules to facilitate interaction between the processing component 100 and other components; for example, the processing component 100 may include a multimedia module to facilitate interaction between the multimedia component 105 and the processing component 100. The power supply component 104 supplies power to the various components of the terminal device. The power supply component 104 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal device. The multimedia component 105 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a liquid crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, the display screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may not only sense the boundary of a touch or slide action but also detect the duration and pressure related to the touch or slide operation. The audio component 106 is configured to output and/or input audio signals.
For example, the audio component 106 includes a microphone (MIC). When the terminal device is in an operation mode such as a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signal may be further stored in the memory 102 or transmitted via the communication component 103. In some embodiments, the audio component 106 further includes a speaker for outputting audio signals. The input/output interface 107 provides an interface between the processing component 100 and peripheral interface modules. The peripheral interface modules may be click wheels, buttons, and the like. These buttons may include, but are not limited to, a volume button, a start button, and a lock button. The sensor component 108 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 108 may detect the on/off state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. The sensor component 108 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 108 may further include a camera and the like. The communication component 103 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log in to the GPRS network and establish communication with the server through the Internet.
As can be seen from the above, the communication component 103, the audio component 106, the input/output interface 107, and the sensor component 108 involved in the embodiment of FIG. 12 can be implemented as the input device of the embodiment of FIG. 11. An embodiment of the present application provides a terminal device, including: one or more processors; and one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the terminal device to execute the method according to one or more of the embodiments of the present application. In one embodiment, the above terminal devices may include various smart terminals with computing and processing functions, such as vehicle terminals, mobile terminals (such as mobile phones, tablets, and personal digital assistants), servers, IoT devices, or smart home hardware. Smart home hardware includes, for example, microwave ovens, ovens, washing machines, dishwashers, air conditioners, routers, smart speakers, TVs, refrigerators, vacuum cleaners, and the like. The above smart terminal device may install an application program, provide a human-machine interactive operation interface, and execute the keyword confirmation method of the foregoing embodiments. For example, these smart terminal devices can receive audio data through their own or an external audio receiving component; after confirming that the second audio data before and/or after the first audio data is mute, they confirm that the first audio data is a valid keyword.
For example, for mobile phones, in this way it can be determined whether the voice command issued by the user is a command instructing an application installed in the phone to perform a corresponding operation, such as turning on music or navigation; for IoT devices or smart home hardware, in this way it can be determined whether the voice command issued by the user is a command instructing the software or system installed therein to perform a corresponding operation, such as connecting other equipment, increasing the temperature of the air conditioner, or turning on the high-temperature baking mode of the oven. This is not particularly limited here. Therefore, it can be seen from the foregoing examples that the present invention can be applied to various types of terminal equipment. Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the various embodiments can be referred to each other. Although the preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they know the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of this application. Finally, it should be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations.
Moreover, the terms "including", "comprising", or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further restriction, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element. The keyword confirmation method and apparatus provided in this application have been described in detail above. Specific examples are used herein to explain the principles and implementation of this application; the description of the above embodiments is only intended to help understand the method of this application and its core ideas. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the ideas of this application. In summary, the contents of this specification should not be construed as limiting this application.

90‧‧‧input device
91‧‧‧processor
92‧‧‧output device
93‧‧‧memory
94‧‧‧communication bus
100‧‧‧processing component
101‧‧‧processor
102‧‧‧memory
103‧‧‧communication component
104‧‧‧power supply component
105‧‧‧multimedia component
106‧‧‧audio component
107‧‧‧input/output interface
108‧‧‧sensor component
200‧‧‧vehicle-mounted terminal
300‧‧‧server
400‧‧‧speaker
500‧‧‧Internet
600‧‧‧passenger
700‧‧‧microphone
S101, S102, S103, S200, S201, S202, S203, S2021, S2022, S2023, S2024, S2025, S300, S301, S302, S303, S304, S401, S402, S403, S501, S502, S503, S504‧‧‧steps

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the invention, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a normal keyword and the audio data before and after it according to an embodiment of the present invention.
FIG. 2 is a flowchart of a keyword confirmation method according to the first embodiment of the present application.
FIG. 3 is a flowchart of a keyword confirmation method according to the second embodiment of the present application.
FIG. 4 is a flowchart of sub-steps of a step in FIG. 3.
FIG. 5 is a flowchart of a keyword confirmation method according to the third embodiment of the present application.
FIG. 6 is a schematic diagram of a vehicle-mounted terminal in a vehicle environment.
FIG. 7 is a flowchart of a keyword confirmation method for a vehicle-mounted terminal according to the fifth embodiment of the present application.
FIG. 8 is a flowchart of a keyword confirmation method for a vehicle-mounted terminal according to the sixth embodiment of the present application.
FIG. 9 is a block diagram of a keyword confirmation apparatus according to the seventh embodiment of the present application.
FIG. 10 is a block diagram of a keyword confirmation apparatus according to the eighth embodiment of the present application.
FIG. 11 schematically shows a block diagram of a terminal device for performing the method according to the present invention.
FIG. 12 schematically shows a storage unit for holding or carrying program code implementing the method according to the present invention.

Claims (20)

1. A keyword confirmation method, comprising: acquiring first audio data, the first audio data being identified as a keyword; determining that there are multiple consecutive mute segments in second audio data that is temporally continuous with the first audio data; and confirming that the first audio data is a valid keyword.

2. The method of claim 1, wherein the step of determining that there are multiple consecutive mute segments in the second audio data that is temporally continuous with the first audio data comprises: determining a pronunciation similarity probability of a segment, the pronunciation similarity probability being the maximum similarity probability between the segment and multiple pronunciation units; determining a mute probability of the segment, the mute probability being the similarity probability between the segment and a mute unit; when the relationship between the pronunciation similarity probability and the mute probability meets a preset condition, determining that the segment is a mute segment; and, based on the determined mute segments, determining that multiple consecutive mute segments exist in the second audio data.
3. The method of claim 1, wherein the step of determining whether at least one segment of the second audio data is a mute segment comprises: determining a pronunciation similarity probability of the segment, the pronunciation similarity probability being the maximum similarity probability between the segment and multiple pronunciation units; determining a mute probability of the segment, the mute probability being the similarity probability between the pronunciation unit corresponding to the maximum similarity probability and a mute unit; when the relationship between the pronunciation similarity probability and the mute probability meets a preset condition, determining that the segment is a mute segment; and, based on the determined mute segments, determining that multiple consecutive mute segments exist in the second audio data.

4. The method of claim 2 or 3, wherein the step of determining that multiple consecutive mute segments exist in the second audio data comprises: determining that three or more consecutive mute segments exist in the second audio data.

5. The method of claim 2 or 3, wherein the preset condition comprises: the absolute value of the difference between the pronunciation similarity probability and the mute probability of the segment is less than a first threshold.
6. The method of claim 1, wherein before the step of acquiring audio data, the method further comprises: detecting whether the collected audio data includes a keyword.

7. The method of claim 6, wherein the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword comprises: when the attribute information of the keyword is a main keyword and the second audio data before the keyword is mute, confirming that the keyword is a valid main keyword.

8. The method of claim 6, wherein the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword comprises: when the attribute information of the keyword is a secondary keyword and the second audio data both before and after the keyword is mute, confirming that the keyword is a valid secondary keyword.
9. A keyword confirmation method, comprising: acquiring first audio data, the first audio data being identified as a keyword; determining a cumulative mute probability of multiple segments of second audio data that is temporally continuous with the first audio data; determining a cumulative keyword probability of multiple segments of the first audio data; and, when the relationship between the cumulative mute probability and the cumulative keyword probability meets a second preset condition, confirming that the first audio data is a valid keyword.

10. The method of claim 9, wherein the second preset condition comprises: the absolute value of the ratio of the cumulative mute probability to the cumulative keyword probability is greater than a second threshold.

11. The method of claim 9, wherein before the step of acquiring audio data, the method further comprises: detecting whether the collected audio data includes a keyword.

12. The method of claim 11, wherein the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword when the relationship between the cumulative mute probability and the cumulative keyword probability meets the second preset condition comprises: when the attribute information of the keyword is a main keyword and the second audio data before the keyword is mute, confirming that the keyword is a valid main keyword.
The method according to claim 11, wherein the keyword has attribute information, and the step of confirming that the first audio data is a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies the second preset condition comprises: when the attribute information of the keyword indicates a secondary keyword and the second audio data both preceding and following the keyword is silence, confirming that the keyword is a valid secondary keyword.

A keyword confirmation method for a vehicle-mounted terminal, comprising: acquiring first audio data through a vehicle-mounted audio collection device, the first audio data being identified as a keyword; determining that a plurality of consecutive silence segments exist in second audio data temporally continuous with the first audio data; and confirming that the first audio data is a valid keyword, wherein the valid keyword is used to wake up the vehicle-mounted terminal to execute an instruction corresponding to the keyword.
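A minimal sketch of the "multiple consecutive silence segments" test in the vehicle-terminal claim above; per-segment silence classification and the required run length of 3 are assumed, since the claim does not fix either:

```python
# Minimal sketch of the consecutive-silence-segment test from the
# vehicle-terminal claim. How each segment is classified as silent and the
# required run length are assumptions, not specified by the patent.

from typing import List


def has_consecutive_silence(segment_is_silent: List[bool],
                            required_run: int = 3) -> bool:
    """Return True when the segment sequence contains at least
    `required_run` silent segments in a row."""
    run = 0
    for silent in segment_is_silent:
        run = run + 1 if silent else 0
        if run >= required_run:
            return True
    return False
```

In this sketch, a keyword would be accepted only when the neighbouring audio contains such an uninterrupted run of silent segments, which is what distinguishes a deliberately spoken wake word from a keyword embedded in continuous speech.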
A keyword confirmation method for a vehicle-mounted terminal, comprising: acquiring first audio data through a vehicle-mounted audio collection device, the first audio data being identified as a keyword; determining a cumulative silence probability of a plurality of segments of second audio data temporally continuous with the first audio data; determining a cumulative keyword probability of a plurality of segments of the first audio data; and when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition, confirming that the first audio data is a valid keyword, wherein the valid keyword is used to wake up the vehicle-mounted terminal to execute an instruction corresponding to the keyword.

A keyword confirmation apparatus, comprising: an audio data acquisition module configured to acquire first audio data, the first audio data being identified as a keyword; a silence segment determination module configured to determine that a plurality of consecutive silence segments exist in second audio data temporally continuous with the first audio data; and a valid keyword determination module configured to confirm that the first audio data is a valid keyword.
A keyword confirmation apparatus, comprising: an audio data acquisition module configured to acquire first audio data, the first audio data being identified as a keyword; a cumulative silence probability determination module configured to determine a cumulative silence probability of a plurality of segments of second audio data temporally continuous with the first audio data; a cumulative keyword probability determination module configured to determine a cumulative keyword probability of a plurality of segments of the first audio data; and a valid keyword determination module configured to confirm that the first audio data is a valid keyword when the relationship between the cumulative silence probability and the cumulative keyword probability satisfies a second preset condition.

A terminal device, comprising: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the terminal device to perform the method according to one or more of claims 1-15.

The terminal device according to claim 18, wherein the terminal device comprises a vehicle-mounted terminal, a mobile terminal, a server, an IoT device, or smart home hardware.

One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause a terminal device to perform the method according to one or more of claims 1-15.
TW107135162A 2017-12-08 2018-10-05 Keyword confirmation method and apparatus TW201928740A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711294885.0A CN109903751B (en) 2017-12-08 2017-12-08 Keyword confirmation method and device
CN201711294885.0 2017-12-08

Publications (1)

Publication Number Publication Date
TW201928740A true TW201928740A (en) 2019-07-16

Family

ID=66697140

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107135162A TW201928740A (en) 2017-12-08 2018-10-05 Keyword confirmation method and apparatus

Country Status (4)

Country Link
US (1) US10950221B2 (en)
CN (1) CN109903751B (en)
TW (1) TW201928740A (en)
WO (1) WO2019113529A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397102B (en) * 2019-08-14 2022-07-08 腾讯科技(深圳)有限公司 Audio processing method and device and terminal
JP7191792B2 (en) * 2019-08-23 2022-12-19 株式会社東芝 Information processing device, information processing method and program
KR102096965B1 (en) * 2019-09-10 2020-04-03 방일성 English learning method and apparatus applying principle of turning bucket
CN111986654B (en) * 2020-08-04 2024-01-19 云知声智能科技股份有限公司 Method and system for reducing delay of voice recognition system

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4393271A (en) 1978-02-14 1983-07-12 Nippondenso Co., Ltd. Method for selectively displaying a plurality of information
JP3284832B2 (en) 1995-06-22 2002-05-20 セイコーエプソン株式会社 Speech recognition dialogue processing method and speech recognition dialogue device
CN1430204A (en) * 2001-12-31 2003-07-16 佳能株式会社 Method and equipment for waveform signal analysing, fundamental tone detection and sentence detection
US8078465B2 (en) * 2007-01-23 2011-12-13 Lena Foundation System and method for detection and analysis of speech
JP4882899B2 (en) * 2007-07-25 2012-02-22 ソニー株式会社 Speech analysis apparatus, speech analysis method, and computer program
US8954334B2 (en) 2011-10-15 2015-02-10 Zanavox Voice-activated pulser
US10276156B2 (en) 2012-02-29 2019-04-30 Nvidia Corporation Control using temporally and/or spectrally compact audio commands
US8924209B2 (en) 2012-09-12 2014-12-30 Zanavox Identifying spoken commands by templates of ordered voiced and unvoiced sound intervals
US9202463B2 (en) 2013-04-01 2015-12-01 Zanavox Voice-activated precision timing
US9390708B1 * 2013-05-28 2016-07-12 Amazon Technologies, Inc. Low latency and memory efficient keyword spotting
US9454976B2 (en) 2013-10-14 2016-09-27 Zanavox Efficient discrimination of voiced and unvoiced sounds
US9373321B2 (en) * 2013-12-02 2016-06-21 Cypress Semiconductor Corporation Generation of wake-up words
US8990079B1 (en) 2013-12-15 2015-03-24 Zanavox Automatic calibration of command-detection thresholds
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
US9530407B2 (en) 2014-06-11 2016-12-27 Honeywell International Inc. Spatial audio database based noise discrimination
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
US10074363B2 (en) * 2015-11-11 2018-09-11 Apptek, Inc. Method and apparatus for keyword speech recognition
JP6690484B2 (en) * 2016-09-15 2020-04-28 富士通株式会社 Computer program for voice recognition, voice recognition device and voice recognition method
KR20180118461A (en) 2017-04-21 2018-10-31 엘지전자 주식회사 Voice recognition module and and voice recognition method
US10424299B2 (en) 2017-09-29 2019-09-24 Intel Corporation Voice command masking systems and methods

Also Published As

Publication number Publication date
US10950221B2 (en) 2021-03-16
WO2019113529A1 (en) 2019-06-13
CN109903751A (en) 2019-06-18
CN109903751B (en) 2023-07-07
US20190180734A1 (en) 2019-06-13

Similar Documents

Publication Publication Date Title
US10438595B2 (en) Speaker identification and unsupervised speaker adaptation techniques
JP6938583B2 (en) Voice trigger for digital assistant
US10818289B2 (en) Method for operating speech recognition service and electronic device for supporting the same
US10657967B2 (en) Method and apparatus for executing voice command in electronic device
TW201928740A (en) Keyword confirmation method and apparatus
US9354842B2 (en) Apparatus and method of controlling voice input in electronic device supporting voice recognition
CN108108142A (en) Voice information processing method, device, terminal device and storage medium
US11133009B2 (en) Method, apparatus, and terminal device for audio processing based on a matching of a proportion of sound units in an input message with corresponding sound units in a database
US8255218B1 (en) Directing dictation into input fields
JP7166294B2 (en) Audio processing method, device and storage medium
CN106297801A (en) Method of speech processing and device
CN109101517B (en) Information processing method, information processing apparatus, and medium
KR20210032875A (en) Voice information processing method, apparatus, program and storage medium
US20120053937A1 (en) Generalizing text content summary from speech content
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
WO2021136298A1 (en) Voice processing method and apparatus, and intelligent device and storage medium
CN114694661A (en) First terminal device, second terminal device and voice awakening method
US10839802B2 (en) Personalized phrase spotting during automatic speech recognition
CN110853633A (en) Awakening method and device
CN113096651A (en) Voice signal processing method and device, readable storage medium and electronic equipment
CN115361180B (en) Voice processing method based on physical key, electronic equipment, device and medium
CN116052653A (en) Fault detection method, device, electronic equipment and readable storage medium
CN114121042A (en) Voice detection method and device under wake-up-free scene and electronic equipment
CN116264078A (en) Speech recognition processing method and device, electronic equipment and readable medium