TW202018696A - Voice recognition method and device and computing device - Google Patents

Voice recognition method and device and computing device

Info

Publication number
TW202018696A
Authority
TW
Taiwan
Prior art keywords
user
voice
sound data
voiceprint
sound
Prior art date
Application number
TW108129251A
Other languages
Chinese (zh)
Inventor
趙情恩
索宏彬
劉剛
著 卓
贇 雷
張平
孫堯
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Publication of TW202018696A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/40 Network security protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L 63/102 Entity profiles

Abstract

Disclosed is a voice recognition method, comprising the steps of: receiving audio data including a first voice; determining whether there is a user matching the first voice; if there is no user matching the first voice, storing the audio data; and clustering multiple pieces of stored audio data so as to determine a new user therefrom. Further disclosed are a corresponding voice recognition device and system as well as a computing device.

Description

Voice recognition method, device and computing device

The present invention relates to the field of speech recognition technology, and in particular to a speech recognition method, apparatus, and computing device.

With the widespread use of terminal devices such as mobile terminals and smart speakers, people have become increasingly accustomed to interacting with these devices by voice. A terminal device may use voiceprint recognition technology to identify the user. Voiceprint identification, also known as speaker identification, is a biometric technology that extracts voice features from the speech signal produced by a speaker and verifies the speaker's identity on that basis. A voiceprint is the sound-wave spectrum in human speech that carries verbal information. Like a fingerprint, a voiceprint has unique biological characteristics and can serve to identify a person; it is not only specific to the speaker but also relatively stable. Typically, a speaker must first register a voiceprint on the terminal device; the device then recognizes the user by that voiceprint, so that the user's behavior can be analyzed from the commands carried in the user's speech and personalized, customized services, such as song recommendations, can be provided. However, most users of terminal devices today do not actively register a voiceprint, so these users cannot be identified accurately, their behavior cannot be analyzed to provide personalized services, and any personalized services that are provided rarely achieve good results. A better voice recognition solution is therefore needed.

To this end, embodiments of the present invention provide a voice recognition method, apparatus, and computing device that seek to solve, or at least mitigate, at least one of the problems above.

According to one aspect of the embodiments of the present invention, a voice recognition method is provided, comprising the steps of: receiving sound data including a first voice; determining whether there is a user matching the first voice; if there is no user matching the first voice, storing the sound data; and clustering multiple pieces of stored sound data in order to determine a new user from them.

Optionally, in the voice recognition method according to an embodiment of the present invention, each user corresponds to a user profile that includes the user's voiceprint, and the step of determining whether there is a user matching the first voice comprises determining whether the first voice matches a user's voiceprint.

Optionally, in the voice recognition method according to an embodiment of the present invention, the step of clustering the stored pieces of sound data to determine a new user comprises: dividing the pieces of sound data into multiple sets based on the pairwise similarity scores between them; determining at least one target set based on the sample density and sample count of each set, the target set corresponding to a new user; and creating a user profile for the new user corresponding to the target set and using at least part of the sound data in the target set to generate that new user's voiceprint.

Optionally, in the voice recognition method according to an embodiment of the present invention, the step of using at least part of the sound data in the target set to generate the new user's voiceprint comprises: selecting the sound data used to generate the voiceprint according to its distance from the centroid of the target set.

Optionally, in the voice recognition method according to an embodiment of the present invention, the user profile includes a user flag indicating whether the user registered actively; the step of creating a user profile for the new user corresponding to the target set comprises setting the user flag in that profile to non-active registration; and the method further comprises: when there is a user matching the first voice and the corresponding user flag indicates non-active registration, recording the number of pieces of sound data received from that user.

Optionally, the voice recognition method according to an embodiment of the present invention further comprises: after recording the number of pieces of sound data from the user, determining whether that number reaches a specific count within a specific period; and, if not, deleting the user profile corresponding to that user.

Optionally, in the voice recognition method according to an embodiment of the present invention, the user profile further includes the device identifier of a terminal device associated with the user, and the method comprises: receiving the device identifier of the terminal device that sent the sound data; determining from the device identifier whether there is a user associated with that terminal device; and, if there is not, storing the sound data.

Optionally, the voice recognition method according to an embodiment of the present invention further comprises: when there is a user matching the first voice, storing the instruction corresponding to the first voice in association with that user.

Optionally, the voice recognition method according to an embodiment of the present invention further comprises: receiving sound data including a second voice, the second voice being used to actively register a new user; creating a user profile for the actively registered new user and using the sound data including the second voice to generate the new user's voiceprint; and setting the user flag in the profile created for the actively registered new user to active registration.

Optionally, the voice recognition method according to an embodiment of the present invention further comprises: receiving the device identifier of the terminal device that sent the sound data including the second voice, and storing the device identifier in the corresponding user profile in association with the actively registered new user.

Optionally, in the voice recognition method according to an embodiment of the present invention, the step of determining whether the first voice matches a user's voiceprint comprises: extracting voice features of the first voice from the sound data including the first voice; obtaining a similarity score between the first voice and the user's voiceprint based on those voice features; and determining from the similarity score whether the first voice matches the user's voiceprint.

According to another aspect of the embodiments of the present invention, a user recognition method is provided, comprising the steps of: receiving sound data including a first voice; determining whether there is a user matching the first voice; if there is no user matching the first voice, storing the sound data; and clustering multiple pieces of stored sound data in order to determine a new user from them and perform behavior analysis on the new user.

According to another aspect of the embodiments of the present invention, a voice recognition apparatus is provided, comprising: a communication module adapted to receive sound data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, if there is no such user, to store the sound data in a sound storage module; the sound storage module, adapted to store sound data; and a user discovery module adapted to cluster the pieces of sound data stored in the sound storage module in order to determine a new user from them.

According to another aspect of the embodiments of the present invention, a user recognition apparatus is provided, comprising: a communication module adapted to receive sound data including a first voice; a voice recognition module adapted to determine whether there is a user matching the first voice and, if there is no such user, to store the sound data in a sound storage module; the sound storage module, adapted to store sound data; and a user discovery module adapted to cluster the pieces of sound data stored in the sound storage module in order to determine a new user from them and perform behavior analysis on the new user.

According to another aspect of the embodiments of the present invention, a voice recognition system is provided, comprising a terminal device and a server, wherein the terminal device is adapted to receive a speaker's voice and send sound data including the voice to the server, and the voice recognition apparatus according to the present invention resides on the server.

According to yet another aspect of the embodiments of the present invention, a computing device is provided, comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing the voice recognition method according to the present invention.

According to the voice recognition scheme of the embodiments of the present invention, a new user is determined by clustering multiple pieces of stored sound data. The entire new-user determination process is imperceptible to the user, eliminating the need for an active registration step and improving the user experience.
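Purely as an illustration of the flow just summarized, and not as part of the claimed embodiments, the match-or-store step might be sketched as follows in Python. All names here (UserProfile, handle_utterance, the 0.7 threshold) are hypothetical, and utterances are represented by already-extracted voiceprint embedding vectors rather than raw sound data:

```python
import numpy as np
from dataclasses import dataclass, field

SIM_THRESHOLD = 0.7   # assumed value; the embodiments only require "a similarity threshold"

@dataclass
class UserProfile:
    voiceprint: np.ndarray                 # a voiceprint vector, e.g. an i-vector or d-vector
    actively_registered: bool = True       # the "user flag" of the optional embodiments
    commands: list = field(default_factory=list)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, one of the scoring choices named in the description."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def handle_utterance(vec: np.ndarray, command: str,
                     profiles: list, unmatched: list):
    """Match one utterance against all enrolled voiceprints, or store it for clustering."""
    scores = [cosine(vec, u.voiceprint) for u in profiles]
    if scores and max(scores) > SIM_THRESHOLD:
        user = profiles[int(np.argmax(scores))]
        user.commands.append(command)      # store the instruction in association with the user
        return user
    unmatched.append(vec)                  # no matching user: keep for periodic clustering
    return None

# Example: the first utterance from an unknown speaker is stored for later clustering.
profiles, unmatched = [], []
handle_utterance(np.random.default_rng(0).normal(size=64), "play a song", profiles, unmatched)
```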

Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.

FIG. 1 shows a schematic diagram of a voice recognition system 100 according to an embodiment of the present invention. As shown in FIG. 1, the voice recognition system 100 includes a terminal device 102 and a server 106.

The terminal device 102 is the receiver of any speaker's voice. A speaker can interact with the server 106 by voice via the terminal device 102. The terminal device 102 may be a computing device coupled to the server 106 through one or more networks 105, such as a local area network (LAN) or a wide area network (WAN) such as the Internet. For example, the terminal device 102 may be a desktop computing device, a laptop computing device, a tablet computing device, a mobile phone computing device, a speaker computing device, a computing device of a vehicle (for example, an in-vehicle communication system, an in-vehicle entertainment system, or an in-vehicle navigation system), a wearable apparatus that includes a computing device (for example, a watch or glasses with a computing device), or a home apparatus that includes a computing device (for example, a speaker, a television, or a washing machine with a computing device). Although a speaker may well operate multiple computing devices, for brevity the examples in this disclosure address the speaker operating the terminal device 102.

The terminal device 102 may operate one or more applications and/or components, which may be involved in providing notifications to the speaker and providing various types of signals. These applications and/or components may include, but are not limited to, a microphone 103, an output device 104, a position coordinate component such as a global positioning system ("GPS") component (not shown in FIG. 1), and so on. In some embodiments, one or more of these applications and/or components may run on multiple terminal devices operated by the speaker. Other components of the terminal device 102 not shown in FIG. 1 include, but are not limited to, a barometer, a camera, a light sensor, a presence sensor, a thermometer, health sensors (for example, a heart rate monitor, a blood glucose meter, or a blood pressure meter), an accelerometer, a gyroscope, and so on.

In some embodiments, the output device 104 may include one or more of a speaker (or speakers), a screen, a touch screen, one or more notification lights (for example, light-emitting diodes), a printer, and so on. In some embodiments, the output device 104 may be used to provide output based on one or more operations invoked in response to the speaker's voice, such as opening a program, playing a song, sending an email or text message, or taking a photograph.

The terminal device 102 includes one or more memories for storing data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. In some embodiments, the terminal device 102 may be configured to sense one or more audible sounds (for example, speech uttered by the speaker), for instance using the microphone 103, and may provide sound data based on the sensed sounds (also called the "sound input") to various other computing devices. Those other computing devices (examples of which are described in more detail below) may perform various operations based on the sound data in order to identify matching sound data. In various embodiments, the sound data may include: a raw recording of one or more utterances of the speaker; a compressed version of such a recording; an indication of one or more characteristics of the sound input obtained via the microphone 103 of the terminal device 102, such as pitch, tone, timbre, and/or volume; and/or a transcription of the sound input obtained via the microphone 103; and so on.

In some embodiments, the terminal device 102 sends the sound data including the speaker's voice to the server 106, on which a voice recognition device 200 resides. Of course, in other embodiments, the voice recognition device 200 may instead reside on the terminal device 102; that is, the processing described below is performed directly on the terminal device 102.

FIG. 2 shows a structural block diagram of a voice recognition device 200 according to an embodiment of the present invention. As shown in FIG. 2, the voice recognition device 200 includes a communication module 210, a voice recognition module 220, a sound storage module 230, and a user discovery module 240.

The communication module 210 may receive, from the terminal device 102, sound data including a first voice, where the first voice is generally used to instruct the terminal device 102 to perform an operation. The voice recognition module 220 performs speech recognition on the sound data to obtain the instruction corresponding to the first voice, and then returns a response to the instruction to the terminal device 102 via the communication module 210, so that the terminal device 102 performs the corresponding operation at least according to that response.

For example, in one embodiment, the terminal device 102 may be implemented as a smart speaker with a computing device. The speaker device receives the speech uttered by the speaker, "play the song Blue and White Porcelain", and sends the sound data including that speech to the server 106. The server 106 returns the corresponding response, the audio file of "Blue and White Porcelain", to the speaker device.
The speaker device then performs the corresponding operation according to the response: playing that audio file.

Of course, the process of performing speech recognition on the sound data to obtain the instruction may also be carried out on the terminal device 102; that is, the terminal device 102 performs speech recognition on the sound data and then sends the sound data together with the recognized instruction to the voice recognition device 200.

The voice recognition module 220 also determines whether there is a user matching the first voice. Generally, a user is a speaker whose identity the voice recognition system identifies. According to one embodiment, each user corresponds to a user profile that records data related to that user; these user profiles may be stored in a user data storage device coupled to the voice recognition device 200, or in a user data storage module (not shown in FIG. 2) included in the voice recognition device 200.

Generally, a user's biometrics, such as fingerprints, voiceprints, and irises, can be used to uniquely identify the user. In some embodiments of the present invention, a voiceprint may be used to uniquely identify the user; a voiceprint is the sound-wave spectrum carrying verbal information in the speaker's voice and can uniquely identify the speaker. The voice recognition module 220 may use various voiceprint recognition technologies to determine whether there is a user matching the first voice. Specifically, in various embodiments, the user profile may include the user's voiceprint, and the voice recognition module 220 may determine whether there is a user matching the first voice by determining whether the first voice matches a user's voiceprint. The process of determining whether the first voice matches a user's voiceprint is described in detail below.

In some embodiments, the sound data may be subjected to various levels of preprocessing before being matched to a user by the voice recognition module 220; in some embodiments, such preprocessing allows the voice recognition module 220 to perform recognition more efficiently. In various embodiments, the preprocessing may be performed by the terminal device 102 or by another component, such as a component of the voice recognition device 200; in some embodiments, the voice recognition module 220 itself may preprocess the sound data.

As a non-limiting example of preprocessing, the sound data may initially be captured, for example by the microphone 103 of the terminal device 102, as raw data (for example, in a "lossless" form such as a WAV file or a "lossy" form such as an MP3 file). Such raw data may be preprocessed, for example by the terminal device 102 or by one or more components of the voice recognition device 200, to facilitate recognition. In various embodiments, the preprocessing may include: sampling; quantization; removal of non-speech and silent sound data; and framing and windowing of the sound data that includes speech for subsequent processing; and so on.

After preprocessing, the voice recognition module 220 may extract the voice features of the first voice from the sound data including the first voice and match the first voice against a user's voiceprint based on those features.
In some embodiments, the voice features may be one of, or a combination of, filter bank (FBank) features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, deep features, and power-normalized cepstral coefficients (PNCC). In one embodiment, the voice recognition module 220 may also normalize the extracted voice features.

The voice recognition module 220 then matches the first voice against the user's voiceprint based on the voice features of the first voice to obtain a similarity score between the first voice and the user's voiceprint, and determines from that score whether the user matches the first voice.

Specifically, in some embodiments, the user's voiceprint is described by a voiceprint model, for example a hidden Markov model (HMM) or a Gaussian mixture model (GMM). The user's voiceprint model takes voice features as its features and is trained on sound data that includes the user's voice (hereinafter simply the user's sound data). The voice recognition module 220 may use a matching function to compute the similarity between the first voice and the user's voiceprint: for example, the posterior probability that the voice features of the first voice match the user's voiceprint model may be computed as the similarity score, or the likelihood of the voice features of the first voice under the user's voiceprint model may be used instead.

However, since training a good voiceprint model for a user requires a large amount of that user's sound data, in some embodiments the user's voiceprint model may be obtained from a user-independent universal background model adapted with a small amount of the user's sound data (again with voice features as the model's features). For example, a universal background model (UBM) representing the user-independent feature distribution may first be trained, via the expectation-maximization (EM) algorithm, on the sound data of many speakers unrelated to the user. Then, based on the UBM, a GMM representing the user's feature distribution may be trained from a small amount of the user's sound data through an adaptive algorithm, such as maximum a posteriori (MAP) estimation or maximum likelihood linear regression (MLLR); the GMM obtained this way is called a GMM-UBM model, and it serves as the user's voiceprint model. In this case, the voice recognition module 220 may, based on the voice features of the first voice, match the first voice against both the user's voiceprint model and the universal background model to obtain the similarity score between the first voice and the user's voiceprint: for example, compute the likelihoods of the voice features of the first voice under the UBM and under the GMM-UBM model, divide the two likelihoods, and take the logarithm, using the resulting value as the similarity score.

In other embodiments, the user's voiceprint is described by a voiceprint vector, for example an i-vector, d-vector, x-vector, or j-vector. The voice recognition module 220 may extract the voiceprint vector of the first voice based at least on the voice features of the first voice. According to one embodiment, a voiceprint model of the speaker of the first voice may first be trained using the voice features of the first voice.
As described earlier, this voiceprint model of the first voice's speaker can be trained from the voice features of the first voice on top of the pre-trained, user-independent universal background model. Once that voiceprint model is obtained, the mean supervector of the first voice can be extracted from it: for example, the means of the GMM components of the speaker's GMM-UBM model can be concatenated to form the mean supervector of that GMM-UBM model, that is, the mean supervector of the first voice. Joint factor analysis (JFA), or a simplified variant of it, can then be used to extract a low-dimensional voiceprint vector from this mean supervector. Taking the i-vector as an example, after the user-independent universal background model (UBM) has been trained, its mean supervector can be extracted and the total variability space (T) matrix estimated. The i-vector of the first voice is then computed from the mean supervector of the first voice, the T matrix, and the mean supervector of the universal background model. Specifically, the i-vector can be calculated according to the following formula:
$$M_{s,h} = m + T w$$

where $M_{s,h}$ is the mean supervector obtained from speech $h$ of speaker $s$, $m$ is the mean supervector of the universal background model, $T$ is the total variability space matrix, and $w$ is the total variability factor, that is, the i-vector.
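As a toy numerical check of this factor model (and only that: real i-vector extraction estimates $w$ from Baum-Welch statistics with a posterior mean, not a plain least-squares fit), one can verify that a supervector built as $M = m + Tw$ allows $w$ to be recovered when $T$ is known. The dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim, iv_dim = 1024, 64              # supervector and i-vector sizes (illustrative)

m = rng.normal(size=sv_dim)            # UBM mean supervector
T = rng.normal(size=(sv_dim, iv_dim))  # total variability space matrix
w_true = rng.normal(size=iv_dim)       # the total variability factor, i.e. the i-vector

M = m + T @ w_true                     # utterance mean supervector per the formula

# Recover w from M: a plain least-squares fit stands in for the posterior estimate.
w_est, *_ = np.linalg.lstsq(T, M - m, rcond=None)
print(np.allclose(w_est, w_true, atol=1e-6))   # True
```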
According to another embodiment, a trained deep neural network (DNN) may also be used to obtain the voiceprint vector of the first voice. Taking the d-vector as an example, the DNN may include an input layer, hidden layers, and an output layer: the FBank features of the first voice are fed to the DNN input layer, and the output of the last hidden layer of the DNN is the d-vector.

After the voiceprint vector of the first voice is obtained, the voice recognition module 220 may compute a similarity score between the first voice and the user's voiceprint based on the voiceprint vector of the first voice and the user's voiceprint vector. Algorithms such as support vector machines (SVM), linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), likelihood, and cosine distance may be used to compute this similarity score.

Taking the PLDA algorithm as an example, suppose the speech data consist of the speech of $I$ speakers, each of whom has $J$ different segments of speech, and define the $j$-th segment of speech of the $i$-th speaker as $x_{ij}$. The generative model of $x_{ij}$ is then defined as:

$$x_{ij} = \mu + F h_i + G w_{ij} + \epsilon_{ij}$$

where $\mu$ is the mean of the voiceprint vectors, and $F$ and $G$ are spatial feature matrices representing, respectively, the between-speaker feature space and the within-class feature space: each column of $F$ corresponds to an eigenvector of the between-class feature space, and each column of $G$ to an eigenvector of the within-class feature space. The vectors $h_i$ and $w_{ij}$ can be regarded as the representations of the speech in those respective spaces, and $\epsilon_{ij}$ is the noise term. The greater the likelihood that two utterances share the same $h_i$ feature, that is, the higher the similarity score, the more likely they come from the same speaker.

PLDA has four model parameters, $\mu$, $F$, $G$, and the noise covariance $\Sigma$, which are trained iteratively with the EM algorithm. A simplified version of the PLDA model is commonly used, which ignores the within-class feature space matrix $G$ and trains only the between-class feature space matrix $F$, namely:

$$x_{ij} = \mu + F h_i + \epsilon_{ij}$$

Based on the voiceprint vector of the first voice, the voice recognition module 220 can obtain the $h$ feature of the first voice with reference to the formula above; likewise, the $h$ feature of the user's speech is obtained from the user's voiceprint vector.
The log-likelihood ratio or the cosine distance between the two $h$ features is then used as the similarity score between the first voice and the user's voiceprint.

It should be noted that voiceprints are not limited to the voiceprint vectors above (i-vector, d-vector, x-vector, and so on) or to the voiceprint models above (HMM, GMM, and so on); the corresponding similarity-scoring algorithm can likewise be chosen freely according to the selected voiceprint, and the present invention places no restriction on this.

In various embodiments, if the resulting similarity score exceeds a similarity threshold, the voice recognition module 220 determines that the first voice matches that user's voiceprint, that is, matches the user corresponding to the voiceprint; otherwise it determines that the first voice does not match that user's voiceprint. The voice recognition module 220 may match the first voice against each user's voiceprint to determine whether there is any user matching the first voice. When a matching user exists, the module 220 may, in addition to performing speech recognition on the sound data to obtain the instruction, store the instruction corresponding to the first voice in association with the matched user, for example in that user's profile. The voice recognition device 200 can then analyze the user's behavioral preferences from all instructions received from that user and provide personalized, customized services: for example, it can analyze the user's song preferences from all of the user's song-playing instructions and recommend songs matching those preferences.

When no user matches the first voice, the voice recognition module 220 may store the piece of sound data including the first voice in the sound storage module 230, which is adapted to store sound data. The user discovery module 240 may cluster the pieces of sound data stored in the sound storage module 230 in order to determine new users from them. For sound data subsequently received that includes a new user's voice, the voice recognition device 200 can then match that new user and store the corresponding instruction in association with that user, so that the new user's behavioral preferences can later be analyzed from all of that user's instructions and personalized services provided. In some embodiments, the user discovery module 240 may, at predetermined intervals, extract the stored pieces of sound data (for example, a fixed number of pieces) for clustering.

Specifically, the user discovery module 240 first divides the pieces of sound data into multiple sets based on the pairwise similarity scores between them; the sound data within each set can be regarded as mutually similar. In one embodiment, a clustering algorithm may be used to perform this division (one possible realization is sketched below). The computation of the similarity score is as described above for the score between the first voice and a user's voiceprint and is not repeated here.
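The description does not fix a particular clustering algorithm for this division, so the following is only one possible, hypothetical reading: threshold the pairwise cosine similarity scores into a graph and take its connected components as the sets. The function name and threshold are assumptions:

```python
import numpy as np

def split_into_sets(vecs: np.ndarray, thr: float = 0.7) -> list:
    """Divide utterance embeddings (n x d) into sets using pairwise similarity scores.

    Two utterances are linked when their cosine similarity exceeds `thr`;
    connected components of the resulting graph are the candidate sets.
    """
    norms = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)
    sim = norms @ norms.T                      # pairwise cosine similarity matrix
    adj = sim > thr
    n, seen, sets = len(vecs), set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]              # depth-first search over the graph
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            comp.append(i)
            stack.extend(j for j in range(n) if adj[i, j] and j not in seen)
        sets.append(comp)                      # indices of one candidate set
    return sets
```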
The user discovery module 240 then determines at least one target set based on the sample characteristics of each set, each target set corresponding to one new user. Here the sample characteristics may include sample density, sample count, and so on, where a sample is a piece of sound data. In one embodiment, the sample density and sample count of each set are computed, and the sets whose density and count satisfy predetermined conditions are selected as target sets. The predetermined conditions may be, for example, that the sample density exceeds a predetermined density and that the sample count exceeds a predetermined count; they can be configured according to how many target sets need to be determined, and the present invention places no restriction on this.

After a target set is determined (that is, a new user is discovered), the user discovery module 240 creates a user profile for the new user corresponding to that target set and uses at least part of the sound data in the target set to generate the new user's voiceprint. The voiceprint may be a voiceprint model or a voiceprint vector: for example, a GMM or GMM-UBM model may be trained on the voice features of this sound data as the new user's voiceprint, or a voiceprint vector may be extracted from those features. The specific generation process is as described above for voiceprints and is not repeated here.

The sound data used to generate the voiceprint may be selected from the target set at random, or according to distance from the centroid of the target set: first determine the centroid of the target set, then compute the distance from each sample in the target set to that centroid, and select the samples with the smaller distances as the sound data used to generate the new user's voiceprint. Computing a centroid is routine in the art and is not described further here.

If no target set is determined (that is, no new user is found), for example because no set satisfies the predetermined conditions, the user discovery module 240 may delete this sound data, namely the pieces of sound data previously extracted from the sound storage module 230.
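Continuing the sketch, the density/count test and the centroid-distance selection described above might be realized as below. The thresholds are placeholders, and taking the mean similarity to the centroid as the "sample density" is an assumption, since the description leaves the predetermined conditions configurable:

```python
import numpy as np

def select_enrollment_samples(set_vecs: np.ndarray,
                              min_count: int = 10, min_density: float = 0.5,
                              keep: int = 5):
    """If a candidate set qualifies as a target set, pick the samples nearest its centroid.

    Returns the `keep` embeddings used to build the new user's voiceprint,
    or None when the set does not correspond to a new user.
    """
    centroid = set_vecs.mean(axis=0)
    cnorm = centroid / (np.linalg.norm(centroid) + 1e-9)
    vnorm = set_vecs / (np.linalg.norm(set_vecs, axis=1, keepdims=True) + 1e-9)
    density = float((vnorm @ cnorm).mean())            # mean similarity to the centroid
    if len(set_vecs) < min_count or density < min_density:
        return None                                    # not a target set: no new user
    dists = np.linalg.norm(set_vecs - centroid, axis=1)
    return set_vecs[np.argsort(dists)[:keep]]          # smallest centroid distances first
```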
Understandably, creating a user profile can be regarded as registering the user. Typically, a user may actively provide sound data including the user's voice (for example, by sending an active registration request to the server via the terminal device and recording speech for specific text according to the registration prompts), so that the user's voiceprint is generated from this actively provided sound data; this user-initiated process can be regarded as active registration. By contrast, the user is unaware of the process of discovering a new user through clustering, creating a profile for that user, and generating a voiceprint, so that process can be regarded as non-active registration.

According to an embodiment of the present invention, the user profile may further include a user flag indicating whether the user registered actively. As shown in FIG. 2, the voice recognition device 200 may further include a user registration module 250. The communication module 210 may receive sound data including a second voice, which is generally used to actively register a new user; for example, it may be speech recorded in response to a registration prompt from the terminal device 102. The user registration module 250 then creates a user profile for the actively registered new user, uses the sound data including the second voice to generate the new user's voiceprint, and sets the user flag in that profile to active registration.

Correspondingly, when the user discovery module 240 creates a user profile for a new user corresponding to a target set discovered by clustering, it may set the user flag in the created profile to non-active registration.

In this way, after determining that a user matches the first voice, the voice recognition module 220 can determine from the user flag in the corresponding profile whether that user registered actively. If the flag indicates non-active registration, the module 220 may record the number of pieces of sound data received from that user; specifically, the user profile may include this count, and each time a piece of sound data from the user is received, the module 220 increments it by one. Correspondingly, when the user discovery module 240 creates a profile for a non-actively registered new user, it may set this count in the created profile to an initial value, typically 0.

The voice recognition module 220 may also determine whether the number of pieces of sound data from the user reaches a specific count within a specific period (for example, within one month of registration). If it does not, the module 220 may delete that user's profile, that is, deregister the user; if it does, no action need be taken.

According to another embodiment of the present invention, the user profile may further include the device identifier of the terminal device associated with the user. For example, during active registration the communication module 210 may receive the device identifier of the terminal device that sent the sound data including the second voice, and the user registration module 250 may store that identifier in the corresponding profile in association with the actively registered new user. The voice recognition module 220 can then, when receiving sound data, also receive the device identifier of the terminal device that sent it and, before determining whether any user matches the first voice, first determine from the device identifier whether there is a user associated with that terminal device, that is, look up whether a user profile containing that device identifier exists. If there is no user associated with the terminal device, the voice recognition module 220 may store the sound data in the sound storage module 230; if there is, the module 220 determines whether there is a user matching the first voice.
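The bookkeeping around the user flag, the utterance count, the device identifiers, and the retention test might look like the following sketch. The one-month window and the minimum count are illustrative stand-ins for the "specific period" and "specific count" left open in the description, and the profile fields extend the hypothetical dataclass used earlier:

```python
import time
from dataclasses import dataclass, field

RETENTION_WINDOW_S = 30 * 24 * 3600   # e.g. one month since (non-active) registration
MIN_UTTERANCES = 3                    # assumed value for the "specific count"

@dataclass
class UserProfile:
    voiceprint: object
    actively_registered: bool
    device_ids: set = field(default_factory=set)   # terminal devices tied to this user
    created_at: float = field(default_factory=time.time)
    utterance_count: int = 0          # starts at the initial value 0

def on_matched_utterance(user: UserProfile):
    """Count utterances only for users discovered by clustering (non-active registration)."""
    if not user.actively_registered:
        user.utterance_count += 1

def purge_inactive(profiles: list):
    """Delete profiles of discovered users that stayed too quiet within the window."""
    now = time.time()
    profiles[:] = [u for u in profiles
                   if u.actively_registered
                   or now - u.created_at < RETENTION_WINDOW_S
                   or u.utterance_count >= MIN_UTTERANCES]
```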
In addition, an embodiment of the present invention also provides a user identification device. The user identification device includes a communication module, a voice recognition module, a sound storage module, and a user discovery module. The communication module receives sound data including a first voice. The voice recognition module can determine whether there is a user matching the first voice and, if there is not, store the sound data to the sound storage module. The sound storage module stores the sound data. The user discovery module can cluster multiple pieces of sound data stored in the sound storage module, so as to determine new users from those pieces of sound data and conduct behavior analysis on them. For example, a new user's behavior preferences can be analyzed according to the instructions corresponding to that user's voice, so as to provide the new user with personalized services. The processing of each module in the user identification device can be the same as the processing of the corresponding module in the voice recognition device 200 described above in conjunction with FIGS. 1 and 2, and can achieve similar technical effects, which will not be repeated here. The specific structures and corresponding processing methods of the modules and devices mentioned above will be described below with reference to the drawings. According to an embodiment of the present invention, the various components of the voice recognition device 200 (and the user identification device), such as the modules above, can be implemented by the computing device 300 described below. FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the invention. As shown in FIG. 3, in the basic configuration 302, the computing device 300 typically includes a system memory 306 and one or more processors 304. A memory bus 308 may be used for communication between the processor 304 and the system memory 306. Depending on the desired configuration, the processor 304 may be any type of processor, including but not limited to a microprocessor (µP), a microcontroller (µC), a digital signal processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level-1 cache 310 and a level-2 cache 312, a processor core 314, and registers 316. The example processor core 314 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 318 may be used with the processor 304, or, in some implementations, the memory controller 318 may be an internal part of the processor 304. Depending on the desired configuration, the system memory 306 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 306 may include an operating system 320, one or more application programs 322, and program data 324. In some embodiments, the application program 322 may be arranged to execute, on the operating system, instructions by the one or more processors 304 using the program data 324. The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (e.g., an output device 342, a peripheral interface 344, and a communication device 346) to the basic configuration 302 via a bus/interface controller 330. The example output device 342 includes a graphics processing unit 348 and a sound processing unit 350.
They may be configured to facilitate communication with various external devices, such as displays or speakers, via one or more A/V ports 352. The example peripheral interface 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication, via one or more I/O ports 358, with input devices (e.g., keyboards, mice, pens, voice input devices, touch input devices) or other peripheral devices (e.g., printers, scanners, etc.). The example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364. The network communication link may be an example of a communication medium. Communication media can generally be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and can include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. As a non-limiting example, communication media may include wired media, such as a wired network or a dedicated-line network, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media. The computing device 300 can be implemented as a server, such as a database server, an application server, or a web server, or as a personal computer, including desktop and notebook computers. Of course, the computing device 300 may also be implemented as part of a small-sized portable (or mobile) electronic device. In the embodiment according to the present invention, the computing device 300 is implemented as the voice recognition device 200 and is configured to perform the voice recognition method 400 according to an embodiment of the present invention. Here, the application program 322 of the computing device 300 includes multiple program instructions for executing the voice recognition method 400 according to an embodiment of the present invention, and the program data 324 can also store configuration information of the voice recognition system 100 and the like. FIG. 4 shows a voice recognition method 400 according to an embodiment of the present invention. As shown in FIG. 4, the voice recognition method 400 starts at step S410, in which sound data including a first voice is received. As mentioned above, the first voice is generally a voice instructing the terminal device 102 to perform an operation. Therefore, according to an embodiment of the present invention, speech recognition can be performed on the sound data to obtain the instruction corresponding to the first voice, and a response result to the instruction is then returned to the terminal device 102, so that the terminal device 102 performs the corresponding operation at least according to the response result. Then, in step S420, it may be determined whether there is a user matching the first voice. According to one embodiment, each user corresponds to a user profile that records user-related data.
These user profiles may be stored in a user data storage device coupled to the voice recognition device 200, or in a user data storage module included in the voice recognition device 200. Generally, a user's biometrics, such as fingerprints, voiceprints, and irises, can be used to uniquely identify that user. In one embodiment of the present invention, a voiceprint is used to uniquely identify the user: the user profile includes the user's voiceprint, and whether there is a user matching the first voice is determined by checking whether the first voice matches a user's voiceprint. Specifically, the voice features of the first voice may be extracted from the sound data including the first voice. In some embodiments, the voice features may be one or a combination of filter-bank (FBank) features, Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, deep features, and power-normalized cepstral coefficients (PNCC). Then, a similarity score between the first voice and a user's voiceprint is obtained based on the voice features of the first voice, and whether the first voice matches that voiceprint is determined according to the similarity score: if the similarity score exceeds a similarity threshold, the first voice is determined to match the user's voiceprint; otherwise, it is determined not to match. If there is a user matching the first voice, the instruction corresponding to the first voice may be stored in association with the matched user. If there is no user matching the first voice, then in step S430 the piece of sound data is stored. Then, in step S440, the stored multiple pieces of sound data are clustered to determine a new user from them. Specifically, based on the pairwise similarity scores between pieces of sound data, the pieces of sound data may be divided into multiple sets. At least one target set is then determined based on the sample density and the number of samples of each set, where a target set corresponds to a new user. Finally, a user profile is created for the new user corresponding to the target set, and at least part of the sound data in the target set is used to generate the new user's voiceprint. In one embodiment, the sound data used to generate the new user's voiceprint may be determined according to the distance to the centroid of the target set: first determine the centroid of the target set, then calculate the distance from each sample in the target set to that centroid, and select the samples with the smaller distances. If no target set is determined, the sound data, that is, the previously stored multiple pieces of sound data, can be deleted.
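A minimal sketch of the matching check in step S420 follows, assuming MFCC features averaged into a fixed-length vector and cosine similarity as the similarity score. The embodiment permits other features (FBank, PLP, deep features, PNCC) and other scoring methods; the averaging stand-in for a trained speaker encoder and the threshold value below are assumptions.

```python
import numpy as np
import librosa

SIM_THRESHOLD = 0.75   # assumed similarity threshold

def utterance_embedding(wav_path: str) -> np.ndarray:
    """Reduce an utterance to a fixed-length vector by averaging MFCC frames.
    A real system would use a trained speaker encoder; this is a stand-in."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)   # shape (20, frames)
    return mfcc.mean(axis=1)

def matches_voiceprint(embedding: np.ndarray, voiceprint: np.ndarray) -> bool:
    # Cosine similarity between the first voice and the stored voiceprint vector.
    score = float(np.dot(embedding, voiceprint) /
                  (np.linalg.norm(embedding) * np.linalg.norm(voiceprint) + 1e-9))
    return score > SIM_THRESHOLD
```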
According to an embodiment of the present invention, the user profile may further include a user flag indicating whether the user is actively registered; when creating a user profile for a new user corresponding to a target set, the user flag in the user profile may be set to inactive registration. When there is a user matching the first voice and the corresponding user flag indicates that the user is not actively registered, the number of pieces of voice data from the user can also be recorded, to determine whether it reaches a specific amount within a specific time period. If not, the user profile corresponding to the user can be deleted. According to an embodiment of the present invention, the voice recognition method 400 may further include the steps of: receiving sound data including a second voice, which is generally used to actively register a new user; creating a user profile for the actively registered new user; using the sound data including the second voice to generate the new user's voiceprint; and setting the user flag in the actively registered new user's profile to active registration. According to an embodiment of the present invention, the user profile may further include the device identification of the terminal device associated with the user, and the voice recognition method 400 may further include the steps of: receiving the device identification of the terminal device sending the above sound data, and determining, based on the device identification, whether there is a user associated with that terminal device; if there is not, storing the above sound data. The specific steps and embodiments of the voice recognition method 400 have been disclosed in detail in the description of the voice recognition system 100 in conjunction with FIGS. 1 to 3 and will not be repeated here. In addition, an embodiment of the present invention also provides a user identification method, including the steps of: receiving sound data including a first voice; determining whether there is a user matching the first voice; if there is no user matching the first voice, storing the sound data; and clustering the stored multiple pieces of sound data, so as to determine new users from them and conduct behavior analysis on the new users. The processing of each step in the user identification method may be the same as the processing of the corresponding step in the voice recognition method 400 described above in conjunction with FIG. 4 and can achieve similar technical effects, which will not be repeated here. In summary, according to the voice recognition scheme of the embodiments of the present invention, new users are determined, and their voiceprints generated, by clustering the stored multiple pieces of sound data, so that those users can later be identified by voiceprint and their behavior preferences analyzed according to their instructions, allowing more accurate personalized services to be provided. In addition, the entire process of determining new users and generating their voiceprints is imperceptible to the user, eliminating the need for an active registration operation and improving the user experience. It should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim.
Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Therefore, the claims following a specific embodiment are hereby expressly incorporated into that specific embodiment, with each claim standing on its own as a separate embodiment of the present invention. Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or divided into multiple sub-modules. Those skilled in the art can understand that the modules in the devices in the embodiments can be adaptively changed and arranged in one or more devices different from those of the embodiments. The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose. In addition, those skilled in the art can understand that, although some embodiments described herein include certain features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination. Furthermore, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of performing the described functions. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, the elements of the device embodiments described herein are examples of means for carrying out the functions performed by the elements for the purpose of implementing the invention. As used herein, unless otherwise specified, the use of the ordinal words "first", "second", "third", etc. to describe ordinary objects merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must have a given order in time, space, ranking, or in any other manner. Although the present invention has been described in terms of a limited number of embodiments, those skilled in the art, benefiting from the above description, will appreciate that other embodiments are conceivable within the scope of the invention thus described.
In addition, it should be noted that the language used in this specification has been chosen mainly for readability and teaching purposes, not to explain or limit the subject matter of the present invention. Therefore, many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the present invention, the present disclosure is illustrative rather than limiting, and the scope of the present invention is defined by the appended claims.

102: terminal device
103: microphone
104: output device
105: network
106: server
200: voice recognition device
210: communication module
220: voice recognition module
230: sound storage module
240: user discovery module
250: user registration module
300: computing device
302: basic configuration
304: processor
306: system memory
308: memory bus
310: level-1 cache
312: level-2 cache
314: processor core
316: registers
318: memory controller
320: operating system
322: application program
324: program data
330: bus/interface controller
332: storage device
334: storage interface bus
336: removable storage
338: non-removable storage
340: interface bus
342: output device
344: peripheral interface
346: communication device
348: graphics processing unit
350: sound processing unit
352: A/V ports
354: serial interface controller
356: parallel interface controller
358: I/O ports
360: network controller
362: computing device
364: communication ports
400: voice recognition method
S410–S440: steps

In order to achieve the above and related purposes, certain illustrative aspects are described herein in conjunction with the following description and the drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent by reading the following detailed description in conjunction with the drawings. Throughout this disclosure, the same reference numerals generally refer to the same parts or elements.
FIG. 1 shows a schematic diagram of a voice recognition system 100 according to an embodiment of the present invention;
FIG. 2 shows an architecture diagram of a voice recognition device 200 according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention; and
FIG. 4 shows a structural block diagram of a voice recognition method 400 according to an embodiment of the present invention.

102: terminal device

103: microphone

104: output device

105: network

106: server

200: voice recognition device

Claims (27)

1. A voice recognition method, comprising the steps of:
receiving sound data including a first voice;
determining whether there is a user matching the first voice;
in the case where there is no user matching the first voice, storing the sound data; and
clustering multiple pieces of stored sound data, so as to determine a new user from the multiple pieces of sound data.

2. The method according to claim 1, wherein the user corresponds to a user profile, the user profile includes the user's voiceprint, and the step of determining whether there is a user matching the first voice comprises:
determining whether the first voice matches the user's voiceprint, so as to determine whether there is a user matching the first voice.

3. The method according to claim 2, wherein the step of clustering the multiple pieces of stored sound data, so as to determine a new user from the multiple pieces of sound data, comprises:
dividing the multiple pieces of sound data into multiple sets based on pairwise similarity scores between the pieces of sound data;
determining at least one target set based on the sample density and the number of samples of each set, the target set corresponding to the new user; and
creating a user profile for the new user corresponding to the target set, and using at least part of the sound data in the target set to generate the new user's voiceprint.

4. The method according to claim 3, wherein the step of using at least part of the sound data in the target set to generate the new user's voiceprint comprises:
determining the sound data in the target set used to generate the new user's voiceprint according to the distance to the centroid of the target set.

5. The method according to claim 3, wherein the user profile includes a user flag indicating whether the user is actively registered, and the step of creating a user profile for the new user corresponding to the target set comprises:
setting the user flag in the user profile created for the new user corresponding to the target set to inactive registration;
the method further comprising the step of:
in the case where there is a user matching the first voice and the corresponding user flag indicates that the user is not actively registered, recording the number of pieces of sound data from the user.

6. The method according to claim 5, further comprising the step of:
after recording the number of pieces of sound data from the user, determining whether the number of pieces of sound data reaches a specific amount within a specific time period; and, if not, deleting the user profile corresponding to the user.
7. The method according to claim 2, wherein the user profile further includes a device identification of a terminal device associated with the user, the method comprising the steps of:
receiving the device identification of the terminal device that sends the sound data;
determining, based on the device identification, whether there is a user associated with the terminal device; and
if there is not, storing the sound data.

8. The method according to claim 1, further comprising the step of:
in the case where there is a user matching the first voice, storing the instruction corresponding to the first voice in association with the user.

9. The method according to any one of claims 1 to 8, further comprising the steps of:
receiving sound data including a second voice, the second voice being used to actively register a new user;
creating a user profile for the actively registered new user, and using the sound data including the second voice to generate the new user's voiceprint; and
setting the user flag in the user profile created for the actively registered new user to active registration.

10. The method according to claim 9, further comprising the steps of:
receiving the device identification of the terminal device that sends the sound data including the second voice; and
storing the device identification, in association with the actively registered new user, in the corresponding user profile.

11. The method according to any one of claims 2 to 8, wherein the step of determining whether the first voice matches the user's voiceprint comprises:
extracting voice features of the first voice from the sound data including the first voice;
obtaining a similarity score between the first voice and the user's voiceprint based on the voice features of the first voice; and
determining whether the first voice matches the user's voiceprint according to the similarity score.

12. A user identification method, comprising the steps of:
receiving sound data including a first voice;
determining whether there is a user matching the first voice;
in the case where there is no user matching the first voice, storing the sound data; and
clustering the stored multiple pieces of sound data, so as to determine a new user from the multiple pieces of sound data and conduct behavior analysis on the new user.
13. A voice recognition device, comprising:
a communication module, adapted to receive sound data including a first voice;
a voice recognition module, adapted to determine whether there is a user matching the first voice and, in the case where there is no user matching the first voice, store the sound data to a sound storage module;
the sound storage module, adapted to store the sound data; and
a user discovery module, adapted to cluster multiple pieces of sound data stored in the sound storage module, so as to determine a new user from the multiple pieces of sound data.

14. The device according to claim 13, wherein the user corresponds to a user profile, the user profile includes the user's voiceprint, and the voice recognition module is adapted to determine whether the first voice matches the user's voiceprint, so as to determine whether there is a user matching the first voice.

15. The device according to claim 14, wherein the user discovery module is adapted to:
divide the multiple pieces of sound data into multiple sets based on pairwise similarity scores between the pieces of sound data;
determine at least one target set based on the sample density and the number of samples of each set, the target set corresponding to the new user; and
create a user profile for the new user corresponding to the target set, and use at least part of the sound data in the target set to generate the new user's voiceprint.

16. The device according to claim 15, wherein the user discovery module is adapted to determine the sound data in the target set used to generate the new user's voiceprint according to the distance to the centroid of the target set.

17. The device according to claim 15, wherein the user profile includes a user flag indicating whether the user is actively registered, the user discovery module is adapted to set the user flag in the user profile created for the new user corresponding to the target set to inactive registration, and the voice recognition module is further adapted to record the number of pieces of sound data from the user in the case where there is a user matching the first voice and the corresponding user flag indicates that the user is not actively registered.

18. The device according to claim 17, wherein the voice recognition module is adapted to determine, after recording the number of pieces of sound data from the user, whether the number of pieces of sound data reaches a specific amount within a specific time period and, if not, delete the user profile corresponding to the user.
19. The device according to claim 14, wherein the user profile includes a device identification of a terminal device associated with the user, the communication module is further adapted to receive the device identification of the terminal device that sends the sound data, and the voice recognition module is further adapted to:
determine, based on the device identification, whether there is a user associated with the terminal device; and
if there is not, store the sound data to the sound storage module.

20. The device according to claim 13, wherein the voice recognition module is further adapted to store the instruction corresponding to the first voice in association with the user in the case where there is a user matching the first voice.

21. The device according to any one of claims 13 to 19, wherein the communication module is further adapted to receive sound data including a second voice, the second voice being used to actively register a new user, the device further comprising:
a user registration module, adapted to create a user profile for the actively registered new user, use the sound data including the second voice to generate the new user's voiceprint, and set the user flag in the user profile created for the actively registered new user to active registration.

22. The device according to claim 21, wherein the communication module is further adapted to receive the device identification of the terminal device that sends the sound data including the second voice, and the user registration module is further adapted to store the device identification, in association with the actively registered new user, in the corresponding user profile.

23. The device according to any one of claims 14 to 20, wherein the voice recognition module is further adapted to:
extract voice features of the first voice from the sound data including the first voice;
obtain a similarity score between the first voice and the user's voiceprint based on the voice features of the first voice; and
determine whether the first voice matches the user's voiceprint according to the similarity score.

24. The device according to any one of claims 14 to 20, wherein the device resides in a terminal device, and the terminal device is a speaker, a television, or a washing machine.
25. A user identification device, comprising:
a communication module, adapted to receive sound data including a first voice;
a voice recognition module, adapted to determine whether there is a user matching the first voice and, in the case where there is no user matching the first voice, store the sound data to a sound storage module;
the sound storage module, adapted to store the sound data; and
a user discovery module, adapted to cluster multiple pieces of sound data stored in the sound storage module, so as to determine a new user from the multiple pieces of sound data and conduct behavior analysis on the new user.

26. A voice recognition system, comprising a terminal device and a server, wherein the terminal device is adapted to receive a speaker's voice and send sound data including the voice to the server, and the voice recognition device according to any one of claims 13 to 24 resides on the server.

27. A computing device, comprising:
at least one processor; and
a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing the voice recognition method according to any one of claims 1 to 11.
TW108129251A 2018-11-12 2019-08-16 Voice recognition method and device and computing device TW202018696A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811340092.2 2018-11-12
CN201811340092.2A CN111179940A (en) 2018-11-12 2018-11-12 Voice recognition method and device and computing equipment

Publications (1)

Publication Number Publication Date
TW202018696A true TW202018696A (en) 2020-05-16

Family

ID=70655656

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108129251A TW202018696A (en) 2018-11-12 2019-08-16 Voice recognition method and device and computing device

Country Status (3)

Country Link
CN (1) CN111179940A (en)
TW (1) TW202018696A (en)
WO (1) WO2020098523A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897977A (en) * 2020-06-09 2020-11-06 惠州市德赛西威汽车电子股份有限公司 Intelligent voice entertainment system and method carried on child seat
US11468900B2 (en) * 2020-10-15 2022-10-11 Google Llc Speaker identification accuracy
CN112992174A (en) * 2021-02-03 2021-06-18 深圳壹秘科技有限公司 Voice analysis method and voice recording device thereof
CN113448975B (en) * 2021-05-26 2023-01-17 科大讯飞股份有限公司 Method, device and system for updating character image library and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140340B2 (en) * 2008-01-18 2012-03-20 International Business Machines Corporation Using voice biometrics across virtual environments in association with an avatar's movements
CN106295299A (en) * 2016-08-15 2017-01-04 歌尔股份有限公司 The user registering method of a kind of intelligent robot and device
CN108075892B (en) * 2016-11-09 2021-07-27 斑马智行网络(香港)有限公司 Voice processing method, device and equipment
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data
CN107147618B (en) * 2017-04-10 2020-05-15 易视星空科技无锡有限公司 User registration method and device and electronic equipment
CN107623614B (en) * 2017-09-19 2020-12-08 百度在线网络技术(北京)有限公司 Method and device for pushing information
CN107863108B (en) * 2017-11-16 2021-03-23 百度在线网络技术(北京)有限公司 Information output method and device
CN107978311B (en) * 2017-11-24 2020-08-25 腾讯科技(深圳)有限公司 Voice data processing method and device and voice interaction equipment
CN108766446A (en) * 2018-04-18 2018-11-06 上海问之信息科技有限公司 Method for recognizing sound-groove, device, storage medium and speaker
CN108597525B (en) * 2018-04-25 2019-05-03 四川远鉴科技有限公司 Voice vocal print modeling method and device

Also Published As

Publication number Publication date
CN111179940A (en) 2020-05-19
WO2020098523A1 (en) 2020-05-22

Similar Documents

Publication Publication Date Title
US11915699B2 (en) Account association with device
US11900948B1 (en) Automatic speaker identification using speech recognition features
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN109074806B (en) Controlling distributed audio output to enable speech output
EP3676831B1 (en) Natural language user input processing restriction
US11823678B2 (en) Proactive command framework
US11734326B2 (en) Profile disambiguation
CN111344780A (en) Context-based device arbitration
TW202018696A (en) Voice recognition method and device and computing device
US11495235B2 (en) System for creating speaker model based on vocal sounds for a speaker recognition system, computer program product, and controller, using two neural networks
US11205428B1 (en) Deleting user data using keys
US20140195232A1 (en) Methods, systems, and circuits for text independent speaker recognition with automatic learning features
US11862170B2 (en) Sensitive data control
US20240013784A1 (en) Speaker recognition adaptation
JP2023551729A (en) Self-supervised speech representation for fake audio detection
US11335346B1 (en) Natural language understanding processing
US10923113B1 (en) Speechlet recommendation based on updating a confidence value
WO2020003413A1 (en) Information processing device, control method, and program
US20220067289A1 (en) User data processing
US11437043B1 (en) Presence data determination and utilization
JP7287442B2 (en) Information processing device, control method, and program
US11227591B1 (en) Controlled access to data
US11531736B1 (en) User authentication as a service
US20220399016A1 (en) Presence-based application invocation