WO2019101083A1 - Voice data processing method, voice interaction device, and storage medium - Google Patents

Voice data processing method, voice interaction device, and storage medium

Info

Publication number
WO2019101083A1
WO2019101083A1 (PCT/CN2018/116590, priority CN2018116590W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
historical
user
cluster
speech
Prior art date
Application number
PCT/CN2018/116590
Other languages
English (en)
French (fr)
Inventor
马龙
李俊
张力
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2019101083A1 publication Critical patent/WO2019101083A1/zh
Priority to US16/600,421 priority Critical patent/US11189263B2/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L17/24 - Interactive procedures; Man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • The present application relates to the field of computer technologies, and in particular, to a voice data processing method, a voice interaction device, and a storage medium.
  • With the development of speech recognition technology, more and more devices, such as cars, speakers, and televisions, can be voice-controlled; that is, a voice interaction device can recognize a speaker's speech and be controlled automatically according to the recognized content.
  • A voice interaction device capable of voice recognition can provide personalized services based on the voice features of different speakers. Before this, a speaker needs to actively register with the voice interaction device so that it records the association between the speaker's voice features and the speaker's information; after a subsequent voice is recognized as matching the speaker's voice features, the usage authority corresponding to that speaker's information can be provided.
  • The current voice registration process usually requires the speaker to repeatedly and clearly say a number of fixed sentences to the voice interaction device so that the speaker's voice features can be extracted. The current voice registration method therefore needs to be proactively initiated by the speaker, and registration may take a long time, resulting in inefficient voice registration. Moreover, during registration the speaker may carelessly deviate from the fixed sentences provided by the system, causing voice registration to fail and reducing the success rate of voice registration.
  • The embodiments of the present application provide a voice data processing method, a voice interaction device, and a storage medium, which can improve the efficiency and the success rate of voice registration.
  • An aspect of the present application provides a voice data processing method, performed by a voice interaction device, including:
  • Obtaining historical voice data, acquiring historical voice feature vectors corresponding to the historical voice data, and clustering the historical voice feature vectors to obtain a voice feature cluster; the voice feature cluster includes at least one historical voice feature vector with similar features;
  • If the voice feature cluster satisfies the high-frequency user condition, training the corresponding user voice model according to the historical voice feature vectors included in the voice feature cluster;
  • Another aspect of the present application provides a voice interaction device, including a processor and a memory; the processor is coupled to the memory, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the following operations:
  • Obtaining historical voice data, acquiring historical voice feature vectors corresponding to the historical voice data, and clustering the historical voice feature vectors to obtain a voice feature cluster; the voice feature cluster includes at least one historical voice feature vector with similar features;
  • If the voice feature cluster satisfies the high-frequency user condition, training the corresponding user voice model according to the historical voice feature vectors included in the voice feature cluster;
  • Another aspect of the present application provides a computer storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, perform the voice data processing method.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2a is a schematic diagram of a scenario of a voice data processing method according to an embodiment of the present application
  • FIG. 2b is a schematic diagram of a scenario of another voice data processing method according to an embodiment of the present application.
  • FIG. 2c is a schematic diagram of another scenario of a voice data processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice data processing method according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another voice data processing method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a scenario of a parameter update method according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a performance verification result provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another performance verification result provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system architecture may include a voice interaction device 100a and a background server 100b.
  • the voice interaction device 100a may connect to the background server 100b via the Internet.
  • The voice interaction device 100a may include a smart device capable of voice recognition, such as a smart speaker, a smartphone, a computer, a smart TV, or a smart air conditioner.
  • The voice interaction device 100a can receive a user's voice data and send it to the background server 100b; the background server 100b performs voice recognition on the voice data and generates a control instruction according to the recognized semantics, and the voice interaction device 100a receives the control instruction returned by the background server 100b and executes it.
  • For example, when a user says "play song A", the voice interaction device 100a can transmit the voice data "play song A" to the background server 100b, the background server 100b performs voice recognition on the voice data "play song A" and generates a corresponding control instruction, and the voice interaction device 100a receives the control instruction and plays song A according to it.
  • The background server 100b can also discover high-frequency users from the voice data sent by the voice interaction device 100a and actively initiate identity registration for them. This process is illustrated in FIG. 2a to FIG. 2c, which are schematic diagrams of scenarios of the voice data processing method provided by the embodiments of the present application.
  • As shown in FIG. 2a, the voice interaction device 100a can receive voice data from multiple users and forward each piece of voice data to the background server 100b (for the voice control process, refer to the embodiment corresponding to FIG. 1 above). The background server 100b can therefore accumulate a large amount of voice data as historical voice data.
  • As shown in FIG. 2a, the background server 100b can cluster all the historical voice data to obtain a voice feature cluster 1, a voice feature cluster 2, a voice feature cluster 3, and a voice feature cluster 4. Each voice feature cluster includes at least one historical voice feature vector with similar features, and a historical voice feature vector may refer to the i-vector corresponding to a piece of historical voice data.
  • The background server 100b may further perform high-frequency user discovery on all voice feature clusters, specifically by analyzing the number of historical voice feature vectors in each voice feature cluster and their distribution density to determine whether the user corresponding to the voice feature cluster is a high-frequency user.
  • In FIG. 2a, the background server 100b determines that the users corresponding to voice feature cluster 1 and voice feature cluster 4 are high-frequency users (users who often send voice data to the voice interaction device 100a).
  • The background server 100b then creates a corresponding user voice model 1 for voice feature cluster 1 and a corresponding user voice model 4 for voice feature cluster 4.
  • At this point, user voice model 1 and user voice model 4 are user voice models not bound to user identity information, that is, unregistered user voice models.
  • As shown in FIG. 2b, the voice interaction device 100a can forward the voice data of user 1 at the current moment to the background server 100b.
  • The background server 100b can perform model matching on the voice data, specifically by comparing the i-vector corresponding to the voice data with the user voice model 1 and the user voice model 4 in FIG. 2a. As shown in FIG. 2b, user voice model 1 matches the voice data.
  • Since user voice model 1 is not yet bound to user identity information, the background server 100b can initiate a user identity association request corresponding to user voice model 1 to the voice interaction device 100a, and the identity association module in the voice interaction device 100a can play a user registration prompt to user 1 (for example, the prompt "Please input your identity information"). According to the prompt, user 1 can send his or her identity information to the voice interaction device 100a by voice or through a client, the voice interaction device 100a forwards the identity information of user 1 to the background server 100b, and the background server 100b performs user identity registration, the registration process being to bind user voice model 1 with the identity information of user 1.
  • After the binding, user voice model 1 is a user voice model bound to user identity information, that is, a registered user voice model, while user voice model 4 is still a user voice model not bound to user identity information.
  • In this way, the user no longer needs to repeatedly say fixed sentences to complete voice registration; the user only needs to respond to the user identity association request initiated by the voice interaction device, thereby improving the efficiency of voice registration.
  • The background server 100b can continuously receive the voice data sent by the voice interaction device 100a, forming more historical voice data. To ensure that new high-frequency users can still be discovered, the background server 100b can re-cluster the historical voice data periodically or after a certain amount has accumulated.
  • As shown in FIG. 2c, the background server 100b can cluster all historical voice data except the historical voice data matched with user voice model 1 (since user voice model 1 has already been registered, there is no need to cluster the historical voice data that matches it), obtaining a voice feature cluster 2, a voice feature cluster 3, a voice feature cluster 4, and a voice feature cluster 5. The voice feature cluster 2 in FIG. 2c may include the voice feature cluster 2 in FIG. 2a as well as some new historical voice feature vectors; the voice feature cluster 3 in FIG. 2c may include the voice feature cluster 3 in FIG. 2a as well as some new historical voice feature vectors; the voice feature cluster 4 in FIG. 2c may include the voice feature cluster 4 in FIG. 2a and may also include some new historical voice feature vectors; and the voice feature cluster 5 in FIG. 2c is a newly added voice feature cluster.
  • The background server 100b then performs high-frequency user discovery among voice feature cluster 2, voice feature cluster 3, voice feature cluster 4, and voice feature cluster 5, and determines that the users corresponding to voice feature cluster 3 and voice feature cluster 4 are high-frequency users. Since voice feature cluster 4 already has a corresponding user voice model 4, only the user voice model 3 corresponding to voice feature cluster 3 needs to be created; in addition, all the historical voice feature vectors in the voice feature cluster 4 of FIG. 2c can be used to update the existing user voice model 4.
  • As shown in FIG. 2c, the user voice models in the background server 100b now include user voice model 1, user voice model 3, and user voice model 4. User voice model 1 is bound to user identity information, while user voice model 3 and user voice model 4 are not; therefore, when voice data is detected to match user voice model 3 or user voice model 4, user identity registration may be initiated.
  • In this way, user voice models not bound to user identity information can be added continuously, and through the automatic identity registration mechanism they can gradually be converted into user voice models bound to user identity information; that is, the identity registration of each high-frequency user is gradually completed.
  • Optionally, the functions of the background server 100b can be integrated into the voice interaction device 100a; that is, the voice interaction device 100a can directly perform voice recognition on voice data to implement voice control, and can also directly perform high-frequency user discovery on the received voice data and actively initiate identity registration for high-frequency users.
  • The following embodiments corresponding to FIG. 3 to FIG. 9 describe the specific process of high-frequency user discovery and active registration of high-frequency users, taking as an example the case where the background server 100b is integrated into the voice interaction device 100a.
  • FIG. 3 is a schematic flowchart of a voice data processing method according to an embodiment of the present application, where the method may include:
  • S301: Acquire historical voice data, acquire historical voice feature vectors corresponding to the historical voice data, and cluster the historical voice feature vectors to obtain a voice feature cluster; the voice feature cluster includes at least one historical voice feature vector with similar features.
  • Specifically, the voice interaction device may directly perform semantic recognition on the obtained user speech and then perform the control operations associated with the recognized semantics.
  • The voice interaction device may include a smart device capable of voice interaction, recognition, and control, such as a speaker, a television, a car, a mobile phone, or a VR (Virtual Reality) device. For example, if the user says "play the next song" to the voice interaction device, the voice interaction device analyzes the semantics and then switches the current song to the next song for playback.
  • The voice interaction device can start the voice control function without waiting for the user to complete voice registration; that is, even before a user's voice features are bound to user identity information, the voice interaction device can perform the associated control operation according to the user's voice content. Moreover, multiple different users can speak to the voice interaction device, so that it performs an associated control operation according to each user's voice instruction; the voice interaction device can also record and save each user's voice, and each saved user voice is determined as historical voice data.
  • In the embodiment of the present application, the historical voice feature vector can be an i-vector (identity vector).
  • The process of obtaining the i-vector may be as follows: first, all historical voice data are used to train a high-order Gaussian Mixture Model (GMM), which describes the speakers' voice feature space; such a GMM trained on all speakers is also called a Universal Background Model (UBM), giving the GMM-UBM model.
  • The GMM-UBM model is then used to estimate the parameters of each piece of historical voice data, determining the mixing weight, mean vector, and variance matrix of each component of the Gaussian mixture model.
  • According to the GMM-UBM model and a global difference space matrix T, the speaker- and channel-related characteristics of each piece of historical voice data that are implicit in the high-dimensional speaker voice space can be projected into a low-dimensional space, obtaining the historical voice feature vector of each piece of historical voice data, that is, the i-vector.
  • Optionally, the global difference space matrix T can also be trained based on deep neural networks.
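  • A minimal sketch of the UBM training step described above, assuming the historical voice data has already been converted to MFCC frame features by an external front end (the function name, feature dimension, and component count below are illustrative assumptions, not values from the patent). Estimating the global difference matrix T and extracting the actual i-vectors would normally be done with a dedicated speaker-recognition toolkit and is not shown here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(mfcc_frames: np.ndarray, n_components: int = 512) -> GaussianMixture:
    """Train a high-order GMM (Universal Background Model) on MFCC frames
    pooled over all historical voice data."""
    ubm = GaussianMixture(
        n_components=n_components,
        covariance_type="diag",  # diagonal covariances are typical for a UBM
        max_iter=200,
    )
    ubm.fit(mfcc_frames)
    return ubm

# Illustrative usage: frames of 20-dimensional MFCCs pooled from all utterances.
# frames = np.load("historical_mfcc.npy")      # hypothetical file, shape (n_frames, 20)
# ubm = train_ubm(frames)
# ubm.weights_, ubm.means_ and ubm.covariances_ correspond to the mixing weights,
# mean vectors and variance matrices mentioned above; projecting each utterance's
# statistics with the global difference matrix T then yields its i-vector.
```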
  • After obtaining the historical voice feature vectors, the voice interaction device may further reduce their dimensionality and cluster the reduced historical voice feature vectors according to the target clustering model parameters to obtain voice feature clusters.
  • The dimensionality reduction of the historical voice feature vectors may use PCA (Principal Component Analysis), tSNE (t-distributed Stochastic Neighbor Embedding), LDA (Linear Discriminant Analysis), or similar algorithms to reduce the dimensionality of the acquired historical voice feature vectors (i.e., i-vectors), removing redundant, multi-collinear components in the data and reducing the computational complexity of clustering.
  • Dimensionality reduction with PCA or tSNE is unsupervised, that is, no model needs to be pre-trained and the method can be applied directly to the i-vectors; dimensionality reduction with LDA requires pre-training the optimal projection direction on i-vector data with actual labels, and the trained projection is then applied to reduce the dimensionality of the i-vectors.
  • The specific process of clustering all historical voice feature vectors may be: use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm with the Euclidean distance as the sample distance metric to cluster the reduced historical voice feature vectors into clusters (i.e., voice feature clusters).
  • The DBSCAN clustering algorithm can find irregularly shaped clusters in the feature space and does not require the number of clusters to be set in advance, which satisfies the requirement of this scenario, where the number of speakers is unknown.
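  • A minimal sketch of the dimensionality reduction and DBSCAN clustering described above, assuming the historical i-vectors are stacked in a NumPy array; the PCA dimensionality and the Eps/MinPts values are placeholder assumptions, not the patent's parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_ivectors(ivectors: np.ndarray, eps: float = 0.8, min_pts: int = 5):
    """Reduce i-vector dimensionality with PCA, then cluster with DBSCAN.

    ivectors: array of shape (n_utterances, ivector_dim), e.g. (N, 400).
    Returns (reduced_vectors, labels); label -1 marks noise points.
    """
    reduced = PCA(n_components=50).fit_transform(ivectors)  # drop redundant components
    labels = DBSCAN(eps=eps, min_samples=min_pts, metric="euclidean").fit_predict(reduced)
    return reduced, labels

# Each non-negative label corresponds to one voice feature cluster; the number of
# clusters does not need to be fixed in advance, matching the unknown number of speakers.
```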
  • After generating at least one voice feature cluster, for each voice feature cluster it is determined, from the number of historical voice feature vectors in the cluster and their distribution, whether the voice feature cluster satisfies the high-frequency user condition. For example, if the number of historical voice feature vectors in a voice feature cluster exceeds a preset number threshold and their distribution density also exceeds a preset density threshold, it may be determined that the voice feature cluster satisfies the high-frequency user condition, that is, the speaker corresponding to the voice feature cluster is a user who frequently performs voice interaction with the voice interaction device.
  • If a voice feature cluster satisfies the high-frequency user condition, the corresponding user voice model may be trained according to the historical voice feature vectors included in that voice feature cluster.
  • The process of training the user voice model may be: acquire all historical voice feature vectors in the voice feature cluster satisfying the high-frequency user condition, perform a mean calculation or interpolation calculation on the acquired historical voice feature vectors to obtain a target historical voice feature vector, and use the target historical voice feature vector as the model parameter of the user voice model corresponding to the voice feature cluster.
  • Optionally, the mean calculation of the historical voice feature vectors in the voice feature cluster may be: add up all historical voice feature vectors in the voice feature cluster and divide by the number of historical voice feature vectors in the cluster to obtain the target historical voice feature vector.
  • Alternatively, the mean calculation may be weighted: weight each historical voice feature vector in the voice feature cluster according to a weight coefficient, sum them, and divide by the number of historical voice feature vectors in the cluster to obtain the target historical voice feature vector.
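  • A minimal sketch of the two averaging schemes described above (plain mean and weighted mean over the cluster's historical voice feature vectors); the weight values a caller would pass in are assumptions.

```python
import numpy as np

def target_ivector_mean(cluster_vectors: np.ndarray) -> np.ndarray:
    """Plain mean: sum the historical voice feature vectors in the cluster and
    divide by their number."""
    return cluster_vectors.sum(axis=0) / len(cluster_vectors)

def target_ivector_weighted(cluster_vectors: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted mean: weight each historical voice feature vector before summing,
    then divide by the number of vectors, as described above."""
    return (cluster_vectors * weights[:, None]).sum(axis=0) / len(cluster_vectors)

# The resulting target historical voice feature vector is stored as the model
# parameter of the user voice model for this voice feature cluster.
```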
  • When voice data is subsequently received, the voice feature vector (i.e., i-vector) of the voice data is acquired and compared with each created user voice model. If the currently received voice data (i.e., the current voice data) matches a certain user voice model and that user voice model is not bound to user identity information, the voice interaction device may initiate a user identity association request associated with the current voice data.
  • The specific form of the user identity association request may be a voice prompt for user identity association (for example, the voice interaction device says "Please bind your identity information"), or a registration interface for user identity association sent to the user terminal (for example, the registration interface can be displayed on the user's mobile phone, and the user can fill in identity information on the registration interface, or bind a user account on the registration interface, to complete voice registration).
  • Optionally, the Euclidean distance can be used to determine the vector distance between the i-vector of the current voice data and the i-vector in each user voice model, and a user voice model whose vector distance is less than the distance threshold is determined to be the user voice model matching the current voice data.
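  • A minimal sketch of the matching step described above, assuming each user voice model is represented by its stored target i-vector; the distance threshold value is a placeholder.

```python
import numpy as np

def match_user_model(current_ivector: np.ndarray,
                     model_ivectors: dict,
                     distance_threshold: float = 0.5):
    """Return the id of the closest user voice model whose Euclidean distance to
    the current i-vector is below the threshold, or None if nothing matches."""
    best_id, best_dist = None, float("inf")
    for model_id, model_vec in model_ivectors.items():
        dist = np.linalg.norm(current_ivector - model_vec)
        if dist < best_dist:
            best_id, best_dist = model_id, dist
    return best_id if best_dist < distance_threshold else None

# If the matched model is not yet bound to user identity information, the device
# initiates the user identity association request described above.
```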
  • Optionally, if the current voice data matches a user voice model that is already bound to user identity information, the voice interaction device can also provide a corresponding personalized service to the speaker of the current voice data according to that user voice model. For example, if the current voice data is "play song A", the voice interaction device can acquire the user habit parameters corresponding to the bound user identity information (such as the tone, volume, and other parameters preferred by the user) and adjust the audio parameters of song A accordingly before playing it; or, if the bound user identity information is administrator identity information, the voice interaction device may open system management permissions to the speaker of the current voice data.
  • If the user identity association request is a voice prompt, the speaker corresponding to the current voice data can complete voice registration by giving a corresponding response message by voice. For example, if the speaker says the response message "My identity information is XXXX", the voice interaction device learns through voice recognition that the user identity information in the response message is "XXXX" and binds the user identity information "XXXX" to the user voice model that matches the current voice data.
  • If the user identity association request is a registration interface sent to the user terminal, the speaker corresponding to the current voice data may enter a corresponding response message through the registration interface to complete voice registration.
  • For example, the registration interface may include a user name input box, a password input box, a user interest input box, and the like. The speaker can enter the corresponding data in each input box of the registration interface; after the speaker clicks submit, the user terminal encapsulates the data entered in the registration interface into a response message and sends the response message to the voice interaction device, so that the voice interaction device binds the user identity information in the response message (e.g., the user name, password, and user interests entered in the registration interface) to the user voice model that matches the current voice data.
  • FIG. 4 is a schematic flowchart of another voice data processing method according to an embodiment of the present application, where the method may include:
  • S401 Acquire all historical voice data, and train a Gaussian mixture model and a global difference space matrix according to all the historical voice data;
  • Specifically, the voice interaction device (which may be the voice interaction device 100a integrating all functions of the background server 100b in the embodiment corresponding to FIG. 1) may cluster the dimension-reduced historical voice feature vectors based on the DBSCAN clustering algorithm. The DBSCAN algorithm assumes that the cluster structure can be determined by how closely the samples are distributed, which is characterized by a pair of parameters (Eps, MinPts): Eps is the neighborhood radius used when defining density, and MinPts is the threshold used when defining a core sample. That is, the target clustering model parameters may include Eps (the density neighborhood radius) and MinPts (the core sample threshold).
  • The DBSCAN clustering algorithm may take all dimension-reduced historical voice feature vectors as sample points to generate a sample data set containing these sample points, and find all sample points that are core points in the sample data set according to the density neighborhood radius and the core sample threshold; then any core point among all core points is determined as a starting point, all sample points in the sample data set that have a density-reachable relationship with the starting point are found as reachable sample points (they are called reachable sample points to distinguish them from other sample points), and a voice feature cluster containing the starting point and all its reachable sample points is generated; the next core point among all core points is then determined as the starting point, and this step is repeated until every core point has been used as a starting point.
  • Directly density-reachable is defined as follows: if x_j is located in the Eps-neighborhood of x_i and x_i is a core point, then x_j is said to be directly density-reachable from x_i. Density-reachable means that there is a chain of sample points from x_i to x_j in which each point is directly density-reachable from the previous one.
  • Density-connected is defined as follows: for x_i and x_j, if there exists x_k such that both x_i and x_j are density-reachable from x_k, then x_i and x_j are said to be density-connected.
  • The DBSCAN clustering algorithm classifies all sample points in the sample data set D into three categories: core points, boundary points, and noise points. A core point is a sample point whose Eps-radius neighborhood contains no fewer than MinPts sample points; a boundary point is a non-core point that falls within the Eps-neighborhood of some core point; and the remaining sample points are noise points.
  • A cluster is defined as the collection of density-connected core points together with their boundary points.
  • The DBSCAN clustering algorithm first finds all core points in the sample data set D according to the parameters (Eps, MinPts), then takes any core point as a starting point, finds all sample points density-reachable from it as reachable sample points, and generates a voice feature cluster containing the starting point and all reachable sample points, until all core points have been visited; that is, each voice feature cluster includes at least one historical voice feature vector with similar features.
  • The specific process of generating voice feature clusters and checking the high-frequency user condition is as follows:
  • The intra-class divergence corresponding to a voice feature cluster is determined according to the number of historical voice feature vectors included in the cluster and the historical voice feature vectors themselves.
  • The system number threshold may be base_frequency and the system intra-class divergence threshold may be base_divergence; that is, if the number of samples in the voice feature cluster is greater than base_frequency and the intra-class divergence of the voice feature cluster is less than base_divergence, it may be determined that the voice feature cluster satisfies the high-frequency user condition. Here, base_frequency and base_divergence are hyperparameters set by the system.
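  • A sketch of the high-frequency user condition check, using the base_frequency and base_divergence hyperparameters named above. The intra-class divergence formula is not reproduced in this text, so the definition used below (mean Euclidean distance of the cluster's vectors to their centroid) is an assumption for illustration only.

```python
import numpy as np

def intra_class_divergence(cluster_vectors: np.ndarray) -> float:
    """Assumed definition: mean Euclidean distance of the cluster's historical
    voice feature vectors to the cluster centroid."""
    centroid = cluster_vectors.mean(axis=0)
    return float(np.linalg.norm(cluster_vectors - centroid, axis=1).mean())

def is_high_frequency(cluster_vectors: np.ndarray,
                      base_frequency: int,
                      base_divergence: float) -> bool:
    """The number of samples must exceed base_frequency and the intra-class
    divergence must stay below base_divergence, as described above."""
    return (len(cluster_vectors) > base_frequency
            and intra_class_divergence(cluster_vectors) < base_divergence)
```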
  • If the voice feature cluster satisfies the high-frequency user condition, the corresponding user voice model may be trained according to the historical voice feature vectors included in the voice feature cluster.
  • The process of training the user voice model may be: acquire all historical voice feature vectors in the voice feature cluster satisfying the high-frequency user condition, perform a mean calculation or interpolation calculation on the acquired historical voice feature vectors to obtain a target historical voice feature vector, and use the target historical voice feature vector as the model parameter of the user voice model corresponding to the voice feature cluster.
  • For the specific implementation of S406-S407, refer to S303-S304 in the embodiment corresponding to FIG. 3 above; details are not repeated here.
  • The Eps and MinPts parameters of the DBSCAN clustering algorithm directly determine the clustering performance.
  • Therefore, Eps and MinPts can be updated regularly; the more accurate Eps and MinPts are, the more accurate the newly clustered voice feature clusters will be.
  • The clustering algorithm performance parameters may include two external indices, the Jaccard coefficient (JC) and the Rand index (RI). The performance of the clustering algorithm can be measured by JC and RI; that is, when clustering performance improves, JC and RI also increase.
  • The clustering algorithm performance parameter maximization condition can refer to the condition under which JC is maximized.
  • JC = SS / (SS + SD + DS)
  • RI = (SS + DD) / (SS + SD + DS + DD)
  • Here, SS denotes the number of sample-point pairs with the same actual label and the same cluster label; SD denotes the number of sample-point pairs with the same actual label but different cluster labels; DS denotes the number of sample-point pairs with different actual labels but the same cluster label; and DD denotes the number of sample-point pairs with different actual labels and different cluster labels (a label here may refer to the identity information of the speaker).
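  • A minimal sketch of computing JC and RI from the pair counts SS, SD, DS, and DD defined above, given the actual speaker labels and the cluster labels of the sample points.

```python
from itertools import combinations

def pair_counts(actual_labels, cluster_labels):
    """Count sample-point pairs by whether their actual labels and cluster labels agree."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(actual_labels)), 2):
        same_actual = actual_labels[i] == actual_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_actual and same_cluster:
            ss += 1
        elif same_actual:
            sd += 1
        elif same_cluster:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def jc_ri(actual_labels, cluster_labels):
    """Jaccard coefficient and Rand index from the pair counts."""
    ss, sd, ds, dd = pair_counts(actual_labels, cluster_labels)
    jc = ss / (ss + sd + ds)
    ri = (ss + dd) / (ss + sd + ds + dd)
    return jc, ri
```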
  • Specifically, the process of updating the current clustering model parameters may be: acquire all historical voice feature vectors that match user voice models bound to user identity information as first historical voice feature vectors; use 70% of the first historical voice feature vectors as the training set and the remaining 30% as the verification set; train a DBSCAN clustering model on the training set with the training objective of maximizing JC; determine the JC of the clustering model on the verification set, and select the Eps and MinPts parameters that maximize the JC value on the verification set as the optimized model parameters (i.e., the target clustering model parameters).
  • The target clustering model parameters can then be updated periodically or after a certain amount of data has accumulated, so that they are gradually optimized.
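  • A simplified sketch of the parameter update described above: the bound users' historical voice feature vectors are split 70/30, and the (Eps, MinPts) pair that maximizes JC on the 30% verification split is kept. The candidate grids are assumptions, and the jc_ri helper from the previous sketch is reused; since DBSCAN has no trainable parameters beyond Eps and MinPts, the search here is performed directly on the verification split.

```python
from sklearn.cluster import DBSCAN

def update_cluster_params(vectors, speaker_labels, eps_grid, minpts_grid, train_ratio=0.7):
    """Grid-search (Eps, MinPts) so that JC on the 30% verification split is maximized.

    vectors: first historical voice feature vectors (matched to bound user voice models).
    speaker_labels: the bound user identity information, used as actual labels.
    """
    n_train = int(len(vectors) * train_ratio)
    val_x, val_y = vectors[n_train:], speaker_labels[n_train:]
    best_params, best_jc = None, -1.0
    for eps in eps_grid:
        for min_pts in minpts_grid:
            pred = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(val_x)
            jc, _ = jc_ri(val_y, pred)  # jc_ri defined in the previous sketch
            if jc > best_jc:
                best_jc, best_params = jc, (eps, min_pts)
    return best_params  # the updated target clustering model parameters
```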
  • Alternatively, the number of newly added historical voice data items after the target clustering model parameters are generated can be accumulated, and step S408 is performed when that number reaches the first number threshold; or timing starts after the target clustering model parameters are generated, and step S408 is performed when the accumulated duration reaches the first duration threshold.
  • As the historical voice data increases, some new high-frequency users may appear. It is therefore necessary to periodically re-cluster so as to divide new voice feature clusters and discover new high-frequency users through them. Therefore, if the number of historical voice data items newly added after the last clustering reaches the second number threshold, or the duration accumulated after the last clustering reaches the second duration threshold, all historical voice feature vectors that match user voice models not bound to user identity information, together with the historical voice feature vectors that match no user voice model (i.e., historical voice feature vectors that do not belong to high-frequency users), are acquired as second historical voice feature vectors, and the second historical voice feature vectors are clustered to obtain the currently generated voice feature clusters.
  • For the process of clustering the second historical voice feature vectors, refer to step S403 above; details are not repeated here.
  • Before clustering, any second historical voice feature vectors that have not yet been dimension-reduced may be subjected to dimensionality reduction.
  • The voice feature clusters corresponding to user voice models not bound to user identity information are then updated according to the currently generated voice feature clusters. The update process may specifically be: detect whether each currently generated voice feature cluster satisfies the high-frequency user condition; determine the currently generated voice feature clusters that satisfy the high-frequency user condition as voice feature clusters to be updated; train the user voice model corresponding to each voice feature cluster to be updated; and compare the user voice model corresponding to each voice feature cluster to be updated with the user voice models that were not bound to user identity information before re-clustering. If the user voice model corresponding to a voice feature cluster to be updated approximates a user voice model not bound to user identity information (e.g., the vector distance between the i-vectors of the two user voice models is less than a preset distance threshold), the user portrait data in the voice feature cluster to be updated may be transferred and inherited to the voice feature cluster having the similar user voice model, completing the update of the voice feature cluster corresponding to that unbound user voice model.
  • In addition, the voice feature clusters that do not satisfy the high-frequency user condition are replaced; that is, the pre-existing voice feature clusters that do not satisfy the high-frequency user condition are deleted, and all currently generated voice feature clusters other than the voice feature clusters to be updated that have already been used to transfer and inherit user portrait data are retained.
  • For example, suppose that before re-clustering there exist a voice feature cluster a1 and a voice feature cluster a2 that do not satisfy the high-frequency user condition, a voice feature cluster a3 and a voice feature cluster a4 whose user voice models are not bound to user identity information, and a voice feature cluster a5 whose user voice model is bound to user identity information.
  • After re-clustering, the currently generated voice feature clusters are a voice feature cluster b1, a voice feature cluster b2, a voice feature cluster b3, and a voice feature cluster b4, where the voice feature cluster b1 and the voice feature cluster b2 do not satisfy the high-frequency user condition, and the voice feature cluster b3 and the voice feature cluster b4 satisfy the high-frequency user condition.
  • If the user voice model corresponding to the voice feature cluster b3 is similar to the user voice model corresponding to the voice feature cluster a4, the user portrait data in the voice feature cluster b3 can be transferred and inherited into the voice feature cluster a4 to complete the update of the voice feature cluster a4.
  • If the user voice model corresponding to the voice feature cluster b4 differs from both the user voice model corresponding to the voice feature cluster a4 and the user voice model corresponding to the voice feature cluster a3, the voice feature cluster b4, the voice feature cluster b1, and the voice feature cluster b2 are retained, while the voice feature clusters a1 and a2 are deleted.
  • After the update, all the voice feature clusters in the voice interaction device include: the voice feature cluster b4, the voice feature cluster b1, the voice feature cluster b2, the updated voice feature cluster a4, the voice feature cluster a3, and the voice feature cluster a5.
  • Further, all historical voice feature vectors matched by a user voice model already bound to user identity information can be acquired as third historical voice feature vectors, and that user voice model is updated according to the third historical voice feature vectors. The third historical voice feature vectors corresponding to a user voice model bound to user identity information may include the previously existing historical voice feature vectors and the historical voice feature vectors newly added after clustering; the model parameter of the user voice model (the model parameter is an i-vector) is also generated from the previously existing historical voice feature vectors.
  • The process of updating the user voice model may therefore be: perform a mean calculation or interpolation calculation on the model parameter of the user voice model and the historical voice feature vectors accumulated after clustering to obtain an updated historical voice feature vector, and replace the model parameter of the user voice model with the updated historical voice feature vector, completing the update of the user voice model.
  • For example, if the user voice model is updated by mean calculation and the user voice model A bound to user identity information has the model parameter a1, then the model parameter a1 and the historical voice feature vectors that were newly added after clustering and that match the user voice model A are averaged, and the result replaces a1 as the model parameter of the user voice model A.
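  • A minimal sketch of the mean-based model update in the example above, assuming the user voice model stores a single i-vector model parameter (a1 in the example) and the newly accumulated matching i-vectors are available as an array.

```python
import numpy as np

def update_user_voice_model(model_param: np.ndarray,
                            new_matching_ivectors: np.ndarray) -> np.ndarray:
    """Average the existing model parameter with the historical voice feature
    vectors accumulated after clustering that match this user voice model, and
    use the result to replace the old model parameter."""
    stacked = np.vstack([model_param[None, :], new_matching_ivectors])
    return stacked.mean(axis=0)

# Example: user voice model A with parameter a1 and newly matched i-vectors.
# a1_updated = update_user_voice_model(a1, new_vectors)   # replaces a1 in model A
```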
  • Alternatively, the number of newly added historical voice data items that match a user voice model can be accumulated after that user voice model is updated, and step S410 is performed when that number reaches the third number threshold; or timing starts after a user voice model is updated, and step S410 is performed when the accumulated duration reaches the third duration threshold.
  • The step S408 may be performed at any time between S401 and S407, before S401, or after S407; that is, the current clustering model parameters may be updated periodically or after a certain amount of data has accumulated following each clustering. Therefore, the execution order of step S408 is not limited.
  • The step S409 may likewise be performed at any time between S401 and S407, before S401, or after S407; that is, re-clustering may be performed periodically or after a certain amount of data has accumulated to update or replace the corresponding voice feature clusters. Therefore, the execution order of step S409 is not limited.
  • The step S410 may also be performed at any time between S401 and S407, before S401, or after S407; that is, the corresponding user voice model may be updated periodically or after a certain amount of data has accumulated following each clustering. Therefore, the execution order of step S410 is not limited.
  • The first number threshold, the second number threshold, and the third number threshold may be the same or different, and likewise the first duration threshold, the second duration threshold, and the third duration threshold may be the same or different; this is not limited here.
  • Optionally, the first number threshold may be set slightly smaller than the second number threshold (the difference between the two number thresholds is small), or the first duration threshold slightly smaller than the second duration threshold (the difference between the two duration thresholds is small), to ensure that the target clustering model parameters are updated before each clustering, so that each clustering can be based on the newly updated target clustering model parameters, improving the accuracy of each clustering. The first number threshold and the second number threshold may both be greater than the third number threshold, or the first duration threshold and the second duration threshold may both be greater than the third duration threshold, to avoid updating the target clustering model parameters and the voice feature clusters too frequently: overly frequent updates tend to make the target clustering model parameters before and after an update too similar and the voice feature clusters before and after an update change very little, which wastes system resources.
  • The user voice model, by contrast, can be updated more frequently so that its accuracy is guaranteed and a user's voice can be matched to the correct user voice model faster and more accurately.
  • Optionally, the GMM can also be updated periodically or after a certain amount of data has accumulated. As time passes, more and more historical voice data accumulates; retraining the GMM on all the enlarged historical voice data can improve the accuracy of the GMM and, after the GMM is updated, the accuracy of the i-vectors calculated from it.
  • Before the first clustering, the voice interaction device may acquire sample voice data and set a corresponding sample user identity label for each piece of sample voice data (that is, the speaker corresponding to each piece of sample voice data is known), then train the initial clustering model parameters according to the clustering algorithm performance parameter maximization condition and the correspondence between the sample voice data and the sample user identity labels, and determine the trained initial clustering model parameters as the target clustering model parameters.
  • For the specific process of training the initial clustering model parameters, refer to the process of updating the current clustering model parameters in step S408 above; details are not repeated here.
  • In other words, the first clustering may be performed according to the initial clustering model parameters, which are determined as the target clustering model parameters, and the target clustering model parameters may then be updated periodically or after a certain amount of data has accumulated.
  • For example, 20 groups of wake-up voice data (i.e., sample voice data) containing the actual identity labels of the speakers (i.e., sample user identity labels) are acquired, each group containing 10 speakers with 10 wake-up utterances per speaker. In each group, the wake-up voice data of 7 randomly selected speakers is used as the training set and the wake-up voice data of the remaining 3 speakers as the verification set; for each group of data, the i-vectors of the wake-up voice data are extracted and dimension-reduced, the training objective is to maximize JC, and, to avoid overfitting, the JC of the clustering model on the verification set is determined during training and the Eps and MinPts parameters that maximize the JC value on the verification set are selected as the initial clustering model parameters.
  • FIG. 5 is a schematic diagram of a scenario of the parameter update method provided by an embodiment of the present application.
  • As shown in FIG. 5, the voice interaction device may first acquire sample voice data and generate the i-vectors corresponding to the sample voice data (which here may be dimension-reduced i-vectors), and train a DBSCAN clustering model according to the i-vectors corresponding to the sample voice data with the training objective of maximizing JC; to avoid overfitting, the JC of the clustering model on the verification set is determined during training, and the Eps and MinPts parameters that maximize the JC value on the verification set are selected as the initial clustering model parameters, i.e., the initialized Eps and MinPts.
  • The voice interaction device can then obtain the i-vectors corresponding to the historical voice data (here, dimension-reduced i-vectors) and perform DBSCAN clustering on them according to the initialized Eps and MinPts, and high-frequency user discovery and automatic user identity registration can be performed according to the voice feature clusters obtained after clustering (refer to S401-S407 in the embodiment corresponding to FIG. 4 above). As shown in FIG. 5, the user voice models bound to user identity information obtained by the voice interaction device may include a user voice model a, a user voice model b, and a user voice model c.
  • The voice interaction device can also, periodically or after a certain amount of data has accumulated, train a DBSCAN clustering model according to the voice feature clusters corresponding to the user voice models bound to user identity information (such as a voice feature cluster a, a voice feature cluster b, and a voice feature cluster c), with the training objective of maximizing JC; to avoid overfitting, the JC of the clustering model on the verification set is determined during training, and the Eps and MinPts parameters that maximize the JC value on the verification set are selected as the updated clustering model parameters, i.e., the updated Eps and MinPts (refer to S408 in the embodiment corresponding to FIG. 4 above).
  • The voice interaction device can then perform DBSCAN clustering on the i-vectors corresponding to the historical voice data (including newly added historical voice data) according to the updated Eps and MinPts, obtain the voice feature clusters shown in FIG. 5, update the corresponding voice feature clusters, and replace the voice feature clusters that do not satisfy the high-frequency user condition.
  • In this way, Eps and MinPts may be updated periodically or after a certain amount of data has accumulated according to the voice feature clusters corresponding to the user voice models bound to user identity information; as more user voice models are bound to user identity information, increasingly accurate and reasonable Eps and MinPts can be trained. The initialized Eps and MinPts are used only for the first clustering; each subsequent clustering uses the most recently updated Eps and MinPts.
  • The embodiment of the present application also provides a technical feasibility verification of the above solution.
  • Smart speakers are usually not attributed to a specific user and are used by multiple users, but the number of users is very limited; for example, for speaker devices used in a home, the number of users usually does not exceed 10, and family members, due to differences in age, gender, and so on, have relatively distinct voiceprint characteristics.
  • Accordingly, in the verification data, 10 people are randomly selected from a pool of 600 people as a group, and each person provides 10 utterances of the same wake-up words as voice samples.
  • The embodiments of the present application organize two sets of experiments, verifying the feasibility of the above clustering method and the feasibility of high-frequency user discovery, respectively.
  • The feasibility verification process of the clustering method may be: randomly generate 10 groups of data (each group includes the 10 identical voice samples provided by each of 10 non-repeating individuals) as the training set; in each group, the voice data of 7 randomly selected people is used to train the model parameters (Eps, MinPts) with the training objective of maximizing JC, and the voice data of the remaining 3 people is used for verification to mitigate model overfitting; 10 further groups of data are randomly generated as the test set, and the performance of the trained clustering model on the test set is measured by JC and RI.
  • FIG. 6 is a schematic diagram of a performance verification result provided by an embodiment of the present application.
  • As shown in FIG. 6, the JC and RI on the 10 test groups (group 1 to group 10 in FIG. 6) are all high, which means the clustering model performs well; therefore, the clustering method in the embodiment of the present application is feasible.
  • The feasibility verification process of high-frequency user discovery may be: first obtain the 10 test groups from the feasibility verification of the clustering method above; for each test group, after clustering and high-frequency user discovery are completed, the category of the voice feature cluster in which a discovered high-frequency user is located is set to the category of the voice samples that occur most often in that voice feature cluster.
  • The precision and recall of each discovered voice feature cluster satisfying the high-frequency user condition in the test group can then be determined, and the mean precision and mean recall over all voice feature clusters satisfying the high-frequency user condition indicate the performance of the high-frequency user discovery algorithm on the test set; the higher the precision and recall, the more accurate the discovered high-frequency clusters. Please refer to FIG. 7 for the corresponding results.
  • FIG. 8 is a schematic structural diagram of a voice data processing apparatus according to an embodiment of the present application.
  • The voice data processing apparatus 1 can be applied to the voice interaction device in the embodiments of FIG. 3 or FIG. 4 above. The voice data processing apparatus 1 can include: a clustering module 10, a first training module 20, a request initiating module 30, and a binding module 40.
  • The clustering module 10 is configured to acquire historical voice data, acquire historical voice feature vectors corresponding to the historical voice data, and cluster the historical voice feature vectors to obtain a voice feature cluster; the voice feature cluster includes at least one historical voice feature vector with similar features.
  • a first training module 20 configured to: if the voice feature cluster meets a high frequency user condition, train a corresponding user voice model according to the historical voice feature vector included in the voice feature cluster;
  • Specifically, the first training module 20 may be configured to perform a mean calculation or interpolation calculation on the historical voice feature vectors included in the voice feature cluster to obtain a target historical voice feature vector, and use the target historical voice feature vector as the model parameter of the user voice model corresponding to the voice feature cluster.
  • the request initiating module 30 is configured to initiate a user identity association request associated with the current voice data if the current voice feature vector of the current voice data is detected to match the user voice model;
  • the binding module 40 is configured to bind the user identity information in the response message to the user voice model if receiving a response message corresponding to the user identity association request.
  • For the specific function implementations of the clustering module 10, the first training module 20, the request initiating module 30, and the binding module 40, refer to S301-S304 in the embodiment corresponding to FIG. 3 above; details are not repeated here.
  • the clustering module 10 may include: an acquisition training unit 101, a vector processing unit 102, and a clustering unit 103;
  • The acquisition training unit 101 is configured to acquire all historical voice data and train a Gaussian mixture model and a global difference space matrix according to all the historical voice data.
  • The vector processing unit 102 is configured to project all the historical voice data into a vector space according to the Gaussian mixture model and the global difference space matrix, generate a historical voice feature vector corresponding to each piece of historical voice data, and reduce the dimensionality of the historical voice feature vectors.
  • the clustering unit 103 is configured to cluster the reduced-dimension historical speech feature vectors according to the target clustering model parameters to obtain the speech feature clusters.
  • the target clustering model parameters include: a density domain radius and a core sample threshold.
  • the clustering unit 103 may include: a search subunit 1031, a cluster subunit 1032, and a notification subunit 1033;
  • the searching sub-unit 1031 is configured to generate a sample data set including sample points by using all the reduced-dimension historical speech feature vectors as the sample points, and find all sample points that are core points in the sample data set according to the density domain radius and the core sample threshold;
  • the clustering sub-unit 1032 is configured to determine any one of the core points as a starting point, find all sample points having a density-reachable relationship with the starting point in the sample data set as reachable sample points, and generate a speech feature cluster including the starting point and all of the reachable sample points;
  • the notification sub-unit 1033 is configured to determine the next one of the core points as the starting point, and notify the clustering sub-unit 1032 to generate the speech feature cluster corresponding to that starting point, until all core points have been determined as the starting point.
  • For specific function implementations of the searching sub-unit 1031, the clustering sub-unit 1032, and the notification sub-unit 1033, refer to S403 in the corresponding embodiment of FIG. 4; details are not described herein again.
  • the voice data processing apparatus 1 may further include: an acquisition calculation module 50, a condition determination module 60, a sample setting module 70, a second training module 80, a first update module 90, a second update module 100, and a third update module 110;
  • the acquisition calculation module 50 is configured to acquire the number of historical voice feature vectors included in the voice feature cluster, and determine an intra-class divergence corresponding to the voice feature cluster according to the number of historical voice feature vectors included in the voice feature cluster and the historical voice feature vectors included in the voice feature cluster;
  • the condition determination module 60 is configured to determine that the voice feature cluster meets the high frequency user condition if the number of historical voice feature vectors included in the voice feature cluster is greater than a system number threshold and the intra-class divergence is less than a system intra-class divergence threshold.
  • the sample setting module 70 is configured to acquire sample voice data, and set a corresponding sample user identity label for the sample voice data;
  • the second training module 80 is configured to train initial clustering model parameters according to a clustering algorithm performance parameter maximization condition and the correspondence between the sample voice data and the sample user identity labels, and determine the trained initial clustering model parameters as the target clustering model parameters.
  • For specific function implementations of the sample setting module 70 and the second training module 80, refer to the process of initializing the clustering model parameters in the corresponding embodiment of FIG. 4; details are not described herein again.
  • the first update module 90 is configured to: if the number of historical voice data newly accumulated after clustering reaches a first quantity threshold, or the duration accumulated after clustering reaches a first duration threshold, acquire all historical voice feature vectors that match the user voice model bound to the user identity information as first historical voice feature vectors, and update the current clustering model parameters according to the clustering algorithm performance parameter maximization condition and the correspondence between the first historical voice feature vectors and the bound user identity information, to obtain the target clustering model parameters.
  • the second update module 100 is configured to: if the number of historical voice data newly accumulated after clustering reaches a second quantity threshold, or the duration accumulated after clustering reaches a second duration threshold, acquire all historical voice feature vectors that match user voice models not bound to the user identity information, together with historical voice feature vectors that match none of the user voice models, as second historical voice feature vectors, cluster the second historical voice feature vectors to obtain currently generated voice feature clusters, update the voice feature clusters corresponding to the user voice models not bound to the user identity information according to the currently generated voice feature clusters, and replace the voice feature clusters that do not meet the high frequency user condition.
  • the third update module 110 is configured to: if the number of historical voice data newly accumulated after clustering reaches a third quantity threshold, or the duration accumulated after clustering reaches a third duration threshold, acquire all historical voice feature vectors that match the user voice model bound to the user identity information as third historical voice feature vectors, and update the user voice model bound to the user identity information according to the third historical voice feature vectors;
  • the third update module 110 is further configured to acquire all historical voice feature vectors that match the user voice models not bound to the user identity information as fourth historical voice feature vectors, and update the user voice models not bound to the user identity information according to the fourth historical voice feature vectors.
  • FIG. 9 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application.
  • the voice interaction device 1000 may be the voice interaction device in the foregoing embodiment of FIG. 3 or FIG. 4, and the voice interaction device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005.
  • the voice interaction device 1000 may further include: a user interface 1003, and at least one communication bus 1002. Among them, the communication bus 1002 is used to implement connection communication between these components.
  • the user interface 1003 may include a display and a keyboard.
  • the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 can include a standard wired interface, a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in FIG. 9, an operating system, a network communication module, a user interface module, and a device control application may be included in the memory 1005 as a computer storage medium.
  • In the voice interaction device 1000 shown in FIG. 9, the network interface 1004 can provide a network communication function; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application stored in the memory 1005 to implement the voice data processing method described in FIG. 3 to FIG. 4.
  • the voice interaction device 1000 described in the embodiments of the present application may perform the voice data processing method in the embodiments corresponding to FIG. 3 to FIG. 4, and may also include the voice data processing apparatus 1 in the foregoing embodiment corresponding to FIG. 8; details are not described herein again.
  • the embodiment of the present application further provides a computer storage medium, and the computer storage medium stores the computer program executed by the voice data processing apparatus 1 mentioned above; the computer program includes program instructions, and when the processor executes the program instructions, the voice data processing method in the embodiments corresponding to FIG. 3 to FIG. 4 can be performed; details are not described herein again.
  • the description of the beneficial effects of the same method will not be repeated.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application disclose a voice data processing method, a voice interaction device and a storage medium. The method includes: acquiring historical voice data, acquiring historical voice feature vectors corresponding to the historical voice data, and clustering the historical voice feature vectors to obtain a voice feature cluster, the voice feature cluster including at least one historical voice feature vector with similar features; if the voice feature cluster meets a high frequency user condition, training a corresponding user voice model according to the historical voice feature vectors included in the voice feature cluster; if a current voice feature vector of current voice data is detected to match the user voice model, initiating a user identity association request associated with the current voice data; and if a response message corresponding to the user identity association request is received, binding user identity information in the response message with the user voice model.

Description

一种语音数据处理方法、语音交互设备及存储介质
本申请要求于2017年11月24日提交中国专利局、申请号为201711191651.3、发明名称为“一种语音数据处理方法、装置以及语音交互设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种语音数据处理方法、语音交互设备及存储介质。
发明背景
随着语音识别技术的发展,有越来越多的产品使用到了语音识别技术,例如可以音控的汽车、音箱、电视等等,即语音交互设备可以对说话人的语音进行识别并根据识别的内容实现自动化控制。
可进行语音识别的语音交互设备可以针对不同说话人的语音特征进行个性化服务,在此之前,说话人需要主动对语音交互设备进行语音注册,以注册该说话人的语音特征和该说话人的说话人信息之间的关系,从而在后续识别出某语音与该说话人的语音特征相匹配后,可以提供与该说话人的说话人信息对应的使用权限。但是目前的语音注册过程通常都需要说话人对着语音交互设备重复且清晰地说出许多遍的固定句子,以提取说话人的语音特征,由此可见,目前的语音注册方式是需要由说话人主动发起,且注册时间可能会花费较长时间,导致语音注册效率低下;而且在语音注册过程中,说话人很容易因一时粗心导致说话人的语音内容与系统提供的固定句子不同,进而导致语音注册失败,从而降低了语音注册的成功率。
发明内容
本申请实施例提供一种语音数据处理方法、语音交互设备及存储介质,可提高语音注册效率,且可以提高语音注册的成功率。
本申请的一方面提供了一种语音数据处理方法,由语音交互设备执行,包括:
获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语音特征簇包含至少一个特征相似的历史语音特征向量;
若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定。
本申请的另一方面提供了一种语音交互设备,包括:处理器、存储器;所述处理器与存储器相连,其中,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行以下操作:
获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语音特征簇包含至少一个特征相似的历史语音特征向量;
若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定。
本申请的另一方面提供了一种计算机存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时,执行上述语音数据处理方法。
附图简要说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种系统架构的示意图;
图2a是本申请实施例提供的一种语音数据处理方法的场景示意图;
图2b是本申请实施例提供的另一种语音数据处理方法的场景示意图;
图2c是本申请实施例提供的又一种语音数据处理方法的场景示意图;
图3是本申请实施例提供的一种语音数据处理方法的流程示意图;
图4是本申请实施例提供的另一种语音数据处理方法的流程示意图;
图5是本申请实施例提供的一种参数更新方法的场景示意图;
图6是本申请实施例提供的一种性能验证结果的示意图;
图7是本申请实施例提供的另一种性能验证结果的示意图;
图8是本申请实施例提供的一种语音数据处理装置的结构示意图;
图9是本申请实施例提供的一种语音交互设备的结构示意图。
实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案 进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
请参见图1,是本申请实施例提供的一种系统架构的示意图。如图1所示,该系统架构可以包括语音交互设备100a和后台服务器100b,语音交互设备100a可以通过互联网连接后台服务器100b;语音交互设备100a可以包括可进行语音识别的智能设备,如智能音箱、智能手机、电脑、智能电视、智能空调等。语音交互设备100a可以接收用户的语音数据,并将语音数据发送到后台服务器100b,使得后台服务器100b可以对语音数据进行语音识别并根据识别出的语义生成控制指令,语音交互设备100a接收后台服务器100b发送的控制指令以执行对应的控制操作。例如,用户对着语音交互设备100a说“播放歌曲A”,则语音交互设备100a可以将语音数据“播放歌曲A”发送到后台服务器100b,后台服务器100b对语音数据“播放歌曲A”进行语音识别并生成对应的控制指令,语音交互设备100a接收该控制指令并根据该控制指令播放歌曲A。
其中,后台服务器100b还可以根据语音交互设备100a所发送的语音数据进行高频用户的发现以及主动对高频用户发起身份注册,该过程可以一并参见图2a-图2c,均为本申请实施例提供的一种语音数据处理方法的场景示意图。如图2a所示,语音交互设备100a可以接收多个用户的语音数据,每个语音数据都可以转发到后台服务器100b(其中根据语音数据进行语音控制的过程请参见上述图1对应实施例),因此,后台服务器100b可以存储大量语音数据,形成历史语音数据;如图2a所示,后台服务器100b可以对所有历史语音数据进行聚类,得到语音特征簇1、语音特征簇2、语音特征簇3、语音特征簇4,每个语音特征簇包含至少一个特征相似的历史语音特征向量,历史语音特征向量可以是 指历史语音数据对应的i-Vector。如图2a所示,后台服务器100b可以进一步基于所有语音特征簇进行高频用户发现,具体可以通过分析每个语音特征簇中的历史语音特征向量的数量和历史语音特征向量的分布密度,来确定语音特征簇所对应的用户是否为高频用户,在图2a中,后台服务器100b分析出语音特征簇1和语音特征簇4所对应的用户属于高频用户(经常向语音交互设备100a发送语音数据的用户可以确定为高频用户),后台服务器100b再进一步为语音特征簇1创建对应的用户语音模型1,并为语音特征簇4创建对应的用户语音模型4。其中,用户语音模型1和用户语音模型4均属于未绑定用户身份信息的用户语音模型,即未注册的用户语音模型。
进一步的,再请一并参见图2b,在图2b中,语音交互设备100a可以将当前时刻的用户1的语音数据转发至后台服务器100b(其中根据该语音数据进行语音控制的过程请参见上述图1对应实施例),后台服务器100b可以对该语音数据进行模型匹配,具体是将该语音数据对应的i-Vector分别与图2a中的用户语音模型1和用户语音模型4进行比较,如图2b所示,用户语音模型1与该语音数据相匹配,此时,可以向语音交互设备100a发起与用户语音模型1对应的用户身份关联请求,语音交互设备100a中的身份关联模块可以向用户1发出用户注册提示音(例如,该提示音为“请输入您的身份信息”),用户1根据该提示音可以通过语音或客户端的方式将用户1的身份信息发送给语音交互设备100a,语音交互设备100a再将用户1的身份信息转发给后台服务器100b,后台服务器100b可以对用户1的身份信息进行用户身份注册,该注册过程即为将用户语音模型1与用户1的身份信息进行绑定。其中,用户语音模型1属于已绑定用户身份信息的用户语音模型,即为已注册的用户语音模型;用户语音模型4仍属于未绑定用户身份信息的用户语音模型。由此可见,可以避免用户需要重复多次发出固定句子的语音内容才能实 现语音注册,即用户只需响应语音交互设备所发起的用户身份关联请求即可完成语音注册,进而可以提高语音注册的效率。
在对用户语音模型1进行身份信息绑定后,再一并参见图2c,如图2c所示,后台服务器100b可以持续接收语音交互设备100a发送的语音数据,以形成更多的历史语音数据,为了保证后台服务器100b可以继续发现新的高频用户,后台服务器100b可以定时或定量的对历史语音数据进行重新聚类。在图2c中,后台服务器100b可以对所有历史语音数据中除了与用户语音模型1相匹配的历史语音数据以外的其他历史语音数据进行聚类(由于用户语音模型1已完成注册,所以无需再对与用户语音模型1相匹配的历史语音数据进行聚类),得到语音特征簇2、语音特征簇3、语音特征簇4、语音特征簇5;图2c中的语音特征簇2可以包含图2a中的语音特征簇2,且还可以包含一些新增的历史语音特征向量;图2c中的语音特征簇3可以包含图2a中的语音特征簇3,且还可以包含一些新增的历史语音特征向量;图2c中的语音特征簇4可以包含图2a中的语音特征簇4,且还可以包含一些新增的历史语音特征向量;图2c中的语音特征簇5为新增加的语音特征簇。后台服务器100b进一步在语音特征簇2、语音特征簇3、语音特征簇4、语音特征簇5中进行高频用户发现,进而分析出语音特征簇3、语音特征簇4分别对应的用户属于高频用户,由于语音特征簇4已有对应的用户语音模型4,所以只需要创建语音特征簇3对应的用户语音模型3;其中,还可以通过图2c中的语音特征簇4中的所有历史语音特征向量对已有的用户语音模型4进行更新。如图2c所示,此时的后台服务器100b中的用户语音模型包括用户语音模型1、用户语音模型3、用户语音模型4,用户语音模型1属于已绑定用户身份信息的用户语音模型;用户语音模型3、用户语音模型4属于未绑定用户身份信息的用户语音模型,因此,后续在检测到语音数据与用户语音模型3或用户语音模型4相匹配时,可以发起用 户身份注册。随着用户使用量的增加,可以增加更多的未绑定用户身份信息的用户语音模型,且基于用户身份自动注册的机制,可以逐渐将未绑定用户身份信息的用户语音模型转换为已绑定用户身份信息的用户语音模型,即逐渐完成每个高频用户的身份注册。
后台服务器100b的所有功能均可以集成到语音交互设备100a中,即语音交互设备100a可以直接对语音数据进行语音识别以实现语音控制,语音交互设备100a也可以直接根据所接收到的语音数据进行高频用户发现以及主动对高频用户发起身份注册。
以下图3-图9对应的实施例,以后台服务器100b集成到语音交互设备100a为例,对高频用户发现以及主动对高频用户发起身份注册的具体过程进行详细描述。
请参见图3,是本申请实施例提供的一种语音数据处理方法的流程示意图,所述方法可以包括:
S301,获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语音特征簇包含至少一个特征相似的历史语音特征向量;
具体的,在启动语音交互设备(该语音交互设备可以具体为上述图1对应实施例中的集成有后台服务器100b的所有功能的语音交互设备100a)后,语音交互设备可直接对所获取到的用户语音进行语义识别,进而执行与语义相关联的控制操作。其中,语音交互设备可以包括音响/音箱、电视、汽车、手机、VR(Virtual Reality,虚拟现实)设备等可进行语音交互识别和控制的智能设备。例如,用户对语音交互设备说“播下一首歌”,则语音交互设备在分析出该语义后,语音交互设备即可将当前歌曲切换到下一首歌进行播放。因此,语音交互设备可以无需等待用户完成语音注册后才能启动语音控制功能,即在将用户的语音特征与用户身份信息进行绑定之前,语音交互设备即可根据用户的语音内容执 行相关联的控制操作。而且多个不同的用户均可以向该语音交互设备说出用户语音,使得语音交互设备可以根据各用户语音指令执行相关联的控制操作,语音交互设备还可以记录和保存各用户语音,并将所保存的每条用户语音均确定为历史语音数据。
当所保存的历史语音数据的数量达到第一数量阈值时,获取所有历史语音数据,并确定每个历史语音数据分别对应的历史语音特征向量。历史语音特征向量可以为i-vector(identity-vector)。其中获取i-vector的过程可以为:首先利用所有历史语音数据训练得到高阶的高斯混合模型(GMM,Gaussian Mixture Model),GMM可以用于刻画说话人的语音特征空间,这个模型通常被称作通用背景模型(UBM,Universal Background Model),即GMM-UBM模型;再利用GMM-UBM模型对每条历史语音数据进行参数估计,以确定高斯混合模型各分量的混合权重、均值向量、方差矩阵分别对应的零阶、一阶和二阶Baum-Welch统计量,然后再利用EM(Expectation Maximization Algorithm,期望最大化算法)算法迭代得到全局差异矩阵T;通过矩阵T,可以把分别把每条历史语音数据中隐含在高维说话人语音空间中的说话人及通道相关特性,投影到低维空间中,从而获得每条历史语音数据的历史语音特征向量,也就是i-Vector。即确定i-Vector的具体公式可以为:M=m+Tw,其中,M代表高斯混合模型的高维均值超矢量,m代表与说话人信息和信道信息无关的一个超矢量,T为全局差异空间,w是包含整段语音中的说话人信息和信道信息的一个全差异因子(即i-Vector)。全局差异矩阵T也可以基于深度神经网络训练而成。
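A minimal sketch of the projection step just described, assuming a GMM-UBM mean supervector m and a trained total-variability (global difference) matrix T are already available; the least-squares estimate of w below is a deliberate simplification of the full Baum-Welch-statistics-based estimator, and all variable names and sizes are illustrative assumptions:

```python
import numpy as np

def extract_ivector_simplified(utterance_supervector, ubm_mean_supervector, T):
    """Simplified i-vector estimate for M = m + T w.

    utterance_supervector: high-dimensional GMM mean supervector M of one utterance
    ubm_mean_supervector:  speaker/channel-independent supervector m from the UBM
    T:                     global difference (total variability) matrix, shape (D, d)

    A full extractor weights the estimate by per-component Baum-Welch statistics;
    here w is recovered with ordinary least squares purely to illustrate the
    projection into the low-dimensional space.
    """
    residual = utterance_supervector - ubm_mean_supervector   # M - m
    w, *_ = np.linalg.lstsq(T, residual, rcond=None)          # solve T w ≈ M - m
    return w                                                  # the i-vector

# toy usage with random stand-in data
rng = np.random.default_rng(0)
D, d = 1024, 100                      # supervector dim, i-vector dim (assumed sizes)
T = rng.normal(size=(D, d))
m = rng.normal(size=D)
M = m + T @ rng.normal(size=d)
print(extract_ivector_simplified(M, m, T).shape)   # (100,)
```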
语音交互设备可以进一步对历史语音特征向量进行降维,并根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到语音特征簇。其中,对历史语音特征向量进行降维的过程可以为:采用PCA(Principal Component Analysis,主成份分析)、tSNE(t-distributed  stochastic neighbor embedding,t分布领域嵌入)及LDA(Linear Discriminant Analysis,线性判别分析)等算法,对获取到的历史语音特征向量(即i-vector)进行数据降维处理,去除数据中冗余的多重共线成分,减小聚类的计算量。其中,利用PCA和tSNE的降维是非监督的,即毋须预先训练模型,可以直接应用于对i-vector进行降维;其中,利用LDA降维需要预先使用带实际标签的i-vector数据训练出最优的投影方向,然后将其应用于对i-vector进行降维。
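One possible unsupervised reduction of the i-vectors with PCA, as one of the options mentioned above; the toy data and the target dimension of 20 are placeholders, and tSNE or a pre-trained LDA projection could stand in the same place:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
ivectors = rng.normal(size=(500, 100))   # 500 historical utterances, 100-dim i-vectors (toy data)

pca = PCA(n_components=20)               # keep 20 principal components (assumed target dimension)
reduced_ivectors = pca.fit_transform(ivectors)
print(reduced_ivectors.shape)            # (500, 20)
```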
其中,对所有历史语音特征向量进行聚类的具体过程可以为:采用DBSCAN(Density-Based Spatial Clustering of Applications with Noise,基于密度的带噪声应用的空间聚类)聚类算法,使用欧式距离作为样本的距离度量,将降维后的历史语音特征向量聚类成簇(即语音特征簇)。DBSCAN聚类算法可以找出特征空间中形状不规则的簇,且聚类时毋须事先设定簇的数量,从而可以满足本申请实施例中说话人数量事先未知的场景需求。
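The clustering step sketched with scikit-learn's DBSCAN and the Euclidean metric; eps and min_samples stand in for the trained (Eps, MinPts) parameters, and their values and the data here are arbitrary placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
reduced_ivectors = rng.normal(size=(500, 20))      # dimension-reduced historical i-vectors (toy data)

labels = DBSCAN(eps=0.8, min_samples=5, metric="euclidean").fit_predict(reduced_ivectors)

# -1 marks noise; every other label is one speech feature cluster
clusters = {c: np.where(labels == c)[0] for c in set(labels) if c != -1}
print(len(clusters), "speech feature clusters,", int((labels == -1).sum()), "noise samples")
```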
S302,若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
具体的,在生成至少一个语音特征簇后,针对每一个语音特征簇,均可以根据语音特征簇中的历史语音特征向量的数量以及该语音特征簇中的历史语音特征向量的分布情况,确定该语音特征簇是否满足高频用户条件。例如,语音特征簇中的历史语音特征向量的数量超过预设的数量阈值以及该语音特征簇中的历史语音特征向量的分布密度也超过预设的密度阈值,则可以确定该语音特征簇满足高频用户条件,也即说明该语音特征簇所对应的说话人为经常与语音交互设备进行语音交互的用户。
在确定出某语音特征簇满足高频用户条件后,可以根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型。其中, 训练用户语音模型的过程可以为:获取满足高频用户条件的语音特征簇中的所有历史语音特征向量,并对所获取的这些历史语音特征向量进行均值计算或插值计算,得到目标历史语音特征向量,并将所述目标历史语音特征向量作为所述语音特征簇对应的用户语音模型的模型参数。其中,对语音特征簇中的历史语音特征向量进行均值计算的方式可以为:将语音特征簇中的各历史语音特征向量进行相加,再除以该语音特征簇中的历史语音特征向量的数量,得到目标历史语音特征向量。或者,对语音特征簇中的历史语音特征向量进行均值计算的方式可以为:根据权重系数对语音特征簇中的各历史语音特征向量进行权重相加,再除以该语音特征簇中的历史语音特征向量的数量,得到目标历史语音特征向量。
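A sketch of the mean-based model training described above; the optional weighted variant follows the text's description of a weighted sum divided by the number of vectors, with the weight values themselves being assumptions:

```python
import numpy as np

def train_user_voice_model(cluster_ivectors, weights=None):
    """cluster_ivectors: (n, d) historical i-vectors of one qualifying speech feature cluster.
    Returns the target historical i-vector used as the user voice model's parameter."""
    cluster_ivectors = np.asarray(cluster_ivectors)
    if weights is None:
        return cluster_ivectors.mean(axis=0)                                   # plain mean
    weights = np.asarray(weights)[:, None]
    return (weights * cluster_ivectors).sum(axis=0) / len(cluster_ivectors)    # weighted sum / count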
S303,若检测到当前语音数据的语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
具体的,语音交互设备每接收到一条语音数据,都会获取该语音数据的语音特征向量(即i-vector),然后将该语音数据的语音特征向量与各个已创建的用户语音模型进行比较。如果当前接收到的一条语音数据(即当前语音数据)与某一个用户语音模型相匹配,且该用户语音模型未与用户身份信息进行绑定,则语音交互设备可以发起与当前语音数据相关联的用户身份关联请求,该用户身份关联请求的具体形式可以为:用于进行用户身份关联的语音提示(如语音交互设备发送语音“请绑定您的身份信息”),或者是向用户终端发送的用于进行用户身份关联的注册界面(如该注册界面可以在用户手机上显示,用户通过在注册界面上填写自己的身份信息,或也可以在注册界面上与用户账号进行绑定,以完成语音注册)。其中,可以通过欧氏距离分别确定当前语音数据的i-vector与每个用户语音模型中的i-vector之间的向量距离,将向量距离小于距离阈值的用户语音模型确定为与当前语音数据相匹配的用户语音模型。
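Matching an incoming utterance's i-vector against the existing user voice models by Euclidean distance, as described above; the distance threshold is an assumed system hyperparameter:

```python
import numpy as np

def match_user_model(current_ivector, user_models, distance_threshold=1.0):
    """user_models: dict mapping a model id to its i-vector model parameter.
    Returns the id of the closest model within the threshold, or None if nothing matches."""
    best_id, best_dist = None, float("inf")
    for model_id, model_ivector in user_models.items():
        dist = np.linalg.norm(np.asarray(current_ivector) - np.asarray(model_ivector))  # Euclidean distance
        if dist < best_dist:
            best_id, best_dist = model_id, dist
    return best_id if best_dist < distance_threshold else None
```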
若与当前语音数据相匹配的用户语音模型为已绑定用户身份信息的用户语音模型,则将当前语音数据对应的语音特征向量保存至相匹配的用户语音模型对应的语音特征簇中,以便于用于后续更新该用户语音模型以提高该用户语音模型的准确性,同时,语音交互设备也可以根据该用户语音模型向当前语音数据对应的说话人提供相应的个性化服务,例如,若当前语音数据为“播放A歌曲”,则语音交互设备可以根据该用户语音模型所绑定的用户身份信息,获取与该用户身份信息对应的用户习惯参数(如该用户所喜欢的音调、音量等参数),并根据用户习惯参数对A歌曲的音频参数进行调整和播放;或者,若该用户语音模型所绑定的用户身份信息为管理员身份信息,则语音交互设备可以向当前语音数据对应的说话人开放系统管理权限。
S304,若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定;
具体的,语音交互设备发起用户身份关联请求后,当前语音数据对应的说话人可以通过语音反馈对应的响应消息以完成语音注册。例如,说话人可以说出响应消息“我的身份信息为XXXX”,则语音交互设备可以通过语音识别得知该响应消息中的用户身份信息为“XXXX”,进行将用户身份信息“XXXX”与当前语音数据相匹配的用户语音模型进行绑定。
或者,若用户身份关联请求的具体形式是向用户终端发送的用于进行用户身份关联的注册界面,则当前语音数据对应的说话人可以通过该注册界面输入对应的响应消息以完成语音注册。例如,注册界面中包含用户名称输入框、密码输入框、用户兴趣爱好输入框等等,说话人可以在注册界面中的各个输入框中输入相应的数据,并在点击提交后,用户终端可以将注册界面中所输入的数据封装为响应消息,并将该响应消息发送给语音交互设备,使得语音交互设备将响应消息中的用户身份信息 (如包含注册界面中所输入的用户名称、密码、用户兴趣爱好等信息)与当前语音数据相匹配的用户语音模型进行绑定。
请参见图4,是本申请实施例提供的另一种语音数据处理方法的流程示意图,所述方法可以包括:
S401,获取所有历史语音数据,并根据所述所有历史语音数据训练高斯混合模型和全局差异空间矩阵;
S402,根据所述高斯混合模型和所述全局差异空间矩阵将所述所有历史语音数据投影至向量空间,生成每个历史语音数据分别对应的历史语音特征向量,并对所述历史语音特征向量进行降维;
其中,S401-S402步骤的具体实现方式可以参见上述图3对应实施例中的S301,这里不再进行赘述。
S403,根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到所述语音特征簇;
具体的,语音交互设备(该语音交互设备可以具体为上述图1对应实施例中的集成有后台服务器100b的所有功能的语音交互设备100a)可以基于DBSCAN聚类算法对降维后的历史语音特征向量进行聚类,DBSCAN聚类算法可以假设聚类结构能够通过样本分布的紧密程度来确定,该紧密程度可由一对参数(Eps,MinPts)刻画,Eps为定义密度时的邻域半径,MinPts为定义核心样本时的阈值,即目标聚类模型参数可以包括:Eps(即密度领域半径)和MinPts(即核心样本阈值)。基于DBSCAN聚类算法可以以所有降维后的历史语音特征向量为样本点生成包含所述样本点的样本数据集,并根据所述密度领域半径和所述核心样本阈值在所述样本数据集中查找所有为核心点的样本点;在所有核心点中确定任意一个核心点为出发点,并在所述样本数据集中查找与所述出发点具有密度可达关系的所有样本点,作为可达样本点(为了区别其他的样本点,所以这里将与所述出发点具有密度可达关系的样本点定义 为可达样本点),并生成包含所述出发点和所有所述可达样本点的语音特征簇,并将所有核心点中的下一个核心点确定为所述出发点,重复执行本步骤,直至所有核心点均被确定为所述出发点。
例如,假设将所有历史语音数据对应的历史语音特征向量确定为样本数据集D={x_1,x_2,...,x_m},其中每个样本点x_j即为一条历史语音数据对应的历史语音特征向量,设任意两个样本点的距离函数为dist()。其中,Eps-邻域的定义为:对x_j∈D,其Eps-邻域包含D中与x_j的距离不大于Eps的样本点,即N_Eps(x_j)={x_i∈D|dist(x_i,x_j)≤Eps}。密度直达的定义为:若x_j位于x_i的Eps-邻域中,且x_i是核心点,则称x_j由x_i密度直达。密度可达的定义为:对x_i与x_j,若存在样本点序列p_1,p_2,...,p_n,其中p_1=x_i,p_n=x_j,且p_{i+1}由p_i密度直达,则称x_j由x_i密度可达。密度相连的定义为:对x_i与x_j,若存在x_k使得x_i与x_j均由x_k密度可达,则称x_i与x_j密度相连。基于上述概念,DBSCAN聚类算法可以将样本数据集D中的所有样本点分为三类:核心点、边界点以及噪声点;其中,核心点为:在半径Eps内含有不少于MinPts个样本点的样本点;边界点为:在半径Eps内样本点数量小于MinPts且落在某核心点邻域内的样本点;噪声点为:既非核心点也非边界点的样本点。因此,一个聚类簇定义为一个具有密度相连关系的核心点与边界点的集合。DBSCAN聚类算法首先根据参数(Eps,MinPts)找出样本数据集D中所有的核心点,然后以任意核心点为出发点找出所有由其密度可达的样本点,作为可达样本点,并生成包含出发点和所有可达样本点的语音特征簇,直到所有的核心点均被访问为止,即每个语音特征簇均可以包含至少一个特征相似的历史语音特征向量。生成语音特征簇的具体算法流程描述如下:
(The DBSCAN pseudocode for generating speech feature clusters is given in the original only as images PCTCN2018116590-appb-000001 and PCTCN2018116590-appb-000002; an illustrative re-implementation follows below.)
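Because the original listing survives only as images, the sketch below re-implements the procedure exactly as the surrounding text describes it: find every core point under (Eps, MinPts), then grow each unvisited core point into a cluster of density-reachable samples. It is an illustrative re-implementation, not the patent's own listing:

```python
import numpy as np

def dbscan_speech_clusters(samples, eps, min_pts):
    """samples: (n, d) dimension-reduced historical i-vectors.
    Returns a list of clusters, each a list of sample indices; noise points join no cluster."""
    samples = np.asarray(samples)
    n = len(samples)
    dist = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]      # Eps-neighborhoods
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}     # core points

    visited, clusters = set(), []
    for start in core:                    # take any unvisited core point as the starting point
        if start in visited:
            continue
        cluster, queue = [], [start]
        visited.add(start)
        while queue:                      # collect all density-reachable sample points
            p = queue.pop()
            cluster.append(p)
            if p in core:
                for q in neighbors[p]:
                    if q not in visited:
                        visited.add(q)
                        queue.append(q)
        clusters.append(cluster)
    return clusters
```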
由上述内容可知,DBSCAN聚类算法中两个关键参数Eps和MinPts直接决定着聚类的性能。
S404,获取所述语音特征簇所包含的所述历史语音特征向量的数量,并根据所述语音特征簇所包含的所述历史语音特征向量的数量,以及所 述语音特征簇所包含的所述历史语音特征向量,确定所述语音特征簇对应的类内散度;
具体的,若该语音特征簇所包含的历史语音特征向量的数量大于系统数量阈值,且类内散度小于系统类内散度阈值,则确定该语音特征簇满足高频用户条件,进而可以分析出哪些语音特征簇满足高频用户条件。其中,确定类内散度divergence的公式为:
(The intra-class divergence formula is given in the original only as image PCTCN2018116590-appb-000003; a hedged reading of it is sketched after the variable definitions below.)
其中,|C|表示某语音特征簇内的样本数量(即语音特征簇所包含的历史语音特征向量的数量),x_i和x_j是该语音特征簇的两个样本点(即该语音特征簇中的两个历史语音特征向量),‖·‖_2表示计算代数式的2-范数。其中,系统数量阈值可以为base_frequency,系统类内散度阈值可以为base_divergence,即若该语音特征簇内的样本数量大于base_frequency且该语音特征簇的类内散度小于base_divergence,则可以确定该语音特征簇满足高频用户条件。其中,base_frequency和base_divergence为由系统设置的超参数。
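A sketch of the high-frequency check assembled from the quantities defined above. Since the divergence formula itself is only available as an image, the pairwise-mean normalization below is my reading of it, and the base_frequency / base_divergence values are placeholders for the system hyperparameters:

```python
import numpy as np

def intra_class_divergence(cluster_ivectors):
    """Mean pairwise 2-norm distance between the cluster's historical i-vectors."""
    cluster_ivectors = np.asarray(cluster_ivectors)
    c = len(cluster_ivectors)
    if c < 2:
        return 0.0
    dists = [np.linalg.norm(cluster_ivectors[i] - cluster_ivectors[j])
             for i in range(c) for j in range(i + 1, c)]
    return 2.0 * sum(dists) / (c * (c - 1))

def is_high_frequency_cluster(cluster_ivectors, base_frequency=20, base_divergence=0.8):
    """High-frequency user condition: enough samples and a tight intra-class divergence."""
    return (len(cluster_ivectors) > base_frequency
            and intra_class_divergence(cluster_ivectors) < base_divergence)
```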
S405,若所述语音特征簇所包含的所述历史语音特征向量的数量大于系统数量阈值,且所述类内散度小于系统类内散度阈值,则确定所述语音特征簇满足高频用户条件,并根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
具体的,在确定出某语音特征簇满足高频用户条件后,可以根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型。其中,训练用户语音模型的过程可以为:获取满足高频用户条件的语音特征簇中的所有历史语音特征向量,并对所获取的这些历史语音特征向量进行均值计算或插值计算,得到目标历史语音特征向量,并将所述目标历史语音特征向量作为所述语音特征簇对应的用户语音模型的 模型参数。
S406,若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
S407,若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定;
其中,S406-S407的具体实现方式可以参见上述图3对应实施例中的S303-S304,这里不再进行赘述。
S408,若聚类后所累计新增的历史语音数据的数量达到第一数量阈值,或聚类后所累计时长达到第一时长阈值,则更新当前的聚类模型参数,得到所述目标聚类模型参数;
具体的,由于随着历史语音数据的增加,可能会新增一些高频用户,因此,需要定时进行重新聚类,以划分出新的语音特征簇,并且在新的语音特征簇满足高频用户条件时可以进一步训练对应的用户语音模型以及绑定相应的用户身份信息。而DBSCAN聚类算法中两个关键参数Eps和MinPts直接决定着聚类的性能,为了逐步提高聚类算法的性能,可以定时对Eps和MinPts进行更新,即Eps和MinPts越准确,则聚类出的新的语音特征簇就越精确。因此,若聚类后所累计新增的历史语音数据的数量达到第一数量阈值,或聚类后所累计时长达到第一时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第一历史语音特征向量;根据聚类算法性能参数最大化条件、所述第一历史语音特征向量与已绑定的所述用户身份信息之间的对应关系,更新当前的聚类模型参数,得到所述目标聚类模型参数。其中,聚类算法性能参数可以包括两个外部指标,即Jaccard系数(杰卡德系数,JC)和Rand指数(兰德指数,RI),通过JC和RI可以衡量聚类算法的性能,即当聚类性能提升时,JC和RI也会随之增大;其中,聚类算法性能参数最大化条件可以是指JC最大化的条件。其中,JC=SS/ (SS+SD+DS),RI=(SS+DD)/(SS+SD+DS+DD),其中,SS表示实际标签相同且聚类标签也相同的样本点对的数量,SD表示实际标签相同但聚类标签不同的样本点对的数量,DS表示实际标签不同但聚类标签相同的样本点对的数量,DD表示实际标签不同且聚类标签也不同的样本点对的数量(这里的标签可以是指说话人的身份信息)。
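The two external indices can be computed directly from the SS/SD/DS/DD pair counts defined above; a sketch where the labels are integer sequences of true speaker identities and cluster assignments:

```python
from itertools import combinations

def jaccard_and_rand(true_labels, cluster_labels):
    """JC = SS/(SS+SD+DS), RI = (SS+DD)/(SS+SD+DS+DD), counted over all sample pairs."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_clus = cluster_labels[i] == cluster_labels[j]
        if same_true and same_clus:
            ss += 1
        elif same_true:
            sd += 1
        elif same_clus:
            ds += 1
        else:
            dd += 1
    jc = ss / (ss + sd + ds) if (ss + sd + ds) else 0.0
    ri = (ss + dd) / (ss + sd + ds + dd)
    return jc, ri

print(jaccard_and_rand([0, 0, 1, 1], [0, 0, 0, 1]))   # toy example: (0.25, 0.5)
```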
例如,更新当前的聚类模型参数的具体过程可以为:获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第一历史语音特征向量,并将70%的第一历史语音特征向量作为训练集,并将剩余的30%的第一历史语音特征向量作为验证集;使用训练集训练一个DBSCAN聚类模型,训练目标为最大化JC;为避免训练过拟合,在训练过程中确定该聚类模型在验证集上的JC,选择使验证集JC值最大的Eps和MinPts参数作为优化的模型参数(即目标聚类模型参数)。之后可以继续定时或定量的更新目标聚类模型参数,使得可以逐渐优化目标聚类模型参数。
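One hedged reading of the update step above is a small grid search over (Eps, MinPts) that scores JC on both the 70% training split and the 30% validation split and keeps the parameters with the best validation JC; the parameter grid is an assumption, and jaccard_and_rand refers to the helper from the previous sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def tune_dbscan_params(train_x, train_y, val_x, val_y,
                       eps_grid=(0.3, 0.5, 0.8, 1.2), minpts_grid=(3, 5, 8)):
    """Grid search over (Eps, MinPts); the pair with the highest validation JC is kept."""
    best = {"params": None, "jc_val": -1.0, "jc_train": -1.0}
    for eps in eps_grid:
        for min_pts in minpts_grid:
            jc_train, _ = jaccard_and_rand(
                train_y, DBSCAN(eps=eps, min_samples=min_pts).fit_predict(np.asarray(train_x)))
            jc_val, _ = jaccard_and_rand(
                val_y, DBSCAN(eps=eps, min_samples=min_pts).fit_predict(np.asarray(val_x)))
            if jc_val > best["jc_val"]:   # select on the held-out split to limit overfitting
                best = {"params": (eps, min_pts), "jc_val": jc_val, "jc_train": jc_train}
    return best
```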
也可以在生成目标聚类模型参数后累计新增的历史语音数据的数量,且该数量达到第一数量阈值时,执行S408步骤;或者,在生成目标聚类模型参数后开始累计时长,且该累计时长达到第一时长阈值时,执行S408步骤。
S409,若聚类后所累计新增的历史语音数据的数量达到第二数量阈值,或聚类后所累计时长达到第二时长阈值,则对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,并对未满足所述高频用户条件的语音特征簇进行替换;
具体的,由于随着历史语音数据的增加,可能会新增一些高频用户,因此,需要定时进行重新聚类,以划分出新的语音特征簇,进而可以通过新的语音特征簇来发现新的高频用户。因此,若聚类后所累计新增的历史语音数据的数量达到第二数量阈值,或聚类后所累计时长达到第二 时长阈值,则获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,以及与所有用户语音模型均不匹配的历史语音特征向量(即不属于高频用户的历史语音特征向量),作为第二历史语音特征向量,并对所述第二历史语音特征向量进行聚类,得到当前生成的语音特征簇;其中,对第二历史语音特征向量进行聚类的过程可以参见上述步骤S403,这里不再进行赘述。其中,对第二历史语音特征向量进行聚类之前,可以先对还未降维的第二历史语音特征向量进行降维处理。
再根据所述当前生成的语音特征簇对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,该更新过程可以具体为:检测每个当前生成的语音特征簇是否满足高频用户条件,并将当前生成的语音特征簇中满足高频用户条件的语音特征簇确定为待更新语音特征簇,训练待更新语音特征簇对应的用户语音模型,并将待更新语音特征簇对应的用户语音模型与重新聚类之前已存在的未绑定所述用户身份信息的用户语音模型进行比较,若存在某个待更新语音特征簇对应的用户语音模型与某个未绑定所述用户身份信息的用户语音模型相近似(如两个用户语音模型的i-Vector之间的向量距离小于预设距离阈值),则可以将该待更新语音特征簇中的用户画像数据传导和继承到与之具有相近似的用户语音模型的语音特征簇中,以完成对该未绑定所述用户身份信息的用户语音模型所对应的语音特征簇的更新。
再根据所有当前生成的语音特征簇中除了已用于传导和继承用户画像数据的待更新语音特征簇以外的语音特征簇,对未满足所述高频用户条件的语音特征簇进行替换,即将重新聚类之前已存在的未满足所述高频用户条件的语音特征簇删除,并保留所有当前生成的语音特征簇中除了已用于传导和继承用户画像数据的待更新语音特征簇以外的语音特征簇。例如,若重新聚类之前存在语音特征簇a1(未满足高频用户 条件)、语音特征簇a2(未满足高频用户条件)、语音特征簇a3(具有未绑定用户身份信息的用户语音模型)、语音特征簇a4(具有未绑定用户身份信息的用户语音模型)、语音特征簇a5(具有已绑定用户身份信息的用户语音模型);重新聚类后,得到当前生成的语音特征簇b1、语音特征簇b2、语音特征簇b3、语音特征簇b4,其中,语音特征簇b1、语音特征簇b2均未满足高频用户条件,语音特征簇b3和语音特征簇b4均满足高频用户条件,并进一步训练语音特征簇b3对应的用户语音模型和语音特征簇b4对应的用户语音模型,其中,语音特征簇b3对应的用户语音模型与语音特征簇a4对应的用户语音模型相近似,因此,可以将语音特征簇b3中的用户画像数据传导和继承到语音特征簇a4中,以完成对语音特征簇a4的更新;其中,语音特征簇b4对应的用户语音模型与语音特征簇a4对应的用户语音模型、语音特征簇a3对应的用户语音模型均不相似,进而可以保留语音特征簇b4、语音特征簇b1、语音特征簇b2,并删除语音特征簇a1和语音特征簇a2,此时,语音交互设备中的所有语音特征簇包括:语音特征簇b4、语音特征簇b1、语音特征簇b2、更新后的语音特征簇a4、语音特征簇a3以及语音特征簇a5。
S410,若聚类后所累计新增的历史语音数据的数量达到第三数量阈值,或聚类后所累计时长达到第三时长阈值,则更新用户语音模型;
具体的,若聚类后所累计新增的历史语音数据的数量达到第三数量阈值,或聚类后所累计时长达到第三时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第三历史语音特征向量,并根据所述第三历史语音特征向量更新已绑定所述用户身份信息的用户语音模型;其中,某个已绑定所述用户身份信息的用户语音模型对应的第三历史语音特征向量可以包括:已有的历史语音特征向量和聚类后所累计新增的历史语音特征向量;该用户语音模型的模型参数(模型参数也为i-Vector)是根据已有的历史语音特征向量生 成的,因此,更新该用户语音模型的过程可以为:对该用户语音模型的模型参数和聚类后所累计新增的历史语音特征向量进行均值计算或插值计算,得到更新后的历史语音特征向量,并用该更新后的历史语音特征向量替换该用户语音模型的模型参数,以完成对该用户语音模型的更新。以均值计算的方式更新用户语音模型为例,某个已绑定用户身份信息的用户语音模型A包含模型参数a1,且聚类后所新增的且与该用户语音模型A相匹配的历史语音特征向量包括:历史语音特征向量b1、历史语音特征向量b2、……、历史语音特征向量bn,则更新后的用户语音模型A所包含的模型参数=(a*a1+b*(b1+b2+……+bn))/(n+1),其中,a和b为权重值。
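The incremental update of a bound model's parameter, written directly from the worked example above; the weight values a and b are assumptions:

```python
import numpy as np

def update_user_voice_model(model_param, new_ivectors, a=1.0, b=1.0):
    """Updated params = (a*a1 + b*(b1 + ... + bn)) / (n + 1), where a1 is the current
    model parameter and b1..bn are the newly accumulated matching i-vectors."""
    new_ivectors = np.asarray(new_ivectors)
    n = len(new_ivectors)
    return (a * np.asarray(model_param) + b * new_ivectors.sum(axis=0)) / (n + 1)
```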
在更新已绑定所述用户身份信息的用户语音模型的同时,还可以获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第四历史语音特征向量,并根据所述第四历史语音特征向量更新未绑定所述用户身份信息的用户语音模型。其中,更新未绑定所述用户身份信息的用户语音模型的具体过程与更新已绑定所述用户身份信息的用户语音模型的过程相同,这里不再进行赘述。
还可以在更新某用户语音模型后累计与该用户语音模型相匹配的新增历史语音数据的数量,且该数量达到第三数量阈值时,执行S410步骤;或者,在更新某用户语音模型后开始累计时长,且该累计时长达到第三时长阈值时,执行S410步骤。
其中,S408的步骤可以在S401-S407之间的任一时刻或S401之前或S407之后执行,即每次聚类后均可以定时或定量的更新当前的聚类模型参数,因此,不对S408的步骤执行顺序进行限定。S409的步骤可以在S401-S407之间的任一时刻或S401之前或S407之后执行,即可以定时或定量的进行重新聚类,以更新或替换相应的语音特征簇,因此,不对S409的步骤执行顺序进行限定。S410的步骤可以在S401-S407之 间的任一时刻或S401之前或S407之后执行,即每次聚类后均可以定时或定量的更新相应的用户语音模型,因此,不对S410的步骤执行顺序进行限定。
其中,第一数量阈值、第二数量阈值、第三数量阈值之间可以相同或不同,第一时长阈值、第二时长阈值、第三时长阈值之间也可以相同或不同,这里不对其进行限定。若为了保证语音交互设备的工作效率,可以设置第一数量阈值略小于第二数量阈值(两个数量阈值之间的差值很小),或第一时长阈值略小于第二时长阈值(两个时长阈值之间的差值很小),以保证在每次聚类之前都先更新目标聚类模型参数,使得每次聚类都可以基于更新后的目标聚类模型参数进行聚类,以提高每次聚类的准确性;且可以设置第一数量阈值和第二数量阈值均大于第三数量阈值,或设置第一时长阈值和第二时长阈值均大于第三时长阈值,以避免过于频繁的更新目标聚类模型参数和语音特征簇,因此,过于频繁的更新容易导致更新前后的两个目标聚类模型参数过于相似,进而导致系统资源的浪费,且过于频繁的更新也容易导致更新前后的语音特征簇没有太大的变化,进而导致系统资源的浪费;而对于用户语音模型,可以较为频繁的更新,以保证用户语音模型的准确性,使得用户的语音可以更快、更准确地匹配到正确的用户语音模型。
为了提高计算每个语音数据对应的i-vector的准确性,也可以定时或定量的更新GMM,随着时间的推移,所累计的历史语音数据越来越多,进而根据数量增加后的所有历史语音数据训练GMM,可以提高GMM的准确性,进而在更新GMM后,可以提高所计算出的i-vector的准确性。
在S401的步骤之前(如语音交互设备在出厂之前的阶段,即还未接到任何用户的语音),语音交互设备可以获取样本语音数据,并为所述样本语音数据设置对应的样本用户身份标签(即已知每一条样本语音 数据对应的说话人信息),再根据聚类算法性能参数最大化条件、所述样本语音数据与所述样本用户身份标签之间的对应关系,训练初始聚类模型参数,并将训练后的初始聚类模型参数确定为所述目标聚类模型参数。训练初始聚类模型参数的具体过程可以参见上述S408步骤中对当前的聚类模型参数进行更新的过程,这里不再进行赘述。在得到初始聚类模型参数后,可以根据初始聚类模型参数进行第一次聚类,并将初始聚类模型参数确定为目标聚类模型参数,此后即可定时或定量地对目标聚类模型参数进行更新。例如,获取20组含说话人实际身份标签(即样本用户身份标签)的唤醒词语音数据(即样本语音数据),每组包含10个说话人,每个说话人含10条唤醒词语音数据,从每组中随机选取7个说话人的唤醒词语音数据作为训练集,剩余3个说话人的唤醒词语音数据作为验证集;对于每组数据,提取唤醒词语音数据的i-vector并降维后,使用训练集训练一个DBSCAN聚类模型,训练目标为最大化JC;为避免训练过拟合,在训练过程中确定该聚类模型在验证集上的JC,选择使验证集JC值最大的Eps和MinPts参数作为初始聚类模型参数。
进一步的,请一并参见图5,是本申请实施例提供的一种参数更新方法的场景示意图。如图5所示,语音交互设备在首次聚类之前,可以先获取样本语音数据,并生成样本语音数据对应的i-vector(这里可以为降维后的i-vector),并根据样本语音数据对应的i-vector训练一个DBSCAN聚类模型,训练目标为最大化JC;为避免训练过拟合,在训练过程中确定该聚类模型在验证集上的JC,选择使验证集JC值最大的Eps和MinPts参数作为初始聚类模型参数,即初始化的Eps和MinPts。如图5所示,语音交互设备在首次聚类之前,可以先生成历史语音数据对应的i-vector(这里可以为降维后的i-vector),并根据初始化的Eps和MinPts对历史语音数据对应的i-vector进行DBSCAN聚类,进而可以根据聚类后所得到的语音特征簇进行高频用户发现和用户身份自动注册 (具体可以参见上述图4对应实施例中的S401-S407)。如图5所示,语音交互设备所得到的已绑定用户身份信息的用户语音模型可以包括用户语音模型a、用户语音模型b、用户语音模型c。语音交互设备还可以定时或定量的根据已绑定用户身份信息的用户语音模型对应的语音特征簇(如图5中的语音特征簇a、语音特征簇b、语音特征簇c),训练一个DBSCAN聚类模型,训练目标为最大化JC;为避免训练过拟合,在训练过程中确定该聚类模型在验证集上的JC,选择使验证集JC值最大的Eps和MinPts参数作为更新后的聚类模型参数,即更新后的Eps和MinPts(具体可以参见上述图4对应实施例中的S408)。进而语音交互设备在下一次聚类时,可以根据更新后的Eps和MinPts对历史语音数据(包含新增的历史语音数据)对应的i-vector进行DBSCAN聚类,得到如图5所示的语音特征簇1、语音特征簇2、……、语音特征簇n,并根据语音特征簇1、语音特征簇2、……、语音特征簇n,对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,并对未满足所述高频用户条件的语音特征簇进行替换(具体可以参见上述图4对应实施例中的S409)。此后,可以定时或定量根据已绑定用户身份信息的用户语音模型所对应的语音特征簇对Eps和MinPts进行更新,而且随着已绑定用户身份信息的用户语音模型的增加,可以逐步训练出更准确、合理的Eps和MinPts。其中,初始化的Eps和MinPts仅用于第一次聚类使用,此后的每一次聚类都使用最近一次更新后的Eps和MinPts。
以语音交互设备为智能音箱为例,本申请实施例对上述方案作了技术可行性验证。智能音箱通常不归属于某个特定的用户,由多个用户共同使用,但是其用户规模又十分有限。比如在家庭中使用的音箱设备,用户数目通常不超过10人;且在家庭中的成员,由于年龄、性别等方面的差异,其声纹特征的区分性比较明显。
首先,利用大规模的数据集合,随机从600人中不重复地抽取10 人作为一组,每人提供10句内容完全一样的唤醒词作为语音样本。本申请实施例组织了两组实验,分别用于验证上述的聚类方法的可行性以及高频用户发现的可行性。
其中,聚类方法的可行性验证过程可以为:随机生成10组数据(每一组数据包括不重复的10个人分别提供的10句内容完全一样的语音样本)作为训练集,每组中随机选取7人的语音数据用于训练模型参数(Eps,MinPts),训练目标为最大化JC,其余3人的语音数据用于验证以减轻模型过拟合;随机生成10组数据作为测试集,测试训练得到的聚类模型的性能,具体基于JC和RI衡量聚类模型的性能。请一并参见图6,是本申请实施例提供的一种性能验证结果的示意图,如图6所示,10组测试集(如图6中的group1-group10)中的JC和RI都较高,即表明聚类模型具有较高的性能,因此,本申请实施例中的聚类方法具有可行性。
其中,高频用户发现的可行性验证过程可以为:首先,获取上述聚类方法的可行性验证过程中的10组测试集,对于每组测试集,在聚类及高频用户发现完成后,将发现到的高频用户所在的语音特征簇的类别设定为所在语音特征簇中出现次数最多的语音样本的类别。此时,对于每一组测试集,均可以确定该测试集中每一个发现的满足高频用户条件的语音特征簇的查准率(Precision)和查全率(Recall),以所有满足高频用户条件的语音特征簇的查准率和查全率的均值来表示高频用户发现算法在该测试集上的性能;其中,查准率和查全率越高,表明所发现的高频簇越精确。请一并参见图7,是本申请实施例提供的另一种性能验证结果的示意图,如图7所示,10个测试集(如图7中的group1-group10)中的Precision和Recall都较高,即表明高频用户发现算法具有较高的性能,因此,本申请实施例中的高频用户发现具有可行性。
请参见图8,是本申请实施例提供的一种语音数据处理装置的结构示意图。如图8所示,该语音数据处理装置1可以应用于上述图3或图 4对应实施例中的语音交互设备,该语音数据处理装置1可以包括:聚类模块10、第一训练模块20、请求发起模块30、绑定模块40;
聚类模块10,获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语音特征簇包含至少一个特征相似的历史语音特征向量;
第一训练模块20,用于若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
其中,所述第一训练模块20可以具体用于对所述语音特征簇所包含的所述历史语音特征向量进行均值计算或插值计算,得到目标历史语音特征向量,并将所述目标历史语音特征向量作为所述语音特征簇对应的用户语音模型的模型参数。
请求发起模块30,用于若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
绑定模块40,用于若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定。
其中,聚类模块10、第一训练模块20、请求发起模块30、绑定模块40的具体功能实现方式可以参见上述图3对应实施例中的S301-S304,这里不再进行赘述。
如图8所示,聚类模块10可以包括:获取训练单元101、向量处理单元102、聚类单元103;
获取训练单元101,用于获取所有历史语音数据,并根据所述所有历史语音数据训练高斯混合模型和全局差异空间矩阵;
向量处理单元102,用于根据所述高斯混合模型和所述全局差异空间矩阵将所述所有历史语音数据投影至向量空间,生成每个历史语音数 据分别对应的历史语音特征向量,并对所述历史语音特征向量进行降维;
聚类单元103,用于根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到所述语音特征簇。
其中,所述目标聚类模型参数包括:密度领域半径和核心样本阈值。
其中,获取训练单元101、向量处理单元102、聚类单元103的具体功能实现方式可以参见上述图4对应实施例中的S401-S403,这里不再进行赘述。
进一步的,如图8所示,所述聚类单元103可以包括:查找子单元1031、聚类子单元1032、通知子单元1033;
查找子单元1031,用于以所有降维后的历史语音特征向量为样本点生成包含所述样本点的样本数据集,并根据所述密度领域半径和所述核心样本阈值在所述样本数据集中查找所有为核心点的样本点;
聚类子单元1032,用于在所有核心点中确定任意一个核心点为出发点,并在所述样本数据集中查找与所述出发点具有密度可达关系的所有样本点,作为可达样本点,并生成包含所述出发点和所有所述可达样本点的语音特征簇;
通知子单元1033,用于将所有核心点中的下一个核心点确定为所述出发点,并通知所述聚类子单元1032生成所述出发点对应的所述语音特征簇,直至所有核心点均被确定为所述出发点。
其中,查找子单元1031、聚类子单元1032、通知子单元1033的具体功能实现方式可以参见上述图4对应实施例中的S403,这里不再进行赘述。
如图8所示,该语音数据处理装置1还可以包括:获取计算模块50、条件确定模块60、样本设置模块70、第二训练模块80、第一更新模块90、第二更新模块100、第三更新模块110;
获取计算模块50,用于获取所述语音特征簇所包含的所述历史语音 特征向量的数量,并根据所述语音特征簇所包含的所述历史语音特征向量的数量,以及所述语音特征簇所包含的所述历史语音特征向量,确定所述语音特征簇对应的类内散度;
条件确定模块60,用于若所述语音特征簇所包含的所述历史语音特征向量的数量大于系统数量阈值,且所述类内散度小于系统类内散度阈值,则确定所述语音特征簇满足高频用户条件。
其中,获取计算模块50、条件确定模块60的具体功能实现方式可以参见上述图4对应实施例中的S404-S405,这里不再进行赘述。
样本设置模块70,用于获取样本语音数据,并为所述样本语音数据设置对应的样本用户身份标签;
第二训练模块80,用于根据聚类算法性能参数最大化条件、所述样本语音数据与所述样本用户身份标签之间的对应关系,训练初始聚类模型参数,并将训练后的初始聚类模型参数确定为所述目标聚类模型参数。
其中,样本设置模块70、第二训练模块80的具体功能实现方式可以参见上述图4对应实施例中对初始化聚类模型参数的过程,这里不再进行赘述。
第一更新模块90,用于若聚类后所累计新增的历史语音数据的数量达到第一数量阈值,或聚类后所累计时长达到第一时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第一历史语音特征向量,并根据聚类算法性能参数最大化条件、所述第一历史语音特征向量与已绑定的所述用户身份信息之间的对应关系,更新当前的聚类模型参数,得到所述目标聚类模型参数。
其中,第一更新模块90的具体功能实现方式可以参见上述图4对应实施例中的S408,这里不再进行赘述。
第二更新模块100,用于若聚类后所累计新增的历史语音数据的数量达到第二数量阈值,或聚类后所累计时长达到第二时长阈值,则获取 与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,以及与所有用户语音模型均不匹配的历史语音特征向量,作为第二历史语音特征向量,并对所述第二历史语音特征向量进行聚类,得到当前生成的语音特征簇,并根据所述当前生成的语音特征簇对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,并对未满足所述高频用户条件的语音特征簇进行替换。
其中,第二更新模块100的具体功能实现方式可以参见上述图4对应实施例中的S409,这里不再进行赘述。
第三更新模块110,用于若聚类后所累计新增的历史语音数据的数量达到第三数量阈值,或聚类后所累计时长达到第三时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第三历史语音特征向量,并根据所述第三历史语音特征向量更新已绑定所述用户身份信息的用户语音模型;
所述第三更新模块110,还用于获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第四历史语音特征向量,并根据所述第四历史语音特征向量更新未绑定所述用户身份信息的用户语音模型。
其中,第三更新模块110的具体功能实现方式可以参见上述图4对应实施例中的S410,这里不再进行赘述。
请参见图9,是本申请实施例提供的一种语音交互设备的结构示意图。如图9所示,所述语音交互设备1000可以为上述图3或图4对应实施例中的语音交互设备,所述语音交互设备1000可以包括:处理器1001,网络接口1004和存储器1005,此外,所述语音交互设备1000还可以包括:用户接口1003,和至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,用户接口1003可以包括显示屏(Display)、键盘(Keyboard),可选用户接口1003还可以 包括标准的有线接口、无线接口。网络接口1004可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1005还可以是至少一个位于远离前述处理器1001的存储装置。如图9所示,作为一种计算机存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在图9所示的语音交互设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以实现图3至图4所述的语音数据处理方法。
应当理解,本申请实施例中所描述的语音交互设备1000可执行前文图3到图4所对应实施例的所述语音数据处理方法,也可包括前文图8对应实施例中的所述语音数据处理装置1,在此不再赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机存储介质,且所述计算机存储介质中存储有前文提及的语音数据处理装置1所执行的计算机程序,且所述计算机程序包括程序指令,当所述处理器执行所述程序指令时,能够执行前文图3到图4所对应实施例中的所述语音数据处理方法,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random  Access Memory,RAM)等。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (19)

  1. 一种语音数据处理方法,由语音交互设备执行,包括:
    获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语音特征簇包含至少一个特征相似的历史语音特征向量;
    若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
    若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
    若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定。
  2. 如权利要求1所述的方法,还包括:
    获取所述语音特征簇所包含的所述历史语音特征向量的数量,并根据所述语音特征簇所包含的所述历史语音特征向量的数量,以及所述语音特征簇所包含的所述历史语音特征向量,确定所述语音特征簇对应的类内散度;
    若所述语音特征簇所包含的所述历史语音特征向量的数量大于系统数量阈值,且所述类内散度小于系统类内散度阈值,则确定所述语音特征簇满足高频用户条件。
  3. 如权利要求1所述的方法,所述获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇,包括:
    获取所有历史语音数据,并根据所述所有历史语音数据训练高斯混合模型和全局差异空间矩阵;
    根据所述高斯混合模型和所述全局差异空间矩阵将所述所有历史语音数据投影至向量空间,生成每个历史语音数据分别对应的历史语音特征向量,并对所述历史语音特征向量进行降维;
    根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到所述语音特征簇。
  4. 如权利要求3所述的方法,所述目标聚类模型参数包括:密度领域半径和核心样本阈值;
    所述根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到所述语音特征簇,包括:
    以所有降维后的历史语音特征向量为样本点生成包含所述样本点的样本数据集,并根据所述密度领域半径和所述核心样本阈值在所述样本数据集中查找所有为核心点的样本点;
    在所有核心点中确定任意一个核心点为出发点,并在所述样本数据集中查找与所述出发点具有密度可达关系的所有样本点,作为可达样本点,并生成包含所述出发点和所有所述可达样本点的语音特征簇,并将所有核心点中的下一个核心点确定为所述出发点,重复执行本步骤,直至所有核心点均被确定为所述出发点。
  5. 如权利要求1所述的方法,所述根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型,具体包括:
    对所述语音特征簇所包含的所述历史语音特征向量进行均值确定或插值确定,得到目标历史语音特征向量,并将所述目标历史语音特征向量作为所述语音特征簇对应的用户语音模型的模型参数。
  6. 如权利要求3所述的方法,还包括:
    获取样本语音数据,并为所述样本语音数据设置对应的样本用户身份标签;
    根据聚类算法性能参数最大化条件、所述样本语音数据与所述样本用户身份标签之间的对应关系,训练初始聚类模型参数,并将训练后的初始聚类模型参数确定为所述目标聚类模型参数。
  7. 如权利要求3所述的方法,还包括:
    若聚类后所累计新增的历史语音数据的数量达到第一数量阈值,或聚类后所累计时长达到第一时长阈值,则获取与已绑定所述用户身份信 息的用户语音模型相匹配的所有历史语音特征向量,作为第一历史语音特征向量;
    根据聚类算法性能参数最大化条件、所述第一历史语音特征向量与已绑定的所述用户身份信息之间的对应关系,更新当前的聚类模型参数,得到所述目标聚类模型参数。
  8. 如权利要求3所述的方法,还包括:
    若聚类后所累计新增的历史语音数据的数量达到第二数量阈值,或聚类后所累计时长达到第二时长阈值,则获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,以及与所有用户语音模型均不匹配的历史语音特征向量,作为第二历史语音特征向量,并对所述第二历史语音特征向量进行聚类,得到当前生成的语音特征簇;
    根据所述当前生成的语音特征簇对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,并对未满足所述高频用户条件的语音特征簇进行替换。
  9. 如权利要求3所述的方法,还包括:
    若聚类后所累计新增的历史语音数据的数量达到第三数量阈值,或聚类后所累计时长达到第三时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第三历史语音特征向量,并根据所述第三历史语音特征向量更新已绑定所述用户身份信息的用户语音模型;
    获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第四历史语音特征向量,并根据所述第四历史语音特征向量更新未绑定所述用户身份信息的用户语音模型。
  10. 一种语音交互设备,包括:处理器、存储器;
    所述处理器与存储器相连,其中,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行以下操作:
    获取历史语音数据,并获取所述历史语音数据对应的历史语音特征向量,并对所述历史语音特征向量进行聚类,得到语音特征簇;所述语 音特征簇包含至少一个特征相似的历史语音特征向量;
    若所述语音特征簇满足高频用户条件,则根据所述语音特征簇所包含的所述历史语音特征向量训练对应的用户语音模型;
    若检测到当前语音数据的当前语音特征向量与所述用户语音模型相匹配,则发起与所述当前语音数据相关联的用户身份关联请求;
    若接收到与所述用户身份关联请求对应的响应消息,则将所述响应消息中的用户身份信息与所述用户语音模型进行绑定。
  11. 如权利要求10所述的语音交互设备,所述处理器进一步用于执行以下操作:
    获取所述语音特征簇所包含的所述历史语音特征向量的数量,并根据所述语音特征簇所包含的所述历史语音特征向量的数量,以及所述语音特征簇所包含的所述历史语音特征向量,确定所述语音特征簇对应的类内散度;
    若所述语音特征簇所包含的所述历史语音特征向量的数量大于系统数量阈值,且所述类内散度小于系统类内散度阈值,则确定所述语音特征簇满足高频用户条件。
  12. 如权利要求10所述的语音交互设备,所述处理器进一步用于执行以下操作:
    获取所有历史语音数据,并根据所述所有历史语音数据训练高斯混合模型和全局差异空间矩阵;
    根据所述高斯混合模型和所述全局差异空间矩阵将所述所有历史语音数据投影至向量空间,生成每个历史语音数据分别对应的历史语音特征向量,并对所述历史语音特征向量进行降维;
    根据目标聚类模型参数对降维后的历史语音特征向量进行聚类,得到所述语音特征簇。
  13. 如权利要求12所述的语音交互设备,所述目标聚类模型参数包括:密度领域半径和核心样本阈值;
    所述处理器进一步用于执行以下操作:
    以所有降维后的历史语音特征向量为样本点生成包含所述样本点的样本数据集,并根据所述密度领域半径和所述核心样本阈值在所述样本数据集中查找所有为核心点的样本点;
    在所有核心点中确定任意一个核心点为出发点,并在所述样本数据集中查找与所述出发点具有密度可达关系的所有样本点,作为可达样本点,并生成包含所述出发点和所有所述可达样本点的语音特征簇;
    将所有核心点中的下一个核心点确定为所述出发点,并通知所述聚类子单元生成所述出发点对应的所述语音特征簇,直至所有核心点均被确定为所述出发点。
  14. 如权利要求10所述的语音交互设备,所述处理器进一步用于执行以下操作:
    对所述语音特征簇所包含的所述历史语音特征向量进行均值计算或插值计算,得到目标历史语音特征向量,并将所述目标历史语音特征向量作为所述语音特征簇对应的用户语音模型的模型参数。
  15. 如权利要求12所述的语音交互设备,所述处理器进一步用于执行以下操作:
    获取样本语音数据,并为所述样本语音数据设置对应的样本用户身份标签;
    根据聚类算法性能参数最大化条件、所述样本语音数据与所述样本用户身份标签之间的对应关系,训练初始聚类模型参数,并将训练后的初始聚类模型参数确定为所述目标聚类模型参数。
  16. 如权利要求12所述的语音交互设备,所述处理器进一步用于执行以下操作:
    若聚类后所累计新增的历史语音数据的数量达到第一数量阈值,或聚类后所累计时长达到第一时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第一历史语音特征向量;
    根据聚类算法性能参数最大化条件、所述第一历史语音特征向量与 已绑定的所述用户身份信息之间的对应关系,更新当前的聚类模型参数,得到所述目标聚类模型参数。
  17. 如权利要求12所述的语音交互设备,所述处理器进一步用于执行以下操作:
    若聚类后所累计新增的历史语音数据的数量达到第二数量阈值,或聚类后所累计时长达到第二时长阈值,则获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,以及与所有用户语音模型均不匹配的历史语音特征向量,作为第二历史语音特征向量,并对所述第二历史语音特征向量进行聚类,得到当前生成的语音特征簇;
    根据所述当前生成的语音特征簇对未绑定所述用户身份信息的用户语音模型所对应的语音特征簇进行更新,并对未满足所述高频用户条件的语音特征簇进行替换。
  18. 如权利要求12所述的语音交互设备,所述处理器进一步用于执行以下操作:
    若聚类后所累计新增的历史语音数据的数量达到第三数量阈值,或聚类后所累计时长达到第三时长阈值,则获取与已绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第三历史语音特征向量,并根据所述第三历史语音特征向量更新已绑定所述用户身份信息的用户语音模型;
    获取与未绑定所述用户身份信息的用户语音模型相匹配的所有历史语音特征向量,作为第四历史语音特征向量,并根据所述第四历史语音特征向量更新未绑定所述用户身份信息的用户语音模型。
  19. 一种计算机存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,当处理器执行所述程序指令时执行如权利要求1-9任一项所述的方法。
PCT/CN2018/116590 2017-11-24 2018-11-21 一种语音数据处理方法、语音交互设备及存储介质 WO2019101083A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/600,421 US11189263B2 (en) 2017-11-24 2019-10-11 Voice data processing method, voice interaction device, and storage medium for binding user identity with user voice model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711191651.3 2017-11-24
CN201711191651.3A CN107978311B (zh) 2017-11-24 2017-11-24 一种语音数据处理方法、装置以及语音交互设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/600,421 Continuation US11189263B2 (en) 2017-11-24 2019-10-11 Voice data processing method, voice interaction device, and storage medium for binding user identity with user voice model

Publications (1)

Publication Number Publication Date
WO2019101083A1 true WO2019101083A1 (zh) 2019-05-31

Family

ID=62011555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116590 WO2019101083A1 (zh) 2017-11-24 2018-11-21 一种语音数据处理方法、语音交互设备及存储介质

Country Status (3)

Country Link
US (1) US11189263B2 (zh)
CN (1) CN107978311B (zh)
WO (1) WO2019101083A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161732A (zh) * 2019-12-30 2020-05-15 秒针信息技术有限公司 语音采集方法、装置、电子设备及存储介质
CN111612499A (zh) * 2020-04-03 2020-09-01 浙江口碑网络技术有限公司 信息的推送方法及装置、存储介质、终端

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
BR112015018905B1 (pt) 2013-02-07 2022-02-22 Apple Inc Método de operação de recurso de ativação por voz, mídia de armazenamento legível por computador e dispositivo eletrônico
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
DK201770427A1 (en) 2017-05-12 2018-12-20 Apple Inc. LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
CN107978311B (zh) 2017-11-24 2020-08-25 腾讯科技(深圳)有限公司 一种语音数据处理方法、装置以及语音交互设备
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11893999B1 (en) * 2018-05-13 2024-02-06 Amazon Technologies, Inc. Speech based user recognition
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
CN108922544B (zh) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 通用向量训练方法、语音聚类方法、装置、设备及介质
US10965553B2 (en) * 2018-08-20 2021-03-30 Arbor Networks, Inc. Scalable unsupervised host clustering based on network metadata
KR102637339B1 (ko) * 2018-08-31 2024-02-16 삼성전자주식회사 음성 인식 모델을 개인화하는 방법 및 장치
CN110888036B (zh) * 2018-09-07 2022-02-15 长鑫存储技术有限公司 测试项目确定方法及装置、存储介质和电子设备
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
CN111179940A (zh) * 2018-11-12 2020-05-19 阿里巴巴集团控股有限公司 一种语音识别方法、装置及计算设备
CN110164431B (zh) * 2018-11-15 2023-01-06 腾讯科技(深圳)有限公司 一种音频数据处理方法及装置、存储介质
KR102655628B1 (ko) * 2018-11-22 2024-04-09 삼성전자주식회사 발화의 음성 데이터를 처리하는 방법 및 장치
US11875790B2 (en) * 2019-03-01 2024-01-16 Google Llc Dynamically adapting assistant responses
CN111669353A (zh) * 2019-03-08 2020-09-15 顺丰科技有限公司 钓鱼网站检测方法及系统
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11200886B2 (en) * 2019-04-02 2021-12-14 Accenture Global Solutions Limited System and method for training a virtual agent to identify a user's intent from a conversation
CN110085209B (zh) * 2019-04-11 2021-07-23 广州多益网络股份有限公司 一种音色筛选方法及装置
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110689894B (zh) * 2019-08-15 2022-03-29 深圳市声扬科技有限公司 自动注册方法及装置、智能设备
CN110600040B (zh) * 2019-09-19 2021-05-25 北京三快在线科技有限公司 声纹特征注册方法、装置、计算机设备及存储介质
CN110991517A (zh) * 2019-11-28 2020-04-10 太原理工大学 一种面向脑卒中非平衡数据集的分类方法及系统
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11508380B2 (en) * 2020-05-26 2022-11-22 Apple Inc. Personalized voices for text messaging
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
TWI807203B (zh) * 2020-07-28 2023-07-01 華碩電腦股份有限公司 聲音辨識方法及使用其之電子裝置
KR20220040875A (ko) * 2020-09-24 2022-03-31 삼성전자주식회사 음성 인식 서비스를 위한 등록 사용자에 대한 화자 인증 학습 장치 및 그 동작 방법
CN112468377B (zh) * 2020-10-23 2023-02-24 和美(深圳)信息技术股份有限公司 智能语音设备的控制方法及系统
US20220148600A1 (en) * 2020-11-11 2022-05-12 Rovi Guides, Inc. Systems and methods for detecting a mimicked voice input signal
CN112420078B (zh) * 2020-11-18 2022-12-30 青岛海尔科技有限公司 一种监听方法、装置、存储介质及电子设备
US20220199078A1 (en) * 2020-12-22 2022-06-23 Samsung Electronics Co., Ltd. Electronic apparatus, system comprising electronic apparatus and server and controlling method thereof
CN112835853B (zh) * 2020-12-31 2024-03-22 北京聚云科技有限公司 一种数据处理类型确定方法及装置
CN113448975B (zh) * 2021-05-26 2023-01-17 科大讯飞股份有限公司 一种人物画像库的更新方法、装置、系统和存储介质
CN114694650A (zh) * 2022-03-29 2022-07-01 青岛海尔科技有限公司 智能设备的控制方法和装置、存储介质及电子设备
CN116110389A (zh) * 2023-01-06 2023-05-12 黄冈师范学院 基于自学习技术的互联网电器控制方法及系统
CN116913258B (zh) * 2023-09-08 2023-11-24 鹿客科技(北京)股份有限公司 语音信号识别方法、装置、电子设备和计算机可读介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337241A (zh) * 2013-06-09 2013-10-02 北京云知声信息技术有限公司 一种语音识别方法和装置
CN103903621A (zh) * 2012-12-26 2014-07-02 联想(北京)有限公司 一种语音识别的方法及电子设备
EP3057092A1 (en) * 2015-02-11 2016-08-17 Hand Held Products, Inc. Methods for training a speech recognition system
CN105869645A (zh) * 2016-03-25 2016-08-17 腾讯科技(深圳)有限公司 语音数据处理方法和装置
CN107978311A (zh) * 2017-11-24 2018-05-01 腾讯科技(深圳)有限公司 一种语音数据处理方法、装置以及语音交互设备

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60213595T2 (de) * 2001-05-10 2007-08-09 Koninklijke Philips Electronics N.V. Hintergrundlernen von sprecherstimmen
US7620547B2 (en) * 2002-07-25 2009-11-17 Sony Deutschland Gmbh Spoken man-machine interface with speaker identification
EP2048656B1 (en) * 2007-10-10 2010-02-10 Harman/Becker Automotive Systems GmbH Speaker recognition
US9305553B2 (en) * 2010-04-28 2016-04-05 William S. Meisel Speech recognition accuracy improvement through speaker categories
US9318114B2 (en) * 2010-11-24 2016-04-19 At&T Intellectual Property I, L.P. System and method for generating challenge utterances for speaker verification
GB2514943A (en) * 2012-01-24 2014-12-10 Auraya Pty Ltd Voice authentication and speech recognition system and method
US9042867B2 (en) * 2012-02-24 2015-05-26 Agnitio S.L. System and method for speaker recognition on mobile devices
US20140222423A1 (en) * 2013-02-07 2014-08-07 Nuance Communications, Inc. Method and Apparatus for Efficient I-Vector Extraction
US9368109B2 (en) * 2013-05-31 2016-06-14 Nuance Communications, Inc. Method and apparatus for automatic speaker-based speech clustering
KR102222318B1 (ko) * 2014-03-18 2021-03-03 삼성전자주식회사 사용자 인식 방법 및 장치
CN105096121B (zh) * 2015-06-25 2017-07-25 百度在线网络技术(北京)有限公司 声纹认证方法和装置
CN106340298A (zh) * 2015-07-06 2017-01-18 南京理工大学 融合内容识别和话者识别的声纹解锁方法
JP6580911B2 (ja) * 2015-09-04 2019-09-25 Kddi株式会社 音声合成システムならびにその予測モデル学習方法および装置
CN105681920B (zh) * 2015-12-30 2017-03-15 深圳市鹰硕音频科技有限公司 一种具有语音识别功能的网络教学方法及系统
CN106971729A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 一种基于声音特征范围提高声纹识别速度的方法及系统
CN106782564B (zh) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 用于处理语音数据的方法和装置
CN106782569A (zh) * 2016-12-06 2017-05-31 深圳增强现实技术有限公司 一种基于声纹注册的增强现实方法及装置
CN106847292B (zh) * 2017-02-16 2018-06-19 平安科技(深圳)有限公司 声纹识别方法及装置
CN106886599B (zh) * 2017-02-28 2020-03-03 北京京东尚科信息技术有限公司 图像检索方法以及装置
CN107147618B (zh) * 2017-04-10 2020-05-15 易视星空科技无锡有限公司 一种用户注册方法、装置及电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903621A (zh) * 2012-12-26 2014-07-02 联想(北京)有限公司 一种语音识别的方法及电子设备
CN103337241A (zh) * 2013-06-09 2013-10-02 北京云知声信息技术有限公司 一种语音识别方法和装置
EP3057092A1 (en) * 2015-02-11 2016-08-17 Hand Held Products, Inc. Methods for training a speech recognition system
CN105869645A (zh) * 2016-03-25 2016-08-17 腾讯科技(深圳)有限公司 语音数据处理方法和装置
CN107978311A (zh) * 2017-11-24 2018-05-01 腾讯科技(深圳)有限公司 一种语音数据处理方法、装置以及语音交互设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161732A (zh) * 2019-12-30 2020-05-15 秒针信息技术有限公司 语音采集方法、装置、电子设备及存储介质
CN111612499A (zh) * 2020-04-03 2020-09-01 浙江口碑网络技术有限公司 信息的推送方法及装置、存储介质、终端
CN111612499B (zh) * 2020-04-03 2023-07-28 浙江口碑网络技术有限公司 信息的推送方法及装置、存储介质、终端

Also Published As

Publication number Publication date
CN107978311B (zh) 2020-08-25
US20200043471A1 (en) 2020-02-06
US11189263B2 (en) 2021-11-30
CN107978311A (zh) 2018-05-01

Similar Documents

Publication Publication Date Title
WO2019101083A1 (zh) 一种语音数据处理方法、语音交互设备及存储介质
CN107886949B (zh) 一种内容推荐方法及装置
US11321535B2 (en) Hierarchical annotation of dialog acts
US11189277B2 (en) Dynamic gazetteers for personalized entity recognition
KR20190120353A (ko) 음성 인식 방법, 디바이스, 장치, 및 저장 매체
JP2019079034A (ja) 自己学習自然言語理解を伴うダイアログ・システム
CN111261151B (zh) 一种语音处理方法、装置、电子设备及存储介质
WO2020056621A1 (zh) 一种意图识别模型的学习方法、装置及设备
JP6224857B1 (ja) 分類装置、分類方法および分類プログラム
US20140207716A1 (en) Natural language processing method and system
JP2019040166A (ja) 音声合成辞書配信装置、音声合成配信システムおよびプログラム
CN108111399B (zh) 消息处理的方法、装置、终端及存储介质
JP6370962B1 (ja) 生成装置、生成方法および生成プログラム
JP2018151578A (ja) 決定装置、決定方法および決定プログラム
CN112230874A (zh) 电子装置及其控制方法
CN114925163A (zh) 一种智能设备及意图识别的模型训练方法
WO2021098175A1 (zh) 录制语音包功能的引导方法、装置、设备和计算机存储介质
WO2023093280A1 (zh) 语音控制方法、装置、电子设备及存储介质
CN111241106A (zh) 近似数据处理方法、装置、介质及电子设备
WO2021098876A1 (zh) 一种基于知识图谱的问答方法及装置
US11645468B2 (en) User data processing
EP3401795A1 (en) Classifying conversational services
US20220180865A1 (en) Runtime topic change analyses in spoken dialog contexts
US11514920B2 (en) Method and system for determining speaker-user of voice-controllable device
US11907676B1 (en) Processing orchestration for systems including distributed components

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18880991

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18880991

Country of ref document: EP

Kind code of ref document: A1