CN113066482A - Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium


Info

Publication number
CN113066482A
Authority
CN
China
Prior art keywords
voice
data
model
user
equipment
Prior art date
Legal status
Pending
Application number
CN201911285907.6A
Other languages
Chinese (zh)
Inventor
史鹏腾
万玉龙
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201911285907.6A
Publication of CN113066482A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces

Abstract

Embodiments of the present application provide a voice model updating method, a voice data processing method, a voice model updating device, a voice data processing device, and a storage medium. In some embodiments of the application, the speech model is trained with the user's own speech data, so that a new speech model adapted to the user is obtained; because the user is then served by a model adapted to them, the accuracy of the model's results improves.

Description

Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a speech model updating method, a speech data processing method, a device, and a storage medium.
Background
Neural network models are the basis of artificial intelligence. With the development of artificial intelligence, more and more products use neural network models. These models can be divided into two types, supervised and unsupervised. A supervised neural network model uses labeled data to guide the training process and can therefore achieve better results.
At present, most mainstream voice products and hardware rely on a single large model that the service provider deploys uniformly in the cloud. When a large number of users use these products, their requests reach the cloud concurrently, so the cloud must provision very large resources and run highly concurrent, highly reliable speech engines and services.
Training such a large cloud model generally has to accommodate the language characteristics and habits of all users at once, so the model suffers from low accuracy and similar problems in use.
Disclosure of Invention
Aspects of the present application provide a speech model updating method, a data processing method, devices, and a storage medium, which are used to improve the efficiency of model training and the accuracy of the model in use.
An embodiment of the present application provides a voice model updating method applicable to a voice device, the method comprising the following steps:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains the voice model currently used by the voice device according to the labeled data set and obtains a new voice model for providing voice services to the user;
and receiving the new voice model issued by the server, and updating the currently used voice model with the new voice model.
An embodiment of the present application further provides a voice model updating method applicable to a server, comprising:
receiving, by the server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update its currently used voice model.
An embodiment of the present application further provides a speech device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains the voice model currently used by the voice device according to the labeled data set and obtains a new voice model for providing voice services to the user;
and receiving the new voice model issued by the server, and updating the currently used voice model with the new voice model.
Embodiments of the present application provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains the voice model currently used by the voice device according to the labeled data set and obtains a new voice model for providing voice services to the user;
and receiving the new voice model issued by the server, and updating the currently used voice model with the new voice model.
An embodiment of the present application further provides a server, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
receiving, by the server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update its currently used voice model.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
receiving, by the server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update its currently used voice model.
An embodiment of the present application further provides a voice model updating method applicable to a voice device, comprising:
labeling voice data of a user using the voice device to obtain a labeled data set;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
An embodiment of the present application further provides a speech device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
labeling voice data of a user using the voice device to obtain a labeled data set;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
labeling voice data of a user using the voice device to obtain a labeled data set;
training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
The embodiment of the present application further provides a method for updating a speech model, which is applicable to a speech device, and includes:
displaying a first interface in which a labeled data set is displayed, wherein the labeled data set is obtained by labeling voice data of a user using the voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server, so that the server trains the voice model currently used by the voice device according to the training sample set and obtains a new voice model for providing voice services to the user.
An embodiment of the present application further provides a speech device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
displaying a first interface in which a labeled data set is displayed, wherein the labeled data set is obtained by labeling voice data of a user using the voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server, so that the server trains the voice model currently used by the voice device according to the training sample set and obtains a new voice model for providing voice services to the user.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
displaying a first interface in which a labeled data set is displayed, wherein the labeled data set is obtained by labeling voice data of a user using the voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server, so that the server trains the voice model currently used by the voice device according to the training sample set and obtains a new voice model for providing voice services to the user.
An embodiment of the present application further provides a method for processing voice data, which is applicable to a first voice device, and includes:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to the user's triggering of the recording device switching control, sending a voice input message to a second voice device so that the user can enter voice data on the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving the voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
The embodiment of the present application further provides a voice data processing method, which is applicable to a second voice device, and includes:
receiving a voice input message sent by a first voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
in response to a triggering operation on the voice input message, displaying a voice input page comprising a recording control;
in response to a triggering operation on the recording control, acquiring voice data of the user;
and sending the voice data to the first voice device, which labels the voice data to obtain a labeled data set.
An embodiment of the present application further provides a speech device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to the user's triggering of the recording device switching control, sending a voice input message to a second voice device so that the user can enter voice data on the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving the voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to the user's triggering of the recording device switching control, sending a voice input message to a second voice device so that the user can enter voice data on the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving the voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
An embodiment of the present application further provides a speech device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
receiving a voice input message sent by a first voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
in response to a triggering operation on the voice input message, displaying a voice input page comprising a recording control;
in response to a triggering operation on the recording control, acquiring voice data of the user;
and sending the voice data to the first voice device, which labels the voice data to obtain a labeled data set.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform actions comprising:
receiving a voice input message sent by a first voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
in response to a triggering operation on the voice input message, displaying a voice input page comprising a recording control;
in response to a triggering operation on the recording control, acquiring voice data of the user;
and sending the voice data to the first voice device, which labels the voice data to obtain a labeled data set.
The embodiment of the present application further provides a voice data processing method, which is applicable to a voice recognition device, and the method includes:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is pre-trained based on speech data of a user of the speech recognition device.
An embodiment of the present application further provides a speech recognition device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is pre-trained based on speech data of a user of the speech recognition device.
The embodiment of the present application further provides a voice data processing method, which is applicable to a voice wake-up device, and the method includes:
acquiring voice data of a user;
extracting a voice segment containing the wake-up word from the voice data by using a local voice wake-up model, wherein the local voice wake-up model is pre-trained based on voice data of a user of the voice wake-up device;
and if the wake-up word segment satisfies the wake-up condition, performing a voice wake-up operation.
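By way of illustration only, the sketch below (Python is used for all examples in this description) shows one way the wake-up check could be organized. The model object, its extract_wake_segment method, and the threshold value are assumptions introduced for the example, not details disclosed by this application.

```python
# Hypothetical sketch of the local wake-up flow (all names are illustrative).
WAKE_THRESHOLD = 0.8  # assumed confidence required by the wake-up condition

def perform_wake_operation():
    print("device woken up")  # stand-in for powering up the assistant

def try_wake(audio_frame, wake_model) -> bool:
    """Extract a candidate wake-up-word segment and wake the device if it qualifies."""
    segment = wake_model.extract_wake_segment(audio_frame)  # may return None
    if segment is not None and segment.confidence >= WAKE_THRESHOLD:
        perform_wake_operation()
        return True
    return False
```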
An embodiment of the present application further provides a voice wake-up device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a user;
extracting a voice segment containing the wake-up word from the voice data by using a local voice wake-up model, wherein the local voice wake-up model is pre-trained based on voice data of a user of the voice wake-up device;
and if the wake-up word segment satisfies the wake-up condition, performing a voice wake-up operation.
The embodiment of the present application further provides a voice data processing method, which is applicable to a speech synthesis device, and the method includes:
acquiring a text to be synthesized;
synthesizing the text into voice data by using a local speech synthesis model, wherein the local speech synthesis model is pre-trained based on voice data of a specific user;
and playing the voice data.
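A minimal sketch of this synthesis flow follows; the locally trained model and the audio output are represented by plain callables, both of which are assumptions for illustration.

```python
# Hypothetical sketch of the local synthesis flow; `local_tts` stands in for
# the model pre-trained on a specific user's voice, `play` for audio output.
def speak(text: str, local_tts, play) -> None:
    voice_data = local_tts(text)  # synthesize with the locally adapted model
    play(voice_data)              # play through the device speaker

# Usage with trivial stand-ins:
speak("hello", local_tts=lambda t: t.encode(), play=lambda audio: None)
```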
An embodiment of the present application further provides a speech synthesis apparatus, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring a text to be synthesized;
synthesizing the text into voice data by using a local speech synthesis model, wherein the local speech synthesis model is pre-trained based on voice data of a specific user;
and playing the voice data.
The embodiment of the present application further provides a voice data processing method, which is applicable to a voiceprint recognition device, and the method includes:
acquiring voice data of a first user;
performing voiceprint recognition on the voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is pre-trained based on voice data of a specific user;
judging, according to the voiceprint recognition result, whether the first user is a user of the voiceprint recognition device;
and if so, putting the voiceprint recognition device into its use mode.
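The gate described above could look like the following sketch; the similarity scorer and the threshold value are assumed stand-ins for the locally trained voiceprint model, not details from this application.

```python
# Hypothetical sketch of the voiceprint gate (all names are illustrative).
SIMILARITY_THRESHOLD = 0.75  # assumed decision threshold

def voiceprint_gate(voice_data, score_against_owner, enter_use_mode) -> bool:
    """Unlock the device only if the speaker matches the enrolled user."""
    score = score_against_owner(voice_data)  # similarity in [0.0, 1.0]
    if score >= SIMILARITY_THRESHOLD:
        enter_use_mode()
        return True
    return False
```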
An embodiment of the present application further provides a voiceprint recognition device, including: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a first user;
performing voiceprint recognition on the voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is pre-trained based on voice data of a specific user;
judging, according to the voiceprint recognition result, whether the first user is a user of the voiceprint recognition device;
and if so, putting the voiceprint recognition device into its use mode.
The embodiment of the present application further provides a voice data processing method, which is applicable to a voice recognition device, and the method includes:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is trained based on speech data of selected speech sub-modules of the speech recognition device.
The embodiment of the present application further provides a model updating method, which is applicable to a speech recognition device, and the method includes:
acquiring voice data for a set period;
determining, according to the voice data of the set period, at least one voice submodule that needs to be deleted;
and deleting the at least one voice submodule from the existing voice model to obtain an updated voice model, wherein the existing voice model is pre-trained based on voice data of a user using the voice device.
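Assuming, purely for illustration, that the voice model is stored as a dictionary of submodules and that the set period's voice data yields a log of which submodules were exercised (neither representation is specified by this application), the pruning step could look like this sketch:

```python
from collections import Counter

# Hypothetical pruning sketch: count which submodules the period's voice data
# actually exercised and delete the rest.
def prune_model(model: dict, usage_log: list, min_uses: int = 1) -> dict:
    """Return an updated model without submodules used fewer than `min_uses` times."""
    counts = Counter(usage_log)
    return {name: sub for name, sub in model.items() if counts[name] >= min_uses}

model = {"wake": object(), "dictation": object(), "music_intent": object()}
updated = prune_model(model, usage_log=["wake", "dictation", "dictation"])
# "music_intent" was never exercised in the set period, so it is deleted.
```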
In some embodiments of the application, the speech model is trained with the user's own speech data, so that a new speech model adapted to the user is obtained; serving the user with the adapted model improves the accuracy of the model's results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1a is a schematic diagram of an architecture of a speech model update system 10 according to an exemplary embodiment of the present application;
FIG. 1b is a block diagram of a speech data processing system 20 according to an exemplary embodiment of the present application;
FIG. 2a is a schematic flow chart illustrating a speech model updating method according to an exemplary embodiment of the present application;
FIG. 2b is a flowchart illustrating a method for updating a speech model according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for updating a speech model according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for updating a speech model according to an exemplary embodiment of the present application;
FIG. 5a is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 5b is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 7 is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 9 is a flowchart illustrating a method for processing voice data according to an exemplary embodiment of the present application;
FIG. 10a is a schematic flowchart of a voice data processing method according to an exemplary embodiment of the present application;
FIG. 10b is a flowchart illustrating a model updating method according to an exemplary embodiment of the present application;
FIG. 10 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to an exemplary embodiment of the present application;
FIG. 12 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 13 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 14 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 15 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 16 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment of the present application;
FIG. 17 is a schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment of the present application;
FIG. 18 is a schematic structural diagram of a speech synthesis apparatus according to an exemplary embodiment of the present application;
FIG. 19 is a schematic structural diagram of a voiceprint recognition device according to an exemplary embodiment of the present application;
FIG. 20 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application;
FIG. 21 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, most mainstream voice products and hardware rely on a single large model that the service provider deploys uniformly in the cloud. When a large number of users use these products, concurrent requests are sent to the cloud to acquire information, so the cloud needs huge resources and must support highly concurrent, highly reliable voice engines and services. Training such a large cloud model generally has to accommodate the language characteristics and habits of all users at once; data acquisition and labeling are costly, model training and parameter tuning are difficult, engine upgrades and iteration take a long time, failures often cause large-scale downtime, and the model also suffers from low accuracy in use.
To address the low accuracy of existing voice models in use, in some embodiments of the application the voice model is trained with the user's own voice data, yielding a new voice model adapted to that user; because the user is then served by a model adapted to them, the accuracy of the model's results improves.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1a is a schematic architecture diagram of a speech model updating system 10 according to an exemplary embodiment of the present application. As shown in Fig. 1a, the speech model update system 10 includes a speech device 10a and a server 10b.
In this embodiment, in addition to its basic voice service functions, the voice device 10a may also have computing, communication, and internet access capabilities. The voice device 10a may be a smart phone, a personal computer, a wearable device, a tablet computer, or the like. Its basic service functions vary with the application scenario. For example, in a speech recognition scenario, the voice device 10a acquires voice data of a first user and performs text recognition on it with a speech recognition model to obtain the recognized text. In a voice wake-up scenario, the voice device 10a acquires the user's voice data, extracts a voice segment containing the wake-up word with a voice wake-up model, and, if that segment satisfies the wake-up condition, performs a wake-up operation so the device is ready for use. In a speech synthesis scenario, it acquires a text to be synthesized, synthesizes the text into voice data with a speech synthesis model, and plays the voice data. In a voiceprint recognition scenario, it acquires voice data of a first user, performs voiceprint recognition on it with a voiceprint recognition model, judges from the result whether the first user is a user of the voiceprint recognition device, and if so enters the use mode.
In this embodiment, the server 10b may provide data support, computing services, and some management services for the voice device 10a. The implementation form of the server 10b is not limited; for example, it may be a conventional server, a cloud host, a virtual center, or similar server equipment. Such a server mainly comprises a processor, a hard disk, memory, a system bus, and so on, following a conventional computer architecture. The server 10b may consist of one web server or several.
In the present embodiment, the voice device 10a and the server 10b establish a communication connection, wireless or wired. Optionally, the voice device 10a may connect to the server 10b over WiFi, Bluetooth, or infrared, or over a mobile network, whose format may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, or 5G.
In this embodiment, the voice device 10a labels the voice data of its user to obtain a labeled data set and sends the labeled data set to the server 10b; the server 10b trains the voice model currently used by the voice device 10a according to the labeled data set, obtaining a new voice model for providing the user with voice services; the server 10b then issues the new voice model to the voice device 10a, and the voice device 10a updates its currently used voice model with the new one. Training the voice model with the user's own voice data yields a new model adapted to that user, and serving the user with an adapted model improves the accuracy of the model's results.
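Taken together, the update loop between the voice device 10a and the server 10b can be sketched as follows; every name here is a hypothetical stand-in invented for illustration, not an API disclosed by this application.

```python
# End-to-end sketch of the update loop between device 10a and server 10b.
def update_cycle(device, server):
    voice_data = device.collect_user_speech()       # user's own recordings
    labeled_set = device.label(voice_data)          # labeled data set
    new_model = server.train(device.current_model,  # per-user training
                             labeled_set)
    device.current_model = new_model                # replace the old model
    return new_model
```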
Further, the voice device 10a includes an electronic display screen that presents an interactive interface through which the user interacts with the voice device 10a. The user labels his or her voice data with the voice device 10a to obtain a labeled data set in, but not limited to, the following embodiments:
the first implementation mode comprises the following steps: displaying a first interface, wherein reference text data are displayed on the first interface so that a user can input corresponding voice data; responding to the voice input operation of a user, and acquiring voice data of the user; and marking the voice data by using the reference text data to obtain a marked data set.
The second embodiment: responding to the uploading operation of the user and acquiring a custom text; responding to the voice input operation of a user, and acquiring voice data of the user; and marking the user-defined text by using voice to obtain a marked data set.
In the first embodiment, a second interface is displayed on the electronic display screen of the speech device 10a; the second interface includes a text input control and a voice data input control. In response to an input operation initiated by the user through the text input control, the reference text data entered by the user is acquired; alternatively, the reference text data is pre-stored in the voice device 10a and, in response to the user's triggering of the text input control, is retrieved and displayed in the second interface. The reference text data may be commonly used phrases. Each piece of reference text data may have a corresponding voice data entry; in response to a triggering operation on a voice data entry, the corresponding voice is recorded for each piece of reference text data, and the voice data is labeled in turn with the reference text data to obtain a labeled data set. It should be noted that the user may define the common text data himself. The guiding principle for selecting these common phrases is to cover as wide a range of pronunciations and sentence patterns as possible with as few recordings as possible, forming a minimal set that captures the speaker's voice characteristics.
In the second embodiment, a third interface is displayed on the electronic display screen of the speech device 10a, and a custom text entry is provided on the third interface; in response to the user entering custom text in the entry, the custom text data is acquired. The third interface also includes a voice data input control; in response to a triggering operation on a voice data entry, the corresponding voice is recorded for each piece of custom text data, and the voice data is labeled in turn with the custom text data to obtain a labeled data set. The custom text data is text outside the common phrases, defined by the user according to the actual situation.
It should be noted that the labeled data set in this embodiment may include at least one of reference text data and non-reference text data.
The first interface, the second interface and the third interface can be located on the same page or different pages, and can be flexibly adjusted according to actual conditions.
In another embodiment, after the voice device 10a obtains the voice data, it converts the voice data into corresponding first text data using the currently used voice model and computes the matching degree between the first text data and the reference text data. If the matching degree is greater than or equal to a set threshold, the first text data is taken as the labeling result of the voice data, yielding the labeled data set; if the matching degree is below the set threshold, a reminder is issued, for example a spoken prompt, asking the user to re-enter the voice data for the reference text data. The first voice data may be any piece of voice data; this application does not limit the set threshold, which can be adjusted to the actual situation. The matching degree between the first text data and the reference text data can be computed with an existing matching model.
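A sketch of this auto-labeling check follows, using difflib's SequenceMatcher as one possible matching-degree measure; both the measure and the threshold value are assumptions, since the application only requires some existing matching model.

```python
import difflib

# Auto-labeling sketch: accept the current model's transcription as the label
# only when it matches the reference text closely enough.
MATCH_THRESHOLD = 0.9  # assumed value of the set threshold

def auto_label(voice_data, reference_text: str, transcribe):
    first_text = transcribe(voice_data)  # currently used voice model
    match = difflib.SequenceMatcher(None, first_text, reference_text).ratio()
    if match >= MATCH_THRESHOLD:
        return (voice_data, first_text)  # labeling result for the data set
    return None  # caller reminds the user to re-record this reference text
```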
At present, in speech recognition scenarios, the voice device 10a recognizes everyday conversation fairly well but performs worse on person names, place names, organization names, and other items highly specific to the speaker. The voice device 10a may therefore also upload at least one of the locally stored contacts, song lists, and hot-word lists to the server 10b as non-reference text data, so that the server 10b can train the new voice model on the non-reference text data together with the labeled data set.
In an optional embodiment, after the voice device 10a obtains the labeled data set, it may select a training sample set from that set so that the server 10b trains the voice model currently used by the device on the training sample set. One implementation: display a first interface showing the labeled data set, where the labeled data set is obtained by labeling the voice data of the device's user; in response to a training data selection operation, select a training sample set from the labeled data set; and send the training sample set to the server, so that the server trains the voice model currently used by the voice device according to the training sample set and obtains a new voice model for providing the user with voice services. The device may also determine, from the data labeled so far, which types of labeled data are missing and display a reminder on the first interface listing those types, prompting the user to enter new labeled data.
After the speech device 10a acquires the labeled data set, it sends the set to the server 10b; upon receiving it, the server trains the voice model currently used by the speech device 10a according to the labeled data set. Before training, the server 10b also cleans and augments the labeled data set to generate richer data and improve the performance of the voice model.
In an alternative embodiment, the labeled data set is cleaned before the server 10b trains the model. Optionally, a confidence of the maximum likelihood ratio of the first labeled datum is computed, where the first labeled datum is any piece of data in the labeled data set; if its confidence is greater than or equal to a set threshold, the datum is used to train the voice model currently used by the voice device; if its confidence is below the set threshold, the datum is discarded. Cleaning the labeled data set in this way and training on the cleaned set improves the performance of the voice model. The confidence of the maximum likelihood ratio can be computed with an existing confidence algorithm.
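A minimal sketch of this cleaning step, with the confidence computation left as a stand-in because the application defers to existing confidence algorithms:

```python
# Data-cleaning sketch: keep only samples whose confidence clears the set
# threshold. `confidence` stands in for the confidence-of-maximum-likelihood-
# ratio algorithm, which this application does not spell out.
CONFIDENCE_THRESHOLD = 0.6  # assumed value of the set threshold

def clean(labeled_set, confidence):
    """Discard low-confidence labeled data before training."""
    return [sample for sample in labeled_set
            if confidence(sample) >= CONFIDENCE_THRESHOLD]
```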
In an alternative embodiment, the server 10b cleans and augments the labeled data set to generate richer data. Optionally, the user's voice data is expanded according to set application scenarios to obtain expanded voice data, and the expanded voice data is labeled to generate a labeled data set. For example, the voice signal of any piece of voice data can be processed for noisy, echoing, and quiet scenes: in a noise-and-echo scene, noise and an echo audio signal are added to the first voice data to obtain a noisy copy; in a quiet scene, the noise in the first voice data is removed to obtain a denoised copy. After obtaining the expanded voice data, the server 10b labels it to generate the labeled data set.
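For the noisy-scene case, a minimal augmentation sketch is shown below; a real system would also simulate echo, for example by convolving with a room impulse response, which is omitted here.

```python
import numpy as np

# Augmentation sketch: mix scaled white noise into a recording at a target
# signal-to-noise ratio to produce an expanded training sample.
def add_noise(signal: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    signal_power = float(np.mean(signal ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), signal.shape)
    return signal + noise

clean_clip = np.sin(np.linspace(0, 100, 16000))  # stand-in for 1 s of speech
noisy_clip = add_noise(clean_clip, snr_db=10.0)  # expanded, noisy copy
```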
In the above embodiment, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model can be triggered in at least one of the following ways (a combined sketch follows the list):
the first triggering mode is as follows: and responding to the periodic arrival event of the collected labeled data set, and training the current voice model used by the voice equipment according to the labeled data set to obtain a new voice model.
And a second triggering mode: and responding to a model training instruction sent by a user through the terminal equipment, and training the currently used voice model of the voice equipment according to the labeled data set to obtain a new voice model.
A third triggering mode: and responding that the capacity of the labeled data set reaches the preset capacity, and training the voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model.
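The three modes can be combined into a single check, as in this illustrative sketch (all parameter names are invented):

```python
# Combined sketch of the three triggering modes above.
def should_train(seconds_since_last: float, period_s: float,
                 user_requested: bool, n_samples: int, capacity: int) -> bool:
    return (seconds_since_last >= period_s  # mode 1: collection period arrived
            or user_requested               # mode 2: instruction from terminal
            or n_samples >= capacity)       # mode 3: preset capacity reached
```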
In another embodiment, no server is involved: the voice device labels the user's voice data to obtain a labeled data set; the voice device trains the voice model it currently uses according to the labeled data set, obtaining a new voice model for providing the user with voice services; and the voice device updates the currently used model with the new one. Training the voice model with the user's own voice data yields a new model adapted to that user, and serving the user with an adapted model improves the accuracy of the model's results.
In this embodiment, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing the user with voice services includes at least one of the following ways:
first, in response to the arrival of a collection period for the labeled data set;
second, in response to a model training instruction issued by the user through a terminal device;
third, in response to the labeled data set reaching a preset capacity;
and fourth, in response to a model training instruction issued by the user by voice.
In this embodiment, both the process of obtaining the labeled data set and the method of training the speech model can refer to the descriptions of the corresponding parts of the foregoing embodiments, and are not described herein again.
In the above embodiment of the speech model updating system, the speech model is trained with the user's own speech data, so that a new speech model adapted to the user is obtained; serving the user with the adapted model improves the accuracy of the model's results.
Fig. 1b is a schematic structural diagram of a speech data processing system 20 according to an exemplary embodiment of the present application. As shown in Fig. 1b, the speech data processing system 20 includes a first speech device 20a and a second speech device 20b.
In this embodiment, the first voice device 20a and the second voice device 20b may have functions of computing, communication, internet access, and the like, in addition to the basic voice service function. The first voice device 20a may be a smart phone, a personal computer, a wearable device, a tablet computer, or the like; the second speech device 20b may be a smart phone, a personal computer, a wearable device, a tablet computer, etc.
In the present embodiment, the first voice device 20a and the second voice device 20b establish a communication connection, wireless or wired. Optionally, the first voice device 20a may connect to the second voice device 20b over WiFi, Bluetooth, or infrared, or over a mobile network, whose format may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, or 5G.
In the speech processing system of this embodiment, the first speech device 20a and the second speech device 20b each include an electronic display screen presenting an interactive interface through which the user interacts with them. A first interface is displayed on the electronic display screen of the first voice device 20a and comprises a recording device switching control. In response to the user's triggering of that control, a voice input message is sent to the second voice device 20b; the user enters voice data on the second voice device 20b, and the first voice device 20a receives the voice data returned by the second voice device 20b and labels it to obtain a labeled data set. The sound quality of the second voice device is higher than that of the first voice device.
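The handoff can be pictured as a simple message exchange, as in the sketch below; the Device class and the message fields are invented for illustration and are not part of this application.

```python
# Minimal message-flow sketch of the recording handoff between the devices.
class Device:
    def __init__(self, name: str):
        self.name, self.inbox = name, []

    def send(self, other: "Device", message: dict) -> None:
        other.inbox.append((self.name, message))

first = Device("first_voice_device")    # lower sound quality, runs labeling
second = Device("second_voice_device")  # higher sound quality, records audio

# User taps the recording-device switching control on the first device:
first.send(second, {"type": "voice_input_request"})
# The second device records and returns the audio for labeling:
second.send(first, {"type": "voice_data", "audio": b"..."})
```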
In the above embodiment, the first speech device 20a responds to a second-interface opening operation by displaying a second interface that includes a text input control; in response to an input operation initiated by the user through the text input control, the reference text data entered by the user is acquired and displayed on the first interface so that the user can enter the corresponding voice data. For the process of labeling data with the reference text data, refer to the corresponding description above.
In the above embodiment of the voice data processing system, when high audio quality is required during data labeling, the second voice device, with its higher sound quality, is used to record the voice data, so a high-quality labeled data set can be obtained.
In addition to the above-mentioned speech model updating system 10, some embodiments of the present application also provide a speech model updating method, and the speech model updating method of the embodiments of the present application is not limited to the above-mentioned speech model updating system.
From the perspective of the speech device, Fig. 2a is a schematic flowchart of a speech model updating method provided in an exemplary embodiment of the present application. As shown in Fig. 2a, the method includes:
S211: labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
S212: sending the labeled data set to a server, so that the server trains the voice model currently used by the voice device according to the labeled data set and obtains a new voice model for providing voice services to the user;
S213: receiving the new voice model issued by the server, and updating the currently used voice model with the new voice model.
From the perspective of the server, Fig. 2b is a schematic flowchart of a speech model updating method provided in an exemplary embodiment of the present application. As shown in Fig. 2b, the method includes:
S221: receiving, by the server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
S222: training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
S223: issuing the new voice model to the voice device so that the voice device can update its currently used voice model.
In this embodiment, the execution subject is a voice device with a voice service function, and the voice device may also have functions of computing, communication, internet access, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
The basic service functions of the voice device vary with the application scenario. For example, in a speech recognition scenario, the voice device acquires voice data of a first user and performs text recognition on it with a speech recognition model to obtain the recognized text. In a voice wake-up scenario, the voice device acquires the user's voice data, extracts a voice segment containing the wake-up word with a voice wake-up model, and, if that segment satisfies the wake-up condition, performs a wake-up operation so the device is ready for use. In a speech synthesis scenario, it acquires a text to be synthesized, synthesizes the text into voice data with a speech synthesis model, and plays the voice data. In a voiceprint recognition scenario, it acquires voice data of a first user, performs voiceprint recognition on it with a voiceprint recognition model, judges from the result whether the first user is a user of the voiceprint recognition device, and if so enters the use mode.
In this embodiment, the voice device labels its user's voice data to obtain a labeled data set and sends the labeled data set to a server; the server trains the voice model currently used by the voice device according to the labeled data set, obtaining a new voice model for providing the user with voice services; the server then issues the new voice model to the voice device, which updates its currently used voice model with the new one. Training the voice model with the user's own voice data yields a new model adapted to that user, and serving the user with an adapted model improves the accuracy of the model's results. In other embodiments, a device other than the voice device may label the voice data to obtain the labeled data set.
Furthermore, the voice device comprises an electronic display screen that presents an interactive interface through which the user interacts with the device. The user labels his or her voice data with the voice device to obtain a labeled data set in, but not limited to, the following embodiments:
The first embodiment: displaying a first interface on which reference text data is shown so that the user can enter the corresponding voice data; in response to the user's voice input operation, acquiring the user's voice data; and labeling the voice data with the reference text data to obtain a labeled data set.
The second embodiment: in response to an upload operation by the user, acquiring custom text; in response to the user's voice input operation, acquiring the user's voice data; and labeling the voice data with the custom text to obtain a labeled data set.
In the first embodiment, a second interface is displayed on the electronic display screen of the voice device; the second interface includes a text input control and a voice data input control. In response to an input operation initiated by the user through the text input control, the reference text data entered by the user is acquired; alternatively, the reference text data is pre-stored on the voice device and, in response to the user's triggering of the text input control, is retrieved and displayed in the second interface. The reference text data may be commonly used phrases. Each piece of reference text data may have a corresponding voice data entry; in response to a triggering operation on a voice data entry, the corresponding voice is recorded for each piece of reference text data, and the voice data is labeled in turn with the reference text data to obtain a labeled data set. It should be noted that the user may define the common text data himself. The guiding principle for selecting these common phrases is to cover as wide a range of pronunciations and sentence patterns as possible with as few recordings as possible, forming a minimal set that captures the speaker's voice characteristics.
In the second embodiment, a third interface is displayed on the electronic display screen of the voice device, and a custom text entry is provided on the third interface; in response to the user entering custom text in the entry, the custom text data is acquired. The third interface also includes a voice data input control; in response to a triggering operation on a voice data entry, the corresponding voice is recorded for each piece of custom text data, and the voice data is labeled in turn with the custom text data to obtain a labeled data set. The custom text data is text outside the common phrases, defined by the user according to the actual situation.
It should be noted that the labeled data set in this embodiment may include at least one of reference text data and non-reference text data.
The first interface, the second interface and the third interface can be located on the same page or different pages, and can be flexibly adjusted according to actual conditions.
In another embodiment, after the voice device acquires the voice data, the voice device converts the voice data into corresponding first text data by using a currently used voice model; calculating the matching degree between the first text data and the reference text data, and taking the first text data as a labeling result of the voice data to obtain a labeling data set if the matching degree is greater than or equal to a set threshold value; if the matching degree is smaller than the set threshold, sending out a reminding message to remind the user to re-input the voice data corresponding to the reference text data, for example, playing the voice reminding message to remind the user to re-input the voice data corresponding to the reference text data. The first voice data is any voice data, the set threshold value is not limited in the application, and the set threshold value can be adjusted according to actual conditions. Wherein, the matching degree between the first text data and the reference text data can be calculated by using the existing matching degree calculation model.
At present, in speech recognition scenarios, a voice device recognizes ordinary conversation well but performs worse on names of people, places and organizations that are highly specific to the speaker. The voice device can therefore upload at least one of the locally stored contacts, song lists and hot-word lists to the server as non-reference text data, so that the server can train the new voice model on the non-reference text data together with the labeled data set.
In an optional embodiment, after the labeled data set is obtained, a training sample set is selected from it, and the server trains the voice model currently used by the voice device on that training sample set. One implementation is to display a first interface in which the labeled data set is shown, the labeled data set having been obtained by labeling voice data of the user of the voice device; in response to a training data selection operation, a training sample set is selected from the labeled data set; and the training sample set is sent to the server, so that the server trains the voice model currently used by the voice device on it to obtain a new voice model for providing voice services to the user. Furthermore, the types of labeled data that are still missing can be determined from the data labeled so far, and a reminder message containing those types can be displayed on the first interface to prompt the user to enter new labeled data.
After receiving the labeled data set, the server trains the voice model currently used by the voice device on it. Before training, the labeled data set also needs to be cleaned and augmented to generate richer data and improve the performance of the voice model.
In an alternative embodiment, the labeled data set is cleaned before the server trains the model. In one implementation, a confidence of the maximum likelihood ratio of first labeled data is calculated, where the first labeled data is any piece of data in the labeled data set. If the confidence of the first labeled data is greater than or equal to a set threshold, that data is used to train the voice model currently used by the voice device; if the confidence is smaller than the set threshold, the first labeled data is discarded. Cleaning the labeled data set in this way and training the voice model on the cleaned set improves the performance of the voice model. The confidence of the maximum likelihood ratio can be computed with an existing confidence algorithm.
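For illustration only, the confidence-based cleaning step can be sketched as follows; confidence is a hypothetical callable standing in for the existing maximum-likelihood-ratio confidence algorithm the application refers to:

    CONF_THRESHOLD = 0.7  # the set threshold is adjustable

    def clean_labeled_set(labeled_set, confidence):
        kept = []
        for item in labeled_set:
            if confidence(item) >= CONF_THRESHOLD:
                kept.append(item)  # keep for training
            # items below the threshold are discarded
        return kept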
In an alternative embodiment, the server cleans and augments the labeled data set to generate richer data. In one implementation, the voice data of the user is expanded according to a set application scenario to obtain expanded voice data, and the expanded voice data is labeled to generate a labeled data set. Optionally, the voice signal of any piece of voice data is processed for noisy, echoing or quiet scenarios to obtain the expanded voice data. For example, for noisy and echoing scenarios, noise and echo audio signals are added to the first voice data to obtain noise-added voice data; for a quiet scenario, the noise signal in the first voice data is removed to obtain denoised voice data. After obtaining the expanded voice data, the server labels it to generate the labeled data set.
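For illustration only, one way to expand an utterance for a noisy scenario is to mix in noise at a chosen signal-to-noise ratio, as in the NumPy sketch below; 16-bit PCM input is an assumption, and the echo and quiet-scenario variants would follow the same pattern:

    import numpy as np

    def add_noise(samples: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
        # Mix Gaussian noise into int16 PCM samples at the given SNR.
        signal_power = np.mean(samples.astype(np.float64) ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))
        noise = np.random.normal(0.0, np.sqrt(noise_power), samples.shape)
        noisy = samples.astype(np.float64) + noise
        return np.clip(noisy, -32768, 32767).astype(np.int16)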
In the above embodiment, training the voice model currently used by the voice device on the labeled data set to obtain a new voice model can be triggered in at least one of the following ways (a sketch combining the three triggers follows the list):
The first triggering mode: in response to an event that the collection period of the labeled data set arrives, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model.
The second triggering mode: in response to a model training instruction sent by the user through a terminal device, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model.
The third triggering mode: in response to the capacity of the labeled data set reaching a preset capacity, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model.
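For illustration only, the three triggers can be combined as in the sketch below; all names are assumptions, since this application specifies the conditions but no API:

    CAPACITY_LIMIT = 500  # the preset capacity of the labeled data set

    def maybe_train(labeled_set, period_elapsed, user_requested, train):
        if period_elapsed:                      # trigger 1: collection period arrived
            return train(labeled_set)
        if user_requested:                      # trigger 2: instruction from the terminal
            return train(labeled_set)
        if len(labeled_set) >= CAPACITY_LIMIT:  # trigger 3: preset capacity reached
            return train(labeled_set)
        return None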
In another embodiment, no server is involved: the voice device labels the voice data of its user to obtain a labeled data set; the voice device trains the voice model it currently uses on the labeled data set to obtain a new voice model for providing voice services to the user; and the voice device updates the currently used voice model with the new one. Training the voice model with the user's own voice data yields a new voice model adapted to that user, and using the adapted model improves the accuracy of the model's results.
In this embodiment, training a speech model currently used by a speech device according to an annotation data set to obtain a new speech model for providing a speech service for a user includes at least one of the following ways:
in a first way, in response to an event that the collection period of the labeled data set arrives, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
in a second way, in response to a model training instruction sent by the user through a terminal device, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
in a third way, in response to the capacity of the labeled data set reaching a preset capacity, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
and in a fourth way, in response to a model training instruction issued by the user by voice, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user.
In this embodiment, both the process of obtaining the labeled data set and the method of training the speech model can refer to the descriptions of the corresponding parts of the foregoing embodiments, and are not described herein again.
Fig. 3 is a flowchart illustrating a speech model updating method according to an exemplary embodiment of the present application, where as shown in fig. 3, the method includes:
s301: marking voice data of a user using a voice device to obtain a marked data set;
s302: training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for the user;
s303: the currently used speech model is replaced with a new speech model.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, the voice device labels the voice data of its user to obtain a labeled data set; trains the voice model it currently uses on the labeled data set to obtain a new voice model for providing voice services to the user; and updates the currently used voice model with the new one. Training the voice model with the user's own voice data yields a model adapted to that user, which improves the accuracy of the model's results.
In this embodiment, training a speech model currently used by a speech device according to an annotation data set to obtain a new speech model for providing a speech service for a user includes at least one of the following ways:
in a first way, in response to an event that the collection period of the labeled data set arrives, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
in a second way, in response to a model training instruction sent by the user through a terminal device, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
in a third way, in response to the capacity of the labeled data set reaching a preset capacity, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user;
and in a fourth way, in response to a model training instruction issued by the user by voice, the voice model currently used by the voice device is trained on the labeled data set to obtain a new voice model for providing voice services to the user.
In this embodiment, both the process of obtaining the labeled data set and the method of training the speech model can refer to the descriptions of the corresponding parts of the foregoing embodiments, and are not described herein again.
Fig. 4 is a schematic flowchart of a speech model updating method according to an embodiment of the present application, and as shown in fig. 4, the method includes:
s401: displaying a first interface, wherein a labeling data set is displayed in the first interface, and the labeling data set is obtained by labeling voice data of a user of the voice equipment;
s402: responding to the training data selection operation, and selecting a training sample set from the labeling data set;
s403: and sending the training sample set to a server so that the server trains the current voice model used by the voice equipment according to the training sample set to obtain a new voice model for providing voice service for the user.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, the first interface is displayed in response to an operation of opening the first interface by a user, where a tagging data set is displayed in the first interface, and the tagging data set is obtained by tagging voice data of a user using a voice device.
In this embodiment, a training sample set is selected from the annotation data set in response to a training data selection operation. An optional embodiment is that a labeled data set and a confirmation control are displayed on a first interface, and a training sample set is selected from the labeled data set in response to the triggering operation of a user on labeled data; and responding to the trigger operation of the confirmation control to obtain a training sample set.
In this embodiment, the training sample set is sent to the server, so that the server trains the speech model currently used by the speech device according to the training sample set. An optional embodiment is that a second interface is displayed, and the second interface comprises a data sending control; and responding to the triggering operation of the user on the data sending control, and sending the training sample set to the server so that the server trains the voice model currently used by the voice equipment according to the training sample set.
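For illustration only, sending the selected training sample set to the server might look like the sketch below; the URL and payload shape are assumptions, since this application does not define a wire format (the requests library is used here for brevity):

    import base64
    import requests

    def upload_training_set(samples, url="https://example.com/train"):
        payload = [
            {"audio": base64.b64encode(s["audio"]).decode("ascii"),  # raw audio bytes
             "label": s["label"]}
            for s in samples
        ]
        resp = requests.post(url, json={"training_set": payload}, timeout=30)
        resp.raise_for_status()  # the server then trains on the uploaded set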
In this embodiment, both the process of data annotation and the process of server training the speech model can be referred to the description of the corresponding parts of the above embodiments, and are not described herein again.
From the perspective of the first speech device, fig. 5a is a schematic flowchart of a speech data processing method provided in an exemplary embodiment of the present application. As shown in fig. 5a, the method comprises:
s511: displaying a first interface, wherein the first interface comprises a recording equipment switching control;
s512: responding to the triggering operation of the user on the recording equipment switching control, and sending a voice input message to second voice equipment so that the user can input voice data by adopting the second voice equipment, wherein the tone quality of the second voice equipment is higher than that of the first voice equipment;
s513: receiving voice data returned by the second voice equipment;
s514: and marking the voice data to obtain a marked data set.
From the perspective of the second speech device, fig. 5b is a schematic flowchart of a speech data processing method provided in an exemplary embodiment of the present application. As shown in fig. 5b, the method comprises:
s521: receiving a voice input message sent by first voice equipment, wherein the tone quality of second voice equipment is higher than that of the first voice equipment;
s522: responding to the triggering operation of the voice input message, and displaying a voice input page, wherein the voice input page comprises a recording control;
s523: responding to the triggering operation of the recording control to acquire the voice data of the user;
s524: and sending the voice data to the first voice equipment for labeling the voice data to obtain a labeled data set.
In this embodiment, the first voice device and the second voice device may have functions of computing, communication, internet access, and the like, in addition to the basic voice service function. The first voice equipment can be a smart phone, a personal computer, wearable equipment, a tablet computer and the like; the second voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In the voice processing system in the embodiment of the application, the first voice device and the second voice device include an electronic display screen, and the electronic display screen is provided with an interactive interface through which a user interacts with the first voice device and the second voice device. A first interface is displayed on an electronic display screen of the first voice device, and the first interface comprises a recording device switching control; responding to the triggering operation of the user on the recording equipment switching control, sending a voice input message to second voice equipment, inputting voice data by the user through the second voice equipment, and receiving the voice data returned by the second voice equipment through the first voice equipment; and marking the voice data to obtain a marked data set. And the tone quality of the second voice equipment is higher than that of the first voice equipment.
In the embodiments of this application, when the voice quality requirements during data labeling are high, the second voice device with higher sound quality is used to record the voice data, so that a higher-quality labeled data set can be obtained. For example, the first voice device is the user's mobile phone and the second voice device is a smart speaker whose sound quality is higher than the phone's. In response to the user tapping the smart speaker icon in an interface on the phone, a voice input message is sent to the smart speaker; the speaker's interface displays the message, and in response to the user triggering it, a recording program is started and voice data is acquired as the user speaks. Acquiring the voice data with the higher-quality smart speaker improves the quality of the labeled data.
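For illustration only, the hand-off between the two devices can be reduced to a request/reply exchange, with send and receive as hypothetical placeholders for whatever transport connects the phone and the smart speaker:

    def request_high_quality_recording(send, receive, reference_text):
        # First device: the user triggered the recording-device switch control.
        send({"type": "voice_input_request", "text": reference_text})
        reply = receive()  # blocks until the second device returns the audio
        return {"audio": reply["audio"], "label": reference_text}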
In the embodiment, the first voice device responds to a second interface opening operation to display a second interface, the second interface comprises a text input control, and responds to an input operation initiated by a user through the text input control to acquire reference text data input by the user; and displaying the reference text data on a first interface for a user to input corresponding voice data. For the process of data annotation by using the reference text data, reference may be made to the corresponding parts described above, and details are not described herein.
In the following, a voice data processing method is provided respectively by taking a voice recognition scene, a voice awakening scene, a voice synthesis scene and a voiceprint recognition scene as examples.
Fig. 6 is a flowchart illustrating a voice data processing method according to an exemplary embodiment of the present application.
As shown in fig. 6, the method includes:
s601: acquiring voice data of a first user;
s602: performing text recognition on voice data of a first user by using a local voice recognition model;
the local voice recognition model is obtained by pre-training based on voice data of a user using the voice recognition device.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, the speech recognition device performs text recognition on the speech data of the first user by using a speech recognition model obtained by pre-training speech data of the user, so as to obtain a more accurate text.
Fig. 7 is a flowchart illustrating a voice data processing method according to an exemplary embodiment of the present application.
As shown in fig. 7, the method includes:
s701: acquiring voice data of a user;
s702: extracting a voice segment containing a wake-up word from the voice data using a local voice wake-up model, where the local voice wake-up model is pre-trained on voice data of a user of the voice wake-up device;
s703: and if the voice segment of the awakening word meets the awakening condition, executing voice awakening operation.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, the voice wake-up device extracts a voice segment including a wake-up word from voice data by using a local voice wake-up model, determines whether the voice segment of the wake-up word meets a wake-up condition, and executes a voice wake-up operation if the voice segment of the wake-up word meets the wake-up condition, so as to wake-up the device for a user to interact with the voice wake-up device.
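For illustration only, the wake-up flow can be sketched as below; extract_segment, score and wake_device are hypothetical stand-ins for the local wake-up model and the device action, and the score threshold is one assumed form of the wake-up condition:

    WAKE_SCORE_THRESHOLD = 0.9

    def try_wake(audio, extract_segment, score, wake_device):
        segment = extract_segment(audio)  # segment containing the wake-up word, or None
        if segment is not None and score(segment) >= WAKE_SCORE_THRESHOLD:
            wake_device()                 # wake-up condition met
            return True
        return False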
Fig. 8 is a flowchart illustrating a voice data processing method according to an exemplary embodiment of the present application.
As shown in fig. 8, the method includes:
s801: acquiring a voice synthesis text;
s802: carrying out voice synthesis on the voice synthesis text to obtain voice data by utilizing a local voice synthesis model, wherein the local voice synthesis model is obtained by pre-training based on the voice data of a specific user;
s803: and playing the voice data.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, a voice synthesis text is acquired, including but not limited to the following acquisition modes:
the method comprises the steps that in the first obtaining mode, a voice synthesis text is obtained in response to the operation of inputting the voice synthesis text by a user;
the second acquisition mode is to respond to the operation of inputting voice data by a user and acquire the voice data; and performing text conversion on the voice data by using the voice recognition model to obtain a voice synthesis text.
The voice device performs voice synthesis on the voice synthesis text using the local voice synthesis model to obtain voice data, so that personalized voice data can be synthesized.
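For illustration only, the two acquisition modes followed by synthesis and playback can be sketched as below; recognize, synthesize and play are hypothetical stand-ins for the local models and the audio output:

    def speak(recognize, synthesize, play, user_text=None, user_audio=None):
        if user_text is not None:
            text = user_text              # mode 1: text entered by the user
        else:
            text = recognize(user_audio)  # mode 2: text converted from input speech
        audio = synthesize(text)          # local, speaker-adapted synthesis
        play(audio)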
Fig. 9 is a flowchart illustrating a voice data processing method according to an exemplary embodiment of the present application.
As shown in fig. 9, the method includes:
s901: acquiring voice data of a first user;
s902: performing voiceprint recognition on voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is obtained by pre-training based on the voice data of a specific user;
s903: judging whether the first user is a user of the voiceprint recognition device or not according to the voiceprint recognition result; if yes, executing step 904, otherwise, executing step 905;
s904: the voiceprint recognition device enters a use mode;
s905: and finishing the voiceprint recognition operation.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In other embodiments of this application, the voice sub-modules of the voice device, and how often they are used, may change with the usage scenario; likewise, some voice sub-modules are more tightly bound to the user than others. Voice data associated with selected sub-modules can therefore be chosen for model training according to the actual situation, yielding a more reasonable voice recognition model. Accordingly, fig. 10a is a schematic flowchart of a voice data processing method according to an exemplary embodiment of this application. As shown in fig. 10a, the method comprises:
s1001: acquiring voice data of a first user;
s1002: performing text recognition on voice data of a first user by using a local voice recognition model; the local voice recognition model is trained based on voice data of a selected voice sub-module of the voice recognition equipment.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In this embodiment, in scenarios such as requesting songs and listening to news, the kinds of songs and news played are highly specific to the user; the voice device therefore collects voice data from the sub-modules associated with song requests and news and uses it for model training, obtaining a voice model that is partially personalized.
In other embodiments of this application, the user's habits may change, and some voice sub-modules may be used rarely or not at all; the data of the corresponding sub-modules can then be deleted from the existing voice model to obtain a simplified model with better computational performance. Accordingly, fig. 10b is a schematic flowchart of a model updating method according to an exemplary embodiment of this application. As shown in fig. 10b, the method comprises:
s1011: acquiring voice data of a set period;
s1012: determining at least one voice submodule data needing to be deleted according to the voice data of the set period;
s1013: and deleting at least one voice submodule datum from the existing voice model to obtain an updated voice model, wherein the existing voice model is obtained by pre-training based on the voice data of the user using the voice equipment.
In this embodiment, the main execution body of the method is a voice device with a voice service function, and the voice device further has functions of computing, communicating, accessing the internet, and the like. The voice device can be a smart phone, a personal computer, a wearable device, a tablet computer and the like.
In the embodiment of the present application, the setting period is not limited, and the setting period may be one year, one month, or one week. The set period can be adjusted according to actual conditions.
In the embodiments of this application, after the voice data of the set period is obtained, it is analyzed to find the sub-module data that is missing from it or appears infrequently; such data corresponds to functions the user did not use, or rarely used, within the set period, so the data of the corresponding sub-modules can be deleted from the existing voice model to obtain a simplified model with better computational performance. For example, if the user has not used the Tmall Genie to listen to songs for a long time, the data of the song-listening module is deleted from the voice device, yielding a simplified voice model.
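For illustration only, pruning rarely used sub-module data can be sketched as below; the usage log and the dict layout of the model are assumptions, since this application only describes deleting data of sub-modules unused within the set period:

    from collections import Counter

    MIN_USES_PER_PERIOD = 1  # sub-modules used less often than this are pruned

    def prune_model(model: dict, usage_log: list) -> dict:
        counts = Counter(usage_log)  # e.g. ["song", "news", "song", ...]
        return {module: data for module, data in model.items()
                if counts[module] >= MIN_USES_PER_PERIOD}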
In this embodiment, the voiceprint recognition device performs voiceprint recognition on voice data by using a local voiceprint recognition model to generate a voiceprint recognition result; judging whether the first user is a user of the voiceprint recognition device or not according to the voiceprint recognition result; if the judgment result is yes, the voiceprint recognition device enters a use mode and can perform operations such as voiceprint registration and voiceprint login, and if the judgment result is no, the voiceprint recognition process is ended.
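For illustration only, the voiceprint gate can be sketched as below; verify is a hypothetical stand-in for the local voiceprint model, assumed here to return a similarity score:

    VOICEPRINT_THRESHOLD = 0.8

    def voiceprint_gate(audio, verify, enter_use_mode):
        if verify(audio) >= VOICEPRINT_THRESHOLD:  # recognized as the device's user
            enter_use_mode()                       # registration/login then allowed
            return True
        return False                               # otherwise end the recognition flow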
In the embodiment of the method, the voice model is trained by utilizing the voice data of the user, so that a new voice model adaptive to the user is obtained, and the user uses the voice model adaptive to the user, which is favorable for improving the accuracy of the model operation result.
Fig. 10 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 10, the apparatus includes: a memory 1001 and a processor 1002, as well as necessary components including a communication component 1003 and a power component 1004.
A memory 1001 for storing a computer program; it may also be configured to store various other data to support operations on the voice device. Examples of such data include instructions for any application or method operating on the voice device.
The memory 1001 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
A communication component 1003 for establishing a communication connection with other devices.
The processor 1002, may execute computer instructions stored in the memory 1001 for: marking voice data of a user to obtain a marked data set, wherein the user is a user of voice equipment; sending the labeled data set to a server so that the server can train a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for a user; and receiving a new voice model issued by the server, and updating the currently used voice model by using the new voice model.
Optionally, when the processor 1002 performs annotation on the voice data of the user to obtain an annotated data set, the processor is specifically configured to: displaying a first interface, wherein reference text data are displayed on the first interface so that a user can input corresponding voice data; responding to the voice input operation of a user, and acquiring voice data of the user; and marking the voice data by using the reference text data to obtain a marked data set.
Optionally, before presenting the first interface, the processor 1002 may be further configured to: displaying a second interface, wherein the second interface comprises a text input control; and responding to the input operation initiated by the user through the text input control, and acquiring the reference text data input by the user.
Optionally, when the processor 1002 labels the voice data by using the reference text data to obtain a labeled data set, specifically configured to: converting the voice data into corresponding first text data by using a currently used voice model; calculating the matching degree between the first text data and the reference text data; and if the matching degree is greater than or equal to the set threshold, taking the first text data as the labeling result of the voice data to obtain a labeling data set.
Optionally, the processor 1002 may be further configured to: and if the matching degree is smaller than the set threshold, sending out reminding information to remind the user to re-input the voice data corresponding to the reference text data.
Optionally, the processor 1002 may be further configured to: and acquiring non-reference text data, and uploading the non-reference text data to a server so that the server trains a new voice model by combining the non-reference text data and the labeled data set.
Optionally, when the processor 1002 is configured to obtain the non-reference text data, it is specifically configured to: and acquiring at least one of contact, song list and hotword list locally stored by the voice equipment as non-reference text data.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 2a.
Fig. 11 is a schematic structural diagram of a server according to an exemplary embodiment of the present application. As shown in fig. 11, the server includes: a memory 1101 and a processor 1102, as well as necessary components including a communications component 1103 and a power component 1104.
A memory 1101 for storing a computer program; it may also be configured to store various other data to support operations on the server. Examples of such data include instructions for any application or method operating on the server.
The memory 1101 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1103 for establishing a communication connection with the other device.
The processor 1102 may execute computer instructions stored in the memory 1101 to: receive a labeled data set sent by the voice device, where the labeled data set is obtained by labeling the voice data of a user of the voice device; train the voice model currently used by the voice device on the labeled data set to obtain a new voice model; and send the new voice model to the voice device so that the voice device can update its currently used voice model.
Optionally, the processor 1102, when training the speech model currently used by the speech device according to the annotation data set to obtain a new speech model, includes at least one of: responding to a periodic arrival event of the collected tagged data set, and training a voice model currently used by the voice equipment according to the tagged data set to obtain a new voice model; responding to a model training instruction sent by a user through terminal equipment, and training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model; and responding that the capacity of the labeled data set reaches the preset capacity, and training the voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model.
Optionally, the processor 1102 is further configured to: and receiving actual voice data sent by the voice equipment, and updating the new voice model, wherein the actual voice data is the voice data corrected by the user of the voice equipment.
Optionally, the processor 1102, before training the speech model currently used by the speech device according to the annotation data set to obtain a new speech model, may further be configured to: calculating the confidence of the maximum likelihood ratio of the first annotation data, wherein the first annotation data is any one piece of annotation data in the annotation data set; and if the confidence coefficient of the first labeling data is greater than or equal to the set threshold, training the current voice model of the voice equipment by using the labeling data.
Optionally, the processor 1102 is further configured to: and if the confidence coefficient of the first labeling data is less than the set threshold, discarding the first labeling data.
Optionally, the processor 1102, before training the speech model currently used by the speech device according to the annotation data set to obtain a new speech model, may further be configured to: according to a set application scene, expanding the voice data of a user to obtain expanded voice data; and marking the expanded voice data to generate a marked data set.
Optionally, when the processor 1102 expands the voice data of the user according to the set application scenario to obtain expanded voice data, the processor is specifically configured to: and processing the voice signal of any voice data in a noise, echo and quiet scene to obtain the expanded voice data.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 2b.
Fig. 12 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 12, the apparatus includes: a memory 1201 and a processor 1202, as well as necessary components including a communication component 1203 and a power component 1204.
A memory 1201 for storing a computer program; it may also be configured to store various other data to support operations on the voice device. Examples of such data include instructions for any application or method operating on the voice device.
The memory 1201 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1203 is configured to establish a communication connection with the other device.
A processor 1202 that can execute computer instructions stored in memory 1201 to: marking voice data of a user using a voice device to obtain a marked data set; training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for the user; the currently used speech model is replaced with a new speech model.
Optionally, the processor 1202 in training the speech model currently used by the speech device according to the annotation data set to obtain a new speech model for providing speech service for the user includes at least one of: responding to a cycle arrival event of collecting a labeled data set, and training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for a user; responding to a model training instruction sent by a user through terminal equipment, and training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for the user; responding that the capacity of the labeled data set reaches the preset capacity, and training a voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for the user; and responding to a model training instruction sent by the user in a voice mode, and training the voice model currently used by the voice equipment according to the labeled data set to obtain a new voice model for providing voice service for the user.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 3.
Fig. 13 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 13, the apparatus includes: memory 1301 and processor 1302, as well as necessary components including communications component 1303 and power component 1304.
A memory 1301 for storing a computer program; it may also be configured to store various other data to support operations on the voice device. Examples of such data include instructions for any application or method operating on the voice device.
The memory 1301, may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
And a communication component 1303 configured to establish a communication connection with another device.
Processor 1302, may execute computer instructions stored in memory 1301 for: displaying a first interface, wherein a labeling data set is displayed in the first interface, and the labeling data set is obtained by labeling voice data of a user of the voice equipment; responding to the training data selection operation, and selecting a training sample set from the labeling data set; and sending the training sample set to a server so that the server trains the current voice model used by the voice equipment according to the training sample set to obtain a new voice model for providing voice service for the user.
Optionally, the processor 1302 may be further configured to: and displaying a reminding message on the first interface, wherein the reminding message comprises the type of the missing annotation data so as to remind the user of inputting new annotation data.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 4.
Fig. 14 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 14, the apparatus includes: memory 1401 and processor 1402, as well as necessary components including communications component 1403 and power component 1404.
A memory 1401 for storing a computer program; it may also be configured to store various other data to support operations on the voice device. Examples of such data include instructions for any application or method operating on the voice device.
The memory 1401, which may be implemented by any type of volatile or non-volatile memory device or combination thereof, may include a Static Random Access Memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
A communication component 1403 for establishing a communication connection with the other device.
The processor 1402, may execute computer instructions stored in the memory 1401 for: displaying a first interface, wherein the first interface comprises a recording equipment switching control; responding to the triggering operation of the user on the recording equipment switching control, and sending a voice input message to second voice equipment so that the user can input voice data by adopting the second voice equipment, wherein the tone quality of the second voice equipment is higher than that of the first voice equipment; receiving voice data returned by the second voice equipment; and marking the voice data to obtain a marked data set.
Optionally, the processor 1402 may be further configured to: displaying a second interface, wherein the second interface comprises a text input control; responding to an input operation initiated by a user through a text input control, and acquiring reference text data input by the user; and displaying the reference text data on a first interface for a user to input corresponding voice data.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 5a.
Fig. 15 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 15, the apparatus includes: memory 1501 and processor 1502, as well as necessary components including communications component 1503 and power component 1504.
A memory 1501 for storing a computer program; it may also be configured to store various other data to support operations on the voice device. Examples of such data include instructions for any application or method operating on the voice device.
The memory 1501 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1503 for establishing a communication connection with other devices.
The processor 1502, which may execute computer instructions stored in the memory 1501, is configured to: receiving a voice input message sent by first voice equipment, wherein the tone quality of second voice equipment is higher than that of the first voice equipment; responding to the triggering operation of the voice input message, and displaying a voice input page, wherein the voice input page comprises a recording control; responding to the triggering operation of the recording control to acquire the voice data of the user; and sending the voice data to the first voice equipment for labeling the voice data to obtain a labeled data set.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 5b.
Fig. 16 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment of the present application. As shown in fig. 16, the apparatus includes: memory 1601 and processor 1602, as well as necessary components including communications components 1603 and power components 1604.
A memory 1601 for storing a computer program; it may also be configured to store various other data to support operations on the speech recognition device. Examples of such data include instructions for any application or method operating on the speech recognition device.
The memory 1601, which may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1603 for establishing a communication connection with the other device.
Processor 1602, may execute computer instructions stored in memory 1601 to: acquiring voice data of a first user; performing text recognition on voice data of a first user by using a local voice recognition model; the local voice recognition model is obtained by pre-training based on voice data of a user using the voice recognition device.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 6.
Fig. 17 is a schematic structural diagram of a voice wake-up device according to an exemplary embodiment of the present application. As shown in fig. 17, the apparatus includes: memory 1701 and processor 1702, as well as necessary components including communications components 1703 and power components 1704.
The memory 1701 is used for storing a computer program; it may also be configured to store various other data to support operations on the voice wake-up device. Examples of such data include instructions for any application or method operating on the voice wake-up device.
The memory 1701 may be implemented using any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1703 for establishing a communication connection with other devices.
The processor 1702 may execute computer instructions stored in the memory 1701 to: acquire voice data of a user; extract a voice segment containing a wake-up word from the voice data using a local voice wake-up model, where the local voice wake-up model is pre-trained on voice data of a user of the voice wake-up device; and, if the voice segment of the wake-up word meets the wake-up condition, execute a voice wake-up operation.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 7.
Fig. 18 is a schematic structural diagram of a speech synthesis apparatus according to an exemplary embodiment of the present application. As shown in fig. 18, the apparatus includes: memory 1801 and processor 1802, as well as necessary components including communications component 1803 and power component 1804.
The memory 1801 is used for storing a computer program; it may also be configured to store various other data to support operations on the speech synthesis device. Examples of such data include instructions for any application or method operating on the speech synthesis device.
The memory 1801 may be implemented using any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1803, configured to establish a communication connection with another device.
The processor 1802, may execute the computer instructions stored in the memory 1801 to: acquiring a voice synthesis text; carrying out voice synthesis on the voice synthesis text to obtain voice data by utilizing a local voice synthesis model, wherein the local voice synthesis model is obtained by pre-training based on the voice data of a specific user; and playing the voice data.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 8.
Fig. 19 is a schematic structural diagram of a voiceprint recognition device according to an exemplary embodiment of the present application. As shown in fig. 19, the apparatus includes: a memory 1901 and a processor 1902, and further includes necessary components such as a communication component 1903 and a power component 1904.
A memory 1901 for storing a computer program; it may also be configured to store various other data to support operations on the voiceprint recognition device. Examples of such data include instructions for any application or method operating on the voiceprint recognition device.
The memory 1901 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 1903 for establishing a communication connection with other devices.
The processor 1902, may execute computer instructions stored in the memory 1901 to: acquiring voice data of a first user; performing voiceprint recognition on voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is obtained by pre-training based on the voice data of a specific user; judging whether the first user is a user of the voiceprint recognition device or not according to the voiceprint recognition result; if the judgment result is yes, the voiceprint recognition device enters the use mode.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 9.
Fig. 20 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 20, the apparatus includes: a memory 2001 and a processor 2002, as well as necessary components including a communication component 2003 and a power component 2004.
A memory 2001 for storing a computer program; it may also be configured to store various other data to support operations on the speech recognition device. Examples of such data include instructions for any application or method operating on the speech recognition device.
The memory 2001, which may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 2003 for establishing communication connections with other devices.
The processor 2002, which may execute computer instructions stored in the memory 2001, is to: acquiring voice data of a first user; performing text recognition on voice data of a first user by using a local voice recognition model; the local voice recognition model is trained based on voice data of a selected voice sub-module of the voice recognition equipment.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 10a.
Fig. 21 is a schematic structural diagram of a speech device according to an exemplary embodiment of the present application. As shown in fig. 21, the apparatus includes: a memory 2101 and a processor 2102, as well as necessary components including a communications component 2103 and a power component 2104.
A memory 2101 for storing a computer program; it may also be configured to store various other data to support operations on the speech recognition device. Examples of such data include instructions for any application or method operating on the speech recognition device.
The memory 2101 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A communication component 2103 for establishing a communication connection with other devices.
A processor 2102 that can execute computer instructions stored in the memory 2101 to: acquiring voice data of a set period; determining at least one voice submodule data needing to be deleted according to the voice data of the set period; and deleting at least one voice submodule datum from the existing voice model to obtain an updated voice model, wherein the existing voice model is obtained by pre-training based on the voice data of the user using the voice equipment.
Correspondingly, an embodiment of this application also provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps of the method embodiment of fig. 10b.
In the above device embodiment of the present application, the speech model is trained by using the speech data of the user, so as to obtain a new speech model adapted to the user, and the user uses the speech model adapted to the user, which is beneficial to improving the accuracy of the model operation result.
The communication components of figs. 10 to 21 described above are configured to facilitate wired or wireless communication between the device containing the communication component and other devices. The device containing the communication component may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further supports short-range communication technologies such as near field communication (NFC), radio frequency identification (RFID), infrared data association (IrDA), ultra-wideband (UWB), Bluetooth (BT), and the like.
The power supply components of fig. 10-19 described above provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (43)

1. A voice model updating method, adapted to a voice device, characterized by comprising the following steps:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and receiving a new voice model issued by the server, and updating the currently used voice model by using the new voice model.
2. The method of claim 1, wherein labeling the voice data of the user to obtain the labeled data set comprises:
displaying a first interface, wherein reference text data are displayed on the first interface so that a user can input corresponding voice data;
responding to the voice input operation of a user, and acquiring voice data of the user;
and labeling the voice data with the reference text data to obtain the labeled data set.
3. The method of claim 2, prior to presenting the first interface, further comprising:
displaying a second interface, wherein the second interface comprises a text input control;
and responding to the input operation initiated by the user through the text input control, and acquiring the reference text data input by the user.
4. The method of claim 2, wherein labeling the voice data with the reference text data to obtain the labeled data set comprises:
converting the voice data into corresponding first text data by using a currently used voice model;
calculating the matching degree between the first text data and the reference text data;
and if the matching degree is greater than or equal to a set threshold, taking the first text data as the labeling result of the voice data to obtain the labeled data set.
5. The method of claim 4, further comprising:
and if the matching degree is smaller than a set threshold value, sending out reminding information to remind a user of re-inputting the voice data corresponding to the reference text data.
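By way of illustration only, the matching-degree check of claims 4 and 5 could be realized along the lines of the following Python sketch, which assumes a character-level similarity ratio as the matching degree; the function names and the 0.9 threshold are assumptions invented for the example, not part of this application.

import difflib

def matching_degree(recognized: str, reference: str) -> float:
    # Character-level similarity in [0, 1], used here as the matching degree.
    return difflib.SequenceMatcher(None, recognized, reference).ratio()

def try_label(recognized: str, reference: str, threshold: float = 0.9):
    # If the recognition result matches the reference text closely enough,
    # take it as the labeling result; otherwise remind the user to re-record.
    if matching_degree(recognized, reference) >= threshold:
        return {"text": recognized}  # would be appended to the labeled data set
    print("Reminder: please re-enter the voice data for the reference text.")
    return None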
6. The method of claim 1, further comprising:
and acquiring non-reference text data, and uploading the non-reference text data to a server so that the server trains a new voice model by combining the non-reference text data and the labeled data set.
7. The method of claim 6, wherein obtaining non-reference textual data comprises:
and acquiring at least one of a contact list, a song list, and a hot-word list locally stored on the voice device as the non-reference text data.
8. A voice model updating method, adapted to a server, characterized by comprising the following steps:
receiving, by the server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update the currently used voice model.
9. The method of claim 8, wherein training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model comprises at least one of:
in response to arrival of a collection period of the labeled data set, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
in response to a model training instruction sent by the user through a terminal device, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and in response to the size of the labeled data set reaching a preset size, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model.
10. The method of claim 8, further comprising:
and receiving actual voice data sent by the voice device and updating the new voice model accordingly, wherein the actual voice data is voice data corrected by the user of the voice device.
11. The method of claim 8, further comprising, before training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model:
calculating a maximum-likelihood-ratio confidence of first labeled data, wherein the first labeled data is any one piece of labeled data in the labeled data set;
and if the confidence of the first labeled data is greater than or equal to a set threshold, training the voice model currently used by the voice device with the first labeled data.
12. The method of claim 11, further comprising:
and if the confidence of the first labeled data is less than the set threshold, discarding the first labeled data.
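By way of illustration only, the confidence filtering of claims 11 and 12 might be sketched in Python as follows, assuming each labeled item already carries a confidence derived from its maximum likelihood ratio and scaled to [0, 1]; the data layout and the 0.8 threshold are assumptions invented for the example.

from typing import Iterable, List, Tuple

# Each labeled item: (audio_id, label_text, confidence).
LabeledItem = Tuple[str, str, float]

def filter_by_confidence(items: Iterable[LabeledItem],
                         threshold: float = 0.8) -> List[LabeledItem]:
    # Keep items whose confidence meets the set threshold; discard the rest.
    return [item for item in items if item[2] >= threshold]

kept = filter_by_confidence([("u1", "play music", 0.95), ("u2", "???", 0.3)])
print(kept)  # only ("u1", "play music", 0.95) survives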
13. The method according to any one of claims 8-12, further comprising, before training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model:
expanding the voice data of the user according to a set application scenario to obtain expanded voice data;
and labeling the expanded voice data to generate the labeled data set.
14. The method of claim 13, wherein expanding the voice data of the user according to the set application scenario to obtain expanded voice data comprises:
and processing the voice signal of any piece of voice data under noise, echo, and quiet scenes to obtain the expanded voice data.
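By way of illustration only, the scene expansion of claim 14 could be sketched in Python as follows, assuming floating-point mono audio samples; the specific noise level, echo delay, and attenuation parameters are invented for the example, not part of this application.

import numpy as np

def add_noise(samples: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    # Mix in white noise at a target signal-to-noise ratio (noise scene).
    signal_power = float(np.mean(samples ** 2)) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return samples + np.random.normal(0.0, np.sqrt(noise_power), samples.shape)

def add_echo(samples: np.ndarray, delay: int = 1600, gain: float = 0.4) -> np.ndarray:
    # Add one delayed, attenuated copy of the signal (a crude echo scene).
    out = samples.copy()
    if 0 < delay < len(samples):
        out[delay:] += gain * samples[:-delay]
    return out

def make_quiet(samples: np.ndarray, gain: float = 0.2) -> np.ndarray:
    # Attenuate the signal to imitate a quiet-scene recording.
    return samples * gain

def expand(samples: np.ndarray) -> list:
    # One original utterance becomes three scene variants for training.
    return [add_noise(samples), add_echo(samples), make_quiet(samples)]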
15. A speech device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and receiving a new voice model issued by the server, and updating the currently used voice model by using the new voice model.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
labeling voice data of a user to obtain a labeled data set, wherein the user is a user of the voice device;
sending the labeled data set to a server, so that the server trains a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and receiving a new voice model issued by the server, and updating the currently used voice model by using the new voice model.
17. A server, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
receiving a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update the currently used voice model.
18. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
receiving, by a server, a labeled data set sent by a voice device, wherein the labeled data set is obtained by labeling voice data of a user of the voice device;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model;
and issuing the new voice model to the voice device so that the voice device can update the currently used voice model.
19. A voice model updating method, adapted to a voice device, characterized by comprising the following steps:
labeling voice data of a user of the voice device to obtain a labeled data set;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
20. The method of claim 19, wherein training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user comprises at least one of:
in response to arrival of a collection period of the labeled data set, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
in response to a model training instruction sent by the user through a terminal device, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
in response to the size of the labeled data set reaching a preset size, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and in response to a model training instruction issued by the user by voice, training the voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user.
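By way of illustration only, the alternative training triggers of claim 20 might be combined as in the following Python sketch, where any single trigger is assumed sufficient to start training; the function signature and defaults are assumptions invented for the example.

import time

def should_train(last_train_time: float, period_seconds: float,
                 instruction_received: bool, voice_instruction_received: bool,
                 dataset_size: int, preset_size: int) -> bool:
    # Trigger 1: the collection period for the labeled data set has arrived.
    period_arrived = (time.time() - last_train_time) >= period_seconds
    # Trigger 4: the labeled data set has reached the preset size.
    size_reached = dataset_size >= preset_size
    # Triggers 2 and 3: an explicit instruction from a terminal device or by voice.
    return (period_arrived or instruction_received
            or voice_instruction_received or size_reached)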
21. A speech device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
labeling voice data of a user of the voice device to obtain a labeled data set;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
22. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
labeling voice data of a user of a voice device to obtain a labeled data set;
training a voice model currently used by the voice device according to the labeled data set to obtain a new voice model for providing voice services to the user;
and replacing the currently used voice model with the new voice model.
23. A voice model updating method, adapted to a voice device, characterized by comprising the following steps:
displaying a first interface, wherein a labeled data set is displayed in the first interface, and the labeled data set is obtained by labeling voice data of a user of the voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server so that the server trains a voice model currently used by the voice device according to the training sample set to obtain a new voice model for providing voice services to the user.
24. The method of claim 23, further comprising:
and displaying a reminder message on the first interface, wherein the reminder message indicates the type of labeled data that is missing, so as to remind the user to input new labeled data.
25. A speech device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
displaying a first interface, wherein a labeled data set is displayed in the first interface, and the labeled data set is obtained by labeling voice data of a user of the voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server so that the server trains a voice model currently used by the voice device according to the training sample set to obtain a new voice model for providing voice services to the user.
26. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
displaying a first interface, wherein a labeled data set is displayed in the first interface, and the labeled data set is obtained by labeling voice data of a user of a voice device;
in response to a training data selection operation, selecting a training sample set from the labeled data set;
and sending the training sample set to a server so that the server trains a voice model currently used by the voice device according to the training sample set to obtain a new voice model for providing voice services to the user.
27. A voice data processing method, adapted to a first voice device, characterized by comprising the following steps:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to a trigger operation by the user on the recording device switching control, sending a voice input message to a second voice device so that the user can input voice data with the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
28. The method of claim 27, further comprising:
displaying a second interface, wherein the second interface comprises a text input control;
responding to an input operation initiated by a user through the text input control, and acquiring reference text data input by the user;
and displaying the reference text data on a first interface for a user to input corresponding voice data.
29. A voice data processing method applied to a second voice device, comprising:
receiving a voice input message sent by a first voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
responding to the triggering operation of the voice input message, and displaying a voice input page, wherein the voice input page comprises a recording control;
responding to the triggering operation of the recording control to acquire the voice data of the user;
and sending the voice data to the first voice device for labeling to obtain a labeled data set.
30. A speech device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to a trigger operation by the user on the recording device switching control, sending a voice input message to a second voice device so that the user can input voice data with the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
31. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
displaying a first interface, wherein the first interface comprises a recording device switching control;
in response to a trigger operation by the user on the recording device switching control, sending a voice input message to a second voice device so that the user can input voice data with the second voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
receiving voice data returned by the second voice device;
and labeling the voice data to obtain a labeled data set.
32. A speech device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
receiving a voice input message sent by a first voice device, wherein the sound quality of the second voice device is higher than that of the first voice device;
responding to the triggering operation of the voice input message, and displaying a voice input page, wherein the voice input page comprises a recording control;
responding to the triggering operation of the recording control to acquire the voice data of the user;
and sending the voice data to the first voice device for labeling to obtain a labeled data set.
33. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts comprising:
receiving a voice input message sent by a first voice device, wherein the sound quality of a second voice device is higher than that of the first voice device;
responding to the triggering operation of the voice input message, and displaying a voice input page, wherein the voice input page comprises a recording control;
responding to the triggering operation of the recording control to acquire the voice data of the user;
and sending the voice data to the first voice device for labeling to obtain a labeled data set.
34. A speech data processing method adapted to a speech recognition device, the method comprising:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is pre-trained based on speech data of a user of the speech recognition device.
35. A speech recognition device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is pre-trained based on speech data of a user of the speech recognition device.
36. A voice data processing method, adapted to a voice wake-up device, characterized by comprising the following steps:
acquiring voice data of a user;
extracting a voice segment containing a wake-up word from the voice data by using a local voice wake-up model, wherein the local voice wake-up model is obtained by pre-training based on voice data of a user of the voice wake-up device;
and if the voice segment containing the wake-up word meets a wake-up condition, executing a voice wake-up operation.
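By way of illustration only, the wake-up segment extraction of claim 36 might be sketched in Python as follows, assuming the local voice wake-up model outputs a per-frame wake-word score and that the wake-up condition is a minimum run of frames at or above a threshold; both assumptions are invented for the example, not part of this application.

import numpy as np

def find_wake_segment(frame_scores: np.ndarray,
                      threshold: float = 0.7,
                      min_frames: int = 20):
    # frame_scores: assumed per-frame wake-word posteriors from the local model.
    start = None
    for i, above in enumerate(frame_scores >= threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            if i - start >= min_frames:
                return (start, i)  # (first frame, last frame + 1) of the segment
            start = None
    if start is not None and len(frame_scores) - start >= min_frames:
        return (start, len(frame_scores))
    return None  # no qualifying segment: do not wake the device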
37. A voice wake-up device, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a user;
extracting a voice segment containing a wake-up word from the voice data by using a local voice wake-up model, wherein the local voice wake-up model is obtained by pre-training based on voice data of a user of the voice wake-up device;
and if the voice segment containing the wake-up word meets a wake-up condition, executing a voice wake-up operation.
38. A voice data processing method, adapted to a speech synthesis device, characterized by comprising the following steps:
acquiring a voice synthesis text;
performing voice synthesis on the voice synthesis text by using a local voice synthesis model to obtain voice data, wherein the local voice synthesis model is obtained by pre-training based on voice data of a specific user;
and playing the voice data.
39. A speech synthesis apparatus, characterized by comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring a voice synthesis text;
performing voice synthesis on the voice synthesis text by using a local voice synthesis model to obtain voice data, wherein the local voice synthesis model is obtained by pre-training based on voice data of a specific user;
and playing the voice data.
40. A voice data processing method, adapted to a voiceprint recognition device, characterized by comprising the following steps:
acquiring voice data of a first user;
performing voiceprint recognition on the voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is obtained by pre-training based on the voice data of a specific user;
judging, according to the voiceprint recognition result, whether the first user is a user of the voiceprint recognition device;
and if so, controlling the voiceprint recognition device to enter a usable mode.
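By way of illustration only, the judgment step of claim 40 might be realized as a cosine-similarity comparison of voiceprint embeddings, as in the following Python sketch; the embedding representation and the 0.75 threshold are assumptions invented for the example, not part of this application.

import numpy as np

def is_device_user(query_embedding: np.ndarray,
                   enrolled_embedding: np.ndarray,
                   threshold: float = 0.75) -> bool:
    # Compare the voiceprint embedding of the first user against the enrolled
    # embedding of the device's user via cosine similarity.
    denom = (np.linalg.norm(query_embedding)
             * np.linalg.norm(enrolled_embedding)) + 1e-12
    cosine = float(np.dot(query_embedding, enrolled_embedding) / denom)
    return cosine >= threshold  # True: enter the usable mode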
41. A voiceprint recognition apparatus, comprising: a memory and a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to:
acquiring voice data of a first user;
performing voiceprint recognition on the voice data by using a local voiceprint recognition model to generate a voiceprint recognition result, wherein the local voiceprint recognition model is obtained by pre-training based on the voice data of a specific user;
judging, according to the voiceprint recognition result, whether the first user is a user of the voiceprint recognition device;
and if so, controlling the voiceprint recognition device to enter a usable mode.
42. A speech data processing method adapted to a speech recognition device, the method comprising:
acquiring voice data of a first user;
performing text recognition on the voice data of the first user by using a local voice recognition model;
wherein the local speech recognition model is trained based on speech data of selected speech sub-modules of the speech recognition device.
43. A model updating method applied to a speech recognition device, the method comprising:
acquiring voice data of a set period;
determining, according to the voice data of the set period, at least one piece of voice sub-module data to be deleted;
and deleting the at least one piece of voice sub-module data from the existing voice model to obtain an updated voice model, wherein the existing voice model is pre-trained based on voice data of a user of the voice device.
CN201911285907.6A 2019-12-13 2019-12-13 Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium Pending CN113066482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911285907.6A CN113066482A (en) 2019-12-13 2019-12-13 Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium


Publications (1)

Publication Number Publication Date
CN113066482A (en) 2021-07-02

Family

ID=76558257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911285907.6A Pending CN113066482A (en) 2019-12-13 2019-12-13 Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium

Country Status (1)

Country Link
CN (1) CN113066482A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472167A (en) * 2022-08-17 2022-12-13 南京龙垣信息科技有限公司 Voiceprint recognition model training method and system based on big data self-supervision

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105791966A (en) * 2016-03-11 2016-07-20 四川长虹电器股份有限公司 Method for switching microphone audio equipment for Android smart television
CN106201419A (en) * 2016-06-28 2016-12-07 乐视控股(北京)有限公司 Audio frequency input control method, device and terminal
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN107331402A (en) * 2017-06-19 2017-11-07 依偎科技(南昌)有限公司 A kind of way of recording and sound pick-up outfit based on dual microphone
CN107357875A (en) * 2017-07-04 2017-11-17 北京奇艺世纪科技有限公司 A kind of voice search method, device and electronic equipment
CN107578769A (en) * 2016-07-04 2018-01-12 科大讯飞股份有限公司 Speech data mask method and device
CN107610702A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Terminal device standby wakeup method, apparatus and computer equipment
CN108111491A (en) * 2017-12-07 2018-06-01 浙江大学 A kind of cell phone application voice storage management system based on artificial intelligence
CN109166569A (en) * 2018-07-25 2019-01-08 北京海天瑞声科技股份有限公司 The detection method and device that phoneme accidentally marks
US20190251963A1 (en) * 2018-02-09 2019-08-15 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device
CN110176251A (en) * 2019-04-03 2019-08-27 苏州驰声信息科技有限公司 A kind of acoustic data automatic marking method and device
CN110277089A (en) * 2019-07-09 2019-09-24 广东美的制冷设备有限公司 Update method, household electrical appliance and the server of offline speech recognition modeling
CN110473525A (en) * 2019-09-16 2019-11-19 百度在线网络技术(北京)有限公司 The method and apparatus for obtaining voice training sample


Similar Documents

Publication Publication Date Title
US10819811B2 (en) Accumulation of real-time crowd sourced data for inferring metadata about entities
CN102568478B (en) Video play control method and system based on voice recognition
CN111261144B (en) Voice recognition method, device, terminal and storage medium
WO2019096056A1 (en) Speech recognition method, device and system
CN110459222A (en) Sound control method, phonetic controller and terminal device
CN111508511A (en) Real-time sound changing method and device
CN110970014A (en) Voice conversion, file generation, broadcast, voice processing method, device and medium
CN110992937B (en) Language off-line identification method, terminal and readable storage medium
US11030994B2 (en) Selective activation of smaller resource footprint automatic speech recognition engines by predicting a domain topic based on a time since a previous communication
CN111312233A (en) Voice data identification method, device and system
CN109977426A (en) A kind of training method of translation model, device and machine readable media
CN111800720B (en) Digital hearing aid parameter adjusting method and device based on big data and cloud space
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
WO2020233381A1 (en) Speech recognition-based service request method and apparatus, and computer device
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN113066482A (en) Voice model updating method, voice data processing method, voice model updating device, voice data processing device and storage medium
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN111198733A (en) Startup picture display method, terminal device and storage medium
Jin et al. Man-machine dialogue system optimization based on cloud computing
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
CN110556099B (en) Command word control method and device
CN112837688B (en) Voice transcription method, device, related system and equipment
CN115841814A (en) Voice interaction method and electronic equipment
CN113076397A (en) Intention recognition method and device, electronic equipment and storage medium
CN112233660A (en) User portrait expansion method, device, controller and user portrait acquisition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination