WO2020182042A1 - Keyword sample determining method, voice recognition method and apparatus, device, and medium - Google Patents

Keyword sample determining method, voice recognition method and apparatus, device, and medium

Info

Publication number
WO2020182042A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
voice
sample
speech
recognition
Prior art date
Application number
PCT/CN2020/077912
Other languages
French (fr)
Chinese (zh)
Inventor
李敬
Original Assignee
广州市网星信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市网星信息技术有限公司 filed Critical 广州市网星信息技术有限公司
Publication of WO2020182042A1 publication Critical patent/WO2020182042A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • The embodiments of this application relate to the field of speech recognition technology, for example, to a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium.
  • KWS technology recognizes keywords carried in speech using neural-network-based approaches. This requires collecting a large amount of audio data containing predefined keywords and non-keywords, with which the parameters of the constructed neural network are trained, validated, and tested, so that the network can accurately recognize the keyword information in the user's voice.
  • In the related art, the keyword training set is obtained by manually recording the corresponding keyword speech to collect a large amount of audio data, which is costly and requires that the recording environment of the collected audio data be consistent with the actual environment in which the predefined keywords occur, so the generation of keyword samples of multiple types is limited.
  • The embodiments of the present application provide a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium, so as to improve the comprehensiveness of keyword sample determination and enhance the accuracy of speech recognition.
  • An embodiment of the application provides a method for determining a keyword sample, and the method includes:
  • acquiring a keyword;
  • acquiring target speech samples including the keyword from an existing speech recognition sample library;
  • determining the keyword speech segment in the target speech sample to obtain a keyword sample.
  • An embodiment of the present application provides a speech recognition method, which includes:
  • acquiring the user's voice instruction;
  • recognizing the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determination method;
  • triggering an operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a keyword sample determining apparatus, which includes:
  • a keyword acquisition module, configured to acquire a keyword;
  • a target speech acquisition module, configured to acquire target speech samples including the keyword from an existing speech recognition sample library;
  • a keyword sample determining module, configured to determine the keyword speech segment in the target speech sample to obtain a keyword sample.
  • An embodiment of the present application provides a speech recognition apparatus, which includes:
  • a voice instruction acquisition module, configured to acquire the user's voice instruction;
  • a keyword recognition module, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus;
  • an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a device, which includes:
  • one or more processors; and
  • a storage apparatus, configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword sample determination method described in this application, or implement the speech recognition method described in this application.
  • An embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the keyword sample determination method described in this application is implemented, or the speech recognition method described in this application is implemented.
  • FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application;
  • FIG. 1B is a schematic diagram of the principle of determining keyword samples in the method provided in Embodiment 1 of this application;
  • FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;
  • FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application;
  • FIG. 2C is a schematic diagram of the waveform of the audio data in a voice sample in the method provided in Embodiment 2 of this application;
  • FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application;
  • FIG. 3B is a schematic diagram of the principle of the speech recognition process in the method provided in Embodiment 3 of this application;
  • FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application;
  • FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application;
  • FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
  • Voice interaction control by recognizing keywords carried in the user's voice is widely used in the field of speech recognition, and the keyword here can be any keyword the user is interested in in daily life. However, publicly available keyword data sets are generally only the keywords released by some companies or institutions for scientific research; they cannot match the keywords of interest in daily life, and it is difficult to find a speech data set for a keyword of interest.
  • Compared with keyword spotting, the training data sets available for general speech recognition are much richer. Therefore, in the embodiments of this application, an existing speech recognition sample library is used to find target speech samples containing the specified keyword, and the corresponding keyword speech segments are cut out of those target speech samples to obtain keyword samples.
  • The keyword recognition model trained on these keyword samples can then recognize the keywords contained in the corresponding user's voice, which improves the accuracy of speech recognition.
  • FIG. 1A is a flowchart of a keyword sample determination method according to Embodiment 1 of this application.
  • This embodiment can be applied to any situation where keyword samples for model training need to be determined.
  • The solution of the embodiment of the present application is applicable to solving the problem that keyword samples are costly to acquire and limited in coverage.
  • The keyword sample determination method provided in this embodiment can be executed by the keyword sample determining apparatus provided in the embodiments of this application; the apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method.
  • The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.
  • In an embodiment, referring to FIG. 1A, the method may include the following steps:
  • S110: Acquire a keyword.
  • The keyword refers to any word, set in advance by the developer according to the voice interaction requirements, that users are interested in in daily life; recognizing the keyword in the user's voice allows the corresponding trigger operation to be executed.
  • In an embodiment, when voice interaction control is performed through keyword spotting, the developer specifies a keyword according to the development requirements of the voice interaction, which indicates that the corresponding trigger operation is achieved through that keyword. The developer then inputs the specified keyword into the device that executes the keyword sample determination method of this embodiment, so that the device obtains the keyword predefined by the developer, the corresponding keyword samples can be automatically generated later, and the keyword recognition model can subsequently be trained.
  • S120: Acquire target voice samples including the keyword from an existing voice recognition sample library.
  • Because speech recognition technology has been studied by developers in many fields for longer than keyword spotting, the speech data contained in its training data sets is much richer. Here, the speech recognition sample library refers to a database, pre-built during the development of speech recognition technology, that stores a large number of user voices from multiple fields, that is, the large-vocabulary sample collections covering multiple scenarios provided by a Large Vocabulary Continuous Speech Recognition (LVCSR) system.
  • For example, the speech recognition sample library in this embodiment may be a speech recognition tool library, such as the speech toolkits provided with speech recognition frameworks such as Kaldi, Sphinx, or HTK.
  • Optionally, when the keyword pre-specified by the developer is obtained, target voice samples including the keyword can be selected, according to the keyword, from the existing speech recognition sample library, that is, from the large-vocabulary sample collection of user voices in multiple scenarios provided by the LVCSR system.
  • Because speech recognition technology is studied and used by developers in many fields and many scenarios, the existing speech recognition sample library includes a large number of user voices in multiple scenarios, so the obtained target voice samples are diverse samples from multiple scenarios, and the number of target voice samples obtained from the existing speech recognition sample library is large enough to later build a training sample set for training the keyword recognition model.
  • Optionally, as shown in FIG. 1B, in this embodiment, acquiring target voice samples including the keyword from an existing voice recognition sample library may include: searching the existing voice recognition sample library for voice samples whose annotation data includes the keyword, and using the found voice samples as the target voice samples.
  • In an embodiment, each voice sample contained in the existing voice recognition sample library can be composed of two parts: audio data and annotation data. The audio data represents sound-signal characteristics of the user's voice in the voice sample, such as frequency, amplitude change, and duration;
  • each piece of audio data can be displayed by recording the sound waveform of the corresponding user's voice, and the annotation data can be the number (identifier) of the sample and the text of the user's voice content.
  • When the specified keyword is obtained, the existing speech recognition sample library can be queried: by traversing every voice sample contained in the library, the annotation data of each speech sample is parsed to determine whether it contains the specified keyword.
  • Take the Kaldi speech recognition framework as an example to illustrate the search process.
  • A large number of public speech recognition sample libraries are provided with the Kaldi framework, such as the Chinese aishell and thchs30 sample libraries and the English wsj and librispeech sample libraries.
  • Such an existing speech recognition sample library contains a large number of speech samples, each composed of audio data and annotation data.
  • An annotation data entry looks like "BAC009S0002W0130 财政金融政策紧随其后而来" ("fiscal and financial policies follow immediately"); here, "BAC009S0002W0130" is the number of the voice sample to which the annotation data belongs and establishes the matching relationship between the annotation data and the voice sample, and the remaining text records the textual content of that voice sample.
  • In an embodiment, if the acquired keyword is "金融" ("finance"), the existing voice recognition sample library is queried, the annotation data of the voice samples contained in it is traversed, and the voice samples whose annotation data includes the keyword "金融" are extracted, such as the voice sample in the example above.
  • The voice samples found in this way are used as the target voice samples.
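  • The search can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example assuming the annotation data is stored in a Kaldi-style `text` file in which each line is an utterance number followed by the transcript; the file path and keyword are illustrative only.

```python
def find_target_samples(text_path, keyword):
    """Return (utterance_id, transcript) pairs whose annotation data contains the keyword."""
    targets = []
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Kaldi-style annotation data: "<utterance-id> <transcript>"
            utt_id, _, transcript = line.partition(" ")
            if keyword in transcript:
                targets.append((utt_id, transcript))
    return targets

# Illustrative usage: search an aishell-style annotation file for the keyword "金融" (finance)
targets = find_target_samples("data/aishell/train/text", "金融")
```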
  • S130: Determine the keyword voice segment in the target voice sample to obtain a keyword sample.
  • The keyword voice segment is the part of the voice sample that carries only the speech of the specified keyword and no speech of any other content.
  • In an embodiment, after the target voice sample is acquired, this embodiment recognizes the target voice sample with a specific speech recognition technique to obtain a recognition result representing the speech feature information of the target voice sample, determines from the recognition result the speech range in which the keyword contained in the target voice sample lies, and thereby determines the corresponding keyword voice segment in the target voice sample and cuts that segment out of the corresponding speech range.
  • The keyword voice segment then contains only the content and sound feature information of the keyword and no information other than the keyword, so the keyword voice segment is used as the keyword sample in this embodiment.
  • In an embodiment, by traversing every voice sample in the existing voice recognition sample library, a large number of target voice samples whose annotation data includes the specified keyword can be obtained across multiple scenarios, so the number of keyword voice segments determined from the target voice samples is also large enough; keyword samples in multiple scenarios can therefore be obtained, so that the corresponding keyword recognition model can subsequently be trained with keyword samples from multiple scenarios.
  • The technical solution provided in this embodiment acquires target voice samples containing the keyword from an existing voice recognition sample library and cuts the keyword voice segments out of the target voice samples to obtain keyword samples.
  • Because the existing voice recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well.
  • Diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;
  • FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application. This embodiment is based on the technical solution provided in the foregoing embodiment and explains the process of determining the keyword voice segment in the target voice sample.
  • Optionally, as shown in FIG. 2A, this embodiment may include the following steps:
  • S210: Acquire a keyword.
  • S220: Acquire target voice samples including the keyword from an existing voice recognition sample library.
  • S230: Determine the start time point and the end time point of the phonemes of the keyword within the audio data of the target voice sample.
  • A phoneme is the smallest phonetic unit divided according to speech attributes and can be analyzed according to the pronunciation actions in the user's voice; the phonemes in this embodiment may be the initials and finals that make up the speech.
  • In this embodiment, a corresponding number is set in advance for every existing phoneme and stored in a phoneme table, so that the target speech sample can subsequently be recognized according to the number of each phoneme.
  • Since the audio data of the target voice sample is data representing sound-signal characteristics such as the frequency, amplitude change, and duration of the user's voice, that is, speech data lasting a period of time, every word uttered by the user in the audio data is matched to a corresponding start and end time range.
  • The start time point refers to the time point at which the user starts to pronounce the keyword in the audio data of the target voice sample,
  • and the end time point refers to the time point at which the user finishes pronouncing the keyword in the audio data of the target voice sample.
  • In an embodiment, when a target voice sample whose annotation data includes the keyword is obtained, speech recognition is performed on the audio data that constitutes the target voice sample. Since the audio data is sound feature data lasting a period of time and is a quasi-steady-state speech signal, the framing of the audio data is determined before recognition; the speech frame length is generally set to 20 ms to 30 ms, and in this embodiment it is 20 ms. The phonemes contained in the audio data of each speech frame are then recognized.
  • The audio data of the target speech sample is recognized according to the preset phoneme numbers and the speech frame length to obtain a phoneme recognition result, and the range over which the phonemes of the keyword occur in that result, that is, the starting point and ending point of the keyword's phonemes in the phoneme recognition result, is determined.
  • Then, according to the set speech frame length and the number of phoneme frames corresponding to the starting point and ending point in the phoneme recognition result, the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are determined.
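  • As a minimal sketch (not the patent's own implementation), the start and end time points can be read off a frame-level phoneme alignment as follows; the alignment is assumed to be a list with one phoneme number per 20 ms frame, and the phoneme numbers follow the example below (silence = 1, j = 17, in = 23, r = 18, ong = 27).

```python
FRAME_LEN_S = 0.02  # 20 ms per speech frame, as assumed in this embodiment

def keyword_time_span(frame_phonemes, keyword_phonemes, silence_id=1):
    """Locate the keyword's phoneme sequence in a frame-level alignment and return
    (start_time, end_time) in seconds; silence frames inside the keyword are tolerated."""
    n = len(frame_phonemes)
    for start in range(n):
        pos, k = start, 0
        while pos < n and k < len(keyword_phonemes):
            if frame_phonemes[pos] == keyword_phonemes[k]:
                # consume all consecutive frames of the current keyword phoneme
                while pos < n and frame_phonemes[pos] == keyword_phonemes[k]:
                    pos += 1
                k += 1
            elif k > 0 and frame_phonemes[pos] == silence_id:
                pos += 1  # short silence between characters of the keyword
            else:
                break
        if k == len(keyword_phonemes):
            return start * FRAME_LEN_S, pos * FRAME_LEN_S
    return None  # keyword phonemes not found in this alignment

# With phoneme "j" starting at the 63rd frame (index 62), the example below yields (1.24, 1.66).
```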
  • For example, suppose the keyword is "金融" ("finance") and the waveform of the corresponding audio data is as shown in FIG. 2C.
  • The phonemes corresponding to the keyword "金融" are j, in, r, and ong.
  • There may be a short period of silence between the two characters when the user is speaking, so there will be some silence between "金" and "融" in the keyword contained in the audio data.
  • Suppose the preset number of the silence phoneme is "1", the number of j is "17", the number of in is "23", the number of r is "18", the number of ong is "27", and the speech frame length is 20 ms, so that each number in the phoneme recognition result corresponds to one speech frame.
  • In the recognition result, the number 17 of the phoneme "j" corresponding to "金" occupies 4 frames, the number 23 of "in" occupies 7 frames, the number 18 of the phoneme "r" corresponding to "融" occupies 3 frames, and the number 27 of "ong" occupies 6 frames, with 1 silence frame between the two characters.
  • The first frame of the phoneme "j" corresponding to "金" is the 63rd frame of the entire phoneme recognition result, so the start time point in the audio data of the voice sample is 62 × 20 ms = 1.24 s, and the end time point, 21 frames later, is 1.66 s.
  • S240: According to the start time point and the end time point, cut out the audio data between them to obtain the keyword voice segment.
  • In an embodiment, after the start time point and end time point are determined, the audio data segment located between them can be cut out; that is, in the audio data of the target voice sample in the example above, the audio data segment between 1.24 s and 1.66 s is cut out, or equivalently an audio data segment with a duration of 0.42 s starting from 1.24 s is cut out, and this segment is used as the keyword voice segment in this embodiment. At this point the keyword voice segment contains only the speech information of the keyword "金融" ("finance").
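  • A minimal interception sketch, assuming the audio data is stored as a WAV file and using the third-party soundfile package (the patent itself does not name a library); the file name is illustrative.

```python
import soundfile as sf

def cut_keyword_segment(wav_path, start_s, end_s):
    """Read a WAV file and return the samples between start_s and end_s (in seconds)."""
    audio, sample_rate = sf.read(wav_path)
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    return audio[start:end], sample_rate

# Example from the text: cut the 1.24 s .. 1.66 s span out of the target voice sample
segment, sr = cut_keyword_segment("BAC009S0002W0130.wav", 1.24, 1.66)
```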
  • S250: Pad silence data of a preset length before the start time point and after the end time point of the keyword voice segment to obtain the keyword sample.
  • In an embodiment, silence data of a preset length can be padded at the positions before and after the obtained keyword voice segment.
  • The silence data in this embodiment may be data of value "0" with a length of the preset speech frame, so as to obtain an independent keyword sample that is easy to distinguish from other voice samples later.
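  • A corresponding padding sketch: one speech frame (20 ms) of zero-valued samples is prepended and appended to the keyword voice segment, which is one possible reading of the "data 0 of the preset speech frame length" mentioned above; the output file name is illustrative.

```python
import numpy as np
import soundfile as sf

def pad_with_silence(segment, sample_rate, pad_s=0.02):
    """Prepend and append pad_s seconds of zero samples to a keyword voice segment."""
    silence = np.zeros(int(round(pad_s * sample_rate)), dtype=segment.dtype)
    return np.concatenate([silence, segment, silence])

keyword_sample = pad_with_silence(segment, sr)            # `segment`, `sr` from the previous sketch
sf.write("keyword_finance_0001.wav", keyword_sample, sr)  # illustrative output path
```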
  • Take the aishell speech recognition sample library as an example: it contains 178 hours of speech samples from 400 speakers in multiple fields, and a total of 610 target speech samples containing the keyword "金融" ("finance") can be found in it.
  • By performing keyword interception on the 610 target voice samples found with the keyword sample determination method of this example, 610 keyword samples of the keyword "finance" can be obtained, giving a diversified keyword sample set and creating the conditions for the subsequent training of the keyword recognition model.
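  • Combining the sketches above, the found target voice samples could be turned into keyword samples with a loop like the following; the directory layout and the precomputed frame-level alignments are assumptions for illustration.

```python
def build_keyword_samples(targets, alignments, keyword_phoneme_ids, wav_dir, out_dir):
    """For every target sample, cut out the keyword span and save a silence-padded keyword sample.
    `alignments` is assumed to map utterance id -> per-frame phoneme numbers."""
    for idx, (utt_id, _transcript) in enumerate(targets):
        span = keyword_time_span(alignments[utt_id], keyword_phoneme_ids)
        if span is None:
            continue  # alignment did not contain the keyword's phoneme sequence
        start_s, end_s = span
        segment, sr = cut_keyword_segment(f"{wav_dir}/{utt_id}.wav", start_s, end_s)
        sf.write(f"{out_dir}/keyword_{idx:04d}.wav", pad_with_silence(segment, sr), sr)
```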
  • The technical solution provided in this embodiment determines the start time point and the end time point of the keyword's phonemes in the audio data of the target speech sample and cuts the keyword speech segment lying between those time points out of the audio data to obtain keyword samples, ensuring the diversity of the keyword samples. There is no need to generate keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness and accuracy of keyword sample determination.
  • FIG. 3A is a flowchart of a voice recognition method provided in Embodiment 3 of this application.
  • This embodiment can be applied to any situation of recognizing keywords included in a user's voice instruction.
  • The solution of the embodiment of the present application is applicable to simplifying the otherwise cumbersome training process of the keyword recognition model.
  • The speech recognition method provided in this embodiment can be executed by the speech recognition apparatus provided in the embodiments of this application.
  • The apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method.
  • The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.
  • Optionally, as shown in FIG. 3A, this embodiment may include the following steps:
  • S310: Acquire the user's voice instruction.
  • When the user needs an operation to be performed, the user utters a voice carrying the keyword corresponding to that operation; on receiving the voice, the device generates a corresponding voice instruction, and the voice instruction carries the corresponding keyword.
  • In an embodiment, the matching relationships between multiple keywords and different operations are preset according to the application scenario. For example, in a short-video application, different predefined keywords can be mapped to different video effects, and in a live-broadcast application, predefined keywords can be set to present corresponding gifts in the live broadcast room.
  • S320: Recognize the keyword in the voice instruction through the keyword recognition model.
  • The keyword recognition model is trained in advance with keyword samples determined by the keyword sample determination method provided in the embodiments of the present application.
  • In an embodiment, the keyword pre-specified by the user is acquired, every voice sample included in the existing voice recognition sample library is queried to determine whether its annotation data includes the specified keyword, the voice samples whose annotation data includes the specified keyword are used as target voice samples, the start time point and end time point of the keyword's phonemes in the audio data of each target voice sample are determined according to the keyword's phonemes, and the audio data segment between the start time point and the end time point is cut out as the keyword voice segment, so that a large number of keyword samples are obtained.
  • A corresponding keyword sample library is then generated.
  • The keyword sample library contains, for each of the keywords specified by the user, keyword samples from different scenarios and different users that contain only the keyword speech.
  • The large number of keyword samples contained in the keyword sample library can then be used to train the preset keyword recognition model.
  • The keyword recognition results corresponding to the keyword samples are obtained, and the classification loss of the current recognition is computed.
  • If the classification loss exceeds the preset loss threshold, the keyword recognition model is adjusted according to the classification loss, further keyword samples of the same keyword continue to be obtained and fed into the adjusted keyword recognition model for recognition, until the obtained classification loss no longer exceeds the preset loss threshold.
  • Then the keyword samples of the next keyword in the keyword sample library are obtained and training continues in the same way, until the keyword samples of every keyword contained in the keyword sample library have been used for training, and the final keyword recognition model is obtained.
  • The keyword recognition model can then accurately recognize the keywords in any speech.
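  • The training procedure described above can be sketched roughly as follows. This is a hypothetical PyTorch-style loop, not the patent's actual training code: the model, the feature extraction, and the loss threshold value are all assumptions, and the loop follows the text literally by updating on a keyword's samples until the classification loss no longer exceeds the threshold.

```python
import torch.nn.functional as F

def train_keyword_model(model, optimizer, keyword_batches, loss_threshold=0.05):
    """keyword_batches yields (features, labels) batches, grouped keyword by keyword."""
    model.train()
    for features, labels in keyword_batches:           # e.g. log-mel features and keyword ids
        loss = F.cross_entropy(model(features), labels)
        while loss.item() > loss_threshold:            # adjust the model until the loss is low enough
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss = F.cross_entropy(model(features), labels)
    return model
```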
  • When a voice instruction is acquired, it can be input into the pre-trained keyword recognition model, and the keyword recognition model parses the voice instruction to accurately recognize the keyword carried in it, so that the corresponding operation can be performed according to the keyword.
  • S330: Trigger the operation corresponding to the keyword according to the keyword.
  • In an embodiment, after the keyword carried in the voice instruction is recognized, the keyword is analyzed to determine the operation matching it, and the execution of that operation is triggered to achieve the corresponding voice interaction control.
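  • A trivial sketch of the keyword-to-operation matching; the keywords, effect names, and handler functions are hypothetical placeholders for the short-video and live-broadcast scenarios mentioned above.

```python
def apply_video_effect(name):
    print(f"applying video effect: {name}")            # placeholder for the real effect call

def present_gift(name):
    print(f"presenting gift in the live room: {name}")  # placeholder for the real gift call

# Preset matching relationships between keywords and operations (illustrative only)
OPERATIONS = {
    "snow": lambda: apply_video_effect("snow"),
    "rocket": lambda: present_gift("rocket"),
}

def trigger_operation(keyword):
    """Trigger the operation matched to a recognized keyword, if one is configured."""
    action = OPERATIONS.get(keyword)
    if action is not None:
        action()
```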
  • The technical solution provided in this embodiment trains the preset keyword recognition model with the keyword samples determined by the keyword sample determination method described above, so that the keyword recognition model can accurately recognize the keyword carried in a voice instruction and then trigger the execution of the corresponding operation according to the recognized keyword, simplifying the cumbersome collection of keyword samples during model training and reducing the cost of acquiring keyword samples.
  • The keyword recognition model obtained through this keyword sample training recognizes the keywords carried in the corresponding user's voice and improves the accuracy of speech recognition.
  • FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application. As shown in FIG. 4, the apparatus may include:
  • a keyword acquisition module 410, configured to acquire a keyword;
  • a target voice acquisition module 420, configured to acquire target voice samples including the keyword from an existing voice recognition sample library;
  • a keyword sample determining module 430, configured to determine the keyword voice segment in the target voice sample to obtain a keyword sample.
  • The technical solution provided in this embodiment acquires target speech samples containing the keyword from an existing speech recognition sample library and cuts the keyword speech segments out of the target speech samples to obtain keyword samples.
  • Because the existing speech recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well.
  • Diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • The keyword sample determining apparatus provided in this embodiment can execute the keyword sample determination method provided in any embodiment of the present application and has corresponding functions and effects.
  • FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application. As shown in FIG. 5, the apparatus may include:
  • a voice instruction acquisition module 510, configured to acquire the user's voice instruction;
  • a keyword recognition module 520, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus provided in the above-mentioned embodiment;
  • an operation trigger module 530, configured to trigger the operation corresponding to the keyword according to the keyword.
  • The technical solution provided in this embodiment trains the preset keyword recognition model with the keyword samples determined by the above keyword sample determining apparatus, so that the keyword recognition model can accurately recognize the keyword carried in a voice instruction and then trigger the execution of the corresponding operation according to the recognized keyword, simplifying the cumbersome collection of keyword samples during model training and reducing the cost of acquiring keyword samples.
  • The keyword recognition model obtained through this keyword sample training recognizes the keywords carried in the corresponding user's voice and improves the accuracy of speech recognition.
  • The speech recognition apparatus provided in this embodiment can execute the speech recognition method provided in any embodiment of this application and has corresponding functions and effects.
  • FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
  • The device includes a processor 60, a storage apparatus 61, and a communication apparatus 62; the number of processors 60 in the device may be one or more.
  • In FIG. 6, one processor 60 is taken as an example; the processor 60, the storage apparatus 61, and the communication apparatus 62 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 6.
  • The device provided in this embodiment can be used to execute the keyword sample determination method or the speech recognition method provided in any of the foregoing embodiments and has corresponding functions and effects.
  • Embodiment 7 of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the keyword sample determination method in any of the foregoing embodiments can be implemented;
  • the method may include: acquiring a keyword; acquiring target voice samples including the keyword from an existing voice recognition sample library; and determining the keyword voice segment in the target voice sample to obtain a keyword sample.
  • Alternatively, the speech recognition method in any of the foregoing embodiments can be implemented; the method may include: acquiring the user's voice instruction; recognizing the keyword in the voice instruction through the keyword recognition model; and triggering the operation corresponding to the keyword according to the keyword.
  • An embodiment of the application provides a storage medium containing computer-executable instructions.
  • The computer-executable instructions are not limited to the method operations described above and can also execute relevant operations in the keyword sample determination method or the speech recognition method provided by any embodiment of the application.
  • This application can be implemented with the help of software and necessary general-purpose hardware, or can be implemented with hardware.
  • This application can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes at least one instruction to enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
  • The units and modules included in the above apparatus embodiments are only divided according to functional logic and are not limited to this division, as long as the corresponding functions can be realized;
  • the names of the functional units are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Disclosed are a keyword sample determining method, a voice recognition method and apparatus, a device, and a medium. The keyword sample determining method comprises: obtaining a keyword; obtaining, from an existing voice recognition sample library, a target voice sample that comprises the keyword; and determining a keyword voice segment in the target voice sample so as to obtain a keyword sample.

Description

Keyword sample determination method, voice recognition method, apparatus, device, and medium

This application claims priority to the Chinese patent application with application number 201910189413.1, filed with the Chinese Patent Office on March 13, 2019, the entire content of which is incorporated herein by reference.

Technical Field

The embodiments of this application relate to the field of speech recognition technology, for example, to a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium.

Background

With the increasing number of smart speakers on the market, related technologies in the field of speech recognition have been greatly developed and applied. Key Word Spotting (KWS) technology in speech recognition, as the basis of voice interaction control, has also been widely used.

KWS technology recognizes keywords carried in speech using neural-network-based approaches. This requires collecting a large amount of audio data containing predefined keywords and non-keywords, with which the parameters of the constructed neural network are trained, validated, and tested, so that the network can accurately recognize the keyword information in the user's voice.

In the related art, the keyword training set is obtained by manually recording the corresponding keyword speech to collect a large amount of audio data, which is costly and requires that the recording environment of the collected audio data be consistent with the actual environment in which the predefined keywords occur, so the generation of keyword samples of multiple types is limited.
Summary of the Invention

The embodiments of the present application provide a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium, so as to improve the comprehensiveness of keyword sample determination and enhance the accuracy of speech recognition.

An embodiment of the application provides a method for determining a keyword sample, and the method includes:

acquiring a keyword;

acquiring target speech samples including the keyword from an existing speech recognition sample library;

determining the keyword speech segment in the target speech sample to obtain a keyword sample.

An embodiment of the present application provides a speech recognition method, which includes:

acquiring the user's voice instruction;

recognizing the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determination method;

triggering an operation corresponding to the keyword according to the keyword.

An embodiment of the present application provides a keyword sample determining apparatus, which includes:

a keyword acquisition module, configured to acquire a keyword;

a target speech acquisition module, configured to acquire target speech samples including the keyword from an existing speech recognition sample library;

a keyword sample determining module, configured to determine the keyword speech segment in the target speech sample to obtain a keyword sample.

An embodiment of the present application provides a speech recognition apparatus, which includes:

a voice instruction acquisition module, configured to acquire the user's voice instruction;

a keyword recognition module, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus;

an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.

An embodiment of the present application provides a device, which includes:

one or more processors; and

a storage apparatus, configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword sample determination method described in this application, or implement the speech recognition method described in this application.

An embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the keyword sample determination method described in this application is implemented, or the speech recognition method described in this application is implemented.
Brief Description of the Drawings

FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application;

FIG. 1B is a schematic diagram of the principle of determining keyword samples in the method provided in Embodiment 1 of this application;

FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;

FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application;

FIG. 2C is a schematic diagram of the waveform of the audio data in a voice sample in the method provided in Embodiment 2 of this application;

FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application;

FIG. 3B is a schematic diagram of the principle of the speech recognition process in the method provided in Embodiment 3 of this application;

FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application;

FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application;

FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
Detailed Description

The application is described below with reference to the drawings and embodiments. The drawings show only a part, not all, of the structures related to this application. In addition, where there is no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

Voice interaction control by recognizing keywords carried in the user's voice is widely used in the field of speech recognition, and the keyword here can be any keyword the user is interested in in daily life. However, publicly available keyword data sets are generally only the keywords released by some companies or institutions for scientific research; they cannot match the keywords of interest in daily life, and it is difficult to find a speech data set for a keyword of interest. Compared with keyword spotting, the training data sets available for general speech recognition are much richer. Therefore, in the embodiments of this application, an existing speech recognition sample library is used to find target speech samples containing the specified keyword, and the corresponding keyword speech segments are cut out of those target speech samples to obtain keyword samples. There is no need to determine keyword samples by recording the keyword voices of multiple users in many different real scenarios, which reduces the acquisition cost of keyword samples, improves the comprehensiveness of keyword sample determination, and effectively reduces the workload of determining keyword samples. The keyword recognition model trained on these keyword samples can recognize the keywords contained in the corresponding user's voice, which improves the accuracy of speech recognition.
Embodiment 1

FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application. This embodiment can be applied to any situation where keyword samples for model training need to be determined. The solution of the embodiment of the present application is applicable to solving the problem that keyword samples are costly to acquire and limited in coverage. The keyword sample determination method provided in this embodiment can be executed by the keyword sample determining apparatus provided in the embodiments of this application; the apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method. The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.

In an embodiment, referring to FIG. 1A, the method may include the following steps:

S110: Acquire a keyword.

The keyword refers to any word, set in advance by the developer according to the voice interaction requirements, that users are interested in in daily life; recognizing the keyword in the user's voice allows the corresponding trigger operation to be executed.

In an embodiment, when voice interaction control is performed through keyword spotting, the developer specifies a keyword according to the development requirements of the voice interaction, which indicates that the corresponding trigger operation is achieved through that keyword. The developer then inputs the specified keyword into the device that executes the keyword sample determination method of this embodiment, so that the device obtains the keyword predefined by the developer, the corresponding keyword samples can be automatically generated later, and the keyword recognition model can subsequently be trained.

S120: Acquire target voice samples including the keyword from an existing voice recognition sample library.

Because speech recognition technology has been studied by developers in many fields for longer than keyword spotting, the speech data contained in its training data sets is much richer. Here, the speech recognition sample library refers to a database, pre-built during the development of speech recognition technology, that stores a large number of user voices from multiple fields, that is, the large-vocabulary sample collections covering multiple scenarios provided by a Large Vocabulary Continuous Speech Recognition (LVCSR) system. For example, the speech recognition sample library in this embodiment may be a speech recognition tool library, such as the speech toolkits provided with speech recognition frameworks such as Kaldi, Sphinx, or HTK.

Optionally, when the keyword pre-specified by the developer is obtained, target voice samples including the keyword can be selected, according to the keyword, from the existing speech recognition sample library, that is, from the large-vocabulary sample collection of user voices in multiple scenarios provided by the LVCSR system. Because speech recognition technology is studied and used by developers in many fields and many scenarios, the existing speech recognition sample library includes a large number of user voices in multiple scenarios, so the obtained target voice samples are diverse samples from multiple scenarios, and the number of target voice samples obtained from the existing speech recognition sample library is large enough to later build a training sample set for training the keyword recognition model.

Optionally, as shown in FIG. 1B, in this embodiment, acquiring target voice samples including the keyword from an existing voice recognition sample library may include: searching the existing voice recognition sample library for voice samples whose annotation data includes the keyword, and using the found voice samples as the target voice samples.

In an embodiment, each voice sample contained in the existing voice recognition sample library can be composed of two parts: audio data and annotation data. The audio data represents sound-signal characteristics of the user's voice in the voice sample, such as frequency, amplitude change, and duration, and each piece of audio data can be displayed by recording the sound waveform of the corresponding user's voice; the annotation data can be the number (identifier) of the sample and the text of the user's voice content. When the specified keyword is obtained, the existing speech recognition sample library can be queried: by traversing every voice sample contained in the library, the annotation data of each speech sample is parsed to determine whether it contains the specified keyword, so that voice samples whose annotation data includes the specified keyword are found, voice samples whose annotation data does not include the specified keyword are ignored, and the found voice samples are used as the target voice samples for subsequent keyword analysis.

Take the Kaldi speech recognition framework as an example to illustrate the search process. A large number of public speech recognition sample libraries are provided with the Kaldi framework, such as the Chinese aishell and thchs30 sample libraries and the English wsj and librispeech sample libraries. Such an existing speech recognition sample library contains a large number of speech samples, each composed of audio data and annotation data; an annotation data entry looks like "BAC009S0002W0130 财政金融政策紧随其后而来" ("fiscal and financial policies follow immediately"), where "BAC009S0002W0130" is the number of the voice sample to which the annotation data belongs and establishes the matching relationship between the annotation data and the voice sample, and the remaining text records the textual content of that voice sample. In an embodiment, if the acquired keyword is "金融" ("finance"), the existing voice recognition sample library is queried, the annotation data of the voice samples contained in it is traversed, and the voice samples whose annotation data includes the keyword "金融" are extracted, such as the voice sample in the example above; the found voice samples are used as the target voice samples. In this way, a large number of target voice samples containing the keyword "金融" in multiple scenarios can be obtained from the public speech recognition sample libraries provided with the Kaldi framework, and the target voice samples are subsequently processed to obtain the corresponding keyword speech in multiple scenarios.
S130: Determine the keyword voice segment in the target voice sample to obtain a keyword sample.

The keyword voice segment is the part of the voice sample that carries only the speech of the specified keyword and no speech of any other content.

In an embodiment, after the target voice sample is acquired, this embodiment recognizes the target voice sample with a specific speech recognition technique to obtain a recognition result representing the speech feature information of the target voice sample, determines from the recognition result the speech range in which the keyword contained in the target voice sample lies, and thereby determines the corresponding keyword voice segment in the target voice sample and cuts that segment out of the corresponding speech range. The keyword voice segment then contains only the content and sound feature information of the keyword and no information other than the keyword, so the keyword voice segment is used as the keyword sample in this embodiment.

In an embodiment, by traversing every voice sample in the existing voice recognition sample library, a large number of target voice samples whose annotation data includes the specified keyword can be obtained across multiple scenarios, so the number of keyword voice segments determined from the target voice samples is also large enough; keyword samples in multiple scenarios can therefore be obtained, so that the corresponding keyword recognition model can subsequently be trained with keyword samples from multiple scenarios.

The technical solution provided in this embodiment acquires target voice samples containing the keyword from an existing voice recognition sample library and cuts the keyword voice segments out of the target voice samples to obtain keyword samples. Because the existing voice recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well; diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
Embodiment 2
FIG. 2A is a flowchart of a keyword sample determining method provided in Embodiment 2 of this application, and FIG. 2B is a schematic diagram of the principle of the keyword sample determining process provided in Embodiment 2. This embodiment is based on the technical solution provided by the foregoing embodiment and explains the process of determining the keyword speech segment in the target speech sample.
Optionally, as shown in FIG. 2A, this embodiment may include the following steps:
S210: Acquire a keyword.
S220: Acquire a target speech sample including the keyword from an existing speech recognition sample library.
S230: Determine a start time point and an end time point of the phonemes of the keyword within the phonemes of the audio data of the target speech sample.
Here, a phoneme is the smallest phonetic unit divided according to phonetic properties and can be analyzed from the articulatory actions of the user's speech; in this embodiment the phonemes may be the initials and finals that make up the speech. In this embodiment, a corresponding number is assigned in advance to every phoneme and stored in a phoneme table, so that the target speech sample can later be recognized according to the number of each phoneme. Meanwhile, because the audio data of the target speech sample represents sound signal characteristics such as the frequency, amplitude variation, and duration of the user's voice, that is, speech data lasting for a period of time, every word uttered by the user in the audio data is matched with a corresponding start-end time range. The start time point is the moment in the audio data of the target speech sample at which the user begins to utter the keyword, and the end time point is the moment at which the user finishes uttering it.
In one embodiment, when a target speech sample whose annotation data includes the keyword is acquired, speech recognition is performed on the audio data composing the target speech sample. Because the audio data is sound feature data that lasts for a period of time and is a quasi-stationary speech signal, the framing of the audio data is determined when it is recognized; the speech frame length is generally set to 20 ms-30 ms, and in this embodiment it is 20 ms. The phonemes contained in the audio data of each speech frame are then recognized: the audio data of the target speech sample is recognized according to the preset phoneme numbers and the speech frame length to obtain a phoneme recognition result, and the range in which the phonemes of the keyword appear in that result, that is, the start point and end point of the keyword's phonemes in the phoneme recognition result, is determined. The start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are then determined from the set speech frame length and the number of phoneme numbers corresponding to the start point and end point in the phoneme recognition result.
As an example, for the target speech sample "财政金融政策紧随其后而来" with the keyword "金融", the waveform of the audio data is shown in FIG. 2C, and the phonemes corresponding to "金融" are j, in, r, and ong. Because a short silence may occur between two characters when the user speaks, there may be some silence between "金" and "融" in the keyword contained in the audio data. Suppose silence is numbered "1", j is numbered "17", in is numbered "23", r is numbered "18", ong is numbered "27", and the speech frame length is 20 ms. Recognizing the audio data by phoneme number and frame length gives the phoneme recognition result "1 1 1 1 1 … 17 17 17 17 23 23 23 23 23 23 23 1 18 18 18 27 27 27 27 27 27 …", where each number corresponds to one speech frame. It can be observed that number 17, the phoneme "j" of "金", spans 4 frames; number 23, "in", spans 7 frames; number 18, the phoneme "r" of "融", spans 3 frames; and number 27, "ong", spans 6 frames. The first frame of "j" is the 63rd frame of the whole phoneme recognition result, so the start time point of "金" in the audio data is 62 × 20 ms = 1.24 s; "金" lasts 11 frames in total, so its duration is 11 × 20 ms = 0.22 s. Likewise, the start time point of "融" in the audio data is 1.24 s + 0.22 s + 20 ms = 1.48 s, and "融" lasts 9 frames in the phoneme recognition result, a duration of 9 × 20 ms = 0.18 s. The total duration of "金融" in the audio data of the target speech sample is therefore 0.22 s + 20 ms + 0.18 s = 0.42 s, so the start time point of the keyword "金融" in the audio data of the target speech sample is 1.24 s and the end time point is 1.66 s.
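The frame-counting arithmetic of this example can be expressed as the following Python sketch, which is an illustration rather than part of the original disclosure. The per-frame phoneme IDs would in practice come from a forced alignment of the audio against its transcript; the IDs (1 = silence, 17 = j, 23 = in, 18 = r, 27 = ong) and the 20 ms frame length follow the example above.

```python
def keyword_time_span(frame_ids, keyword_ids, silence_id=1, frame_ms=20):
    """Return (start_s, end_s) of the first run of the keyword's phonemes, in order,
    allowing silence frames between them; None if the keyword is not found."""
    n = len(frame_ids)
    for i in range(n):
        if frame_ids[i] != keyword_ids[0]:
            continue
        j, matched = i, True
        for pid in keyword_ids:
            while j < n and frame_ids[j] == silence_id:   # skip silence between characters
                j += 1
            if j >= n or frame_ids[j] != pid:
                matched = False
                break
            while j < n and frame_ids[j] == pid:          # consume this phoneme's frames
                j += 1
        if matched:
            return i * frame_ms / 1000.0, j * frame_ms / 1000.0
    return None

# Frame sequence of the example: 62 silence frames, then j(4), in(7), silence(1), r(3), ong(6).
frames = [1] * 62 + [17] * 4 + [23] * 7 + [1] + [18] * 3 + [27] * 6 + [1] * 17
print(keyword_time_span(frames, [17, 23, 18, 27]))  # -> (1.24, 1.66)
```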
S240: Intercept the audio data between the start time point and the end time point according to the start time point and the end time point to obtain the keyword speech segment.
Optionally, once the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample have been determined, the audio data segment lying between the start time point and the end time point can be cut out of the audio data. In the audio data corresponding to the target speech sample "财政金融政策紧随其后而来" above, this means cutting out the audio data segment between 1.24 s and 1.66 s, or equivalently cutting out, starting at 1.24 s, a segment lasting 0.42 s, and using it as the keyword speech segment of this embodiment; at this point the keyword speech segment contains only the speech information of the keyword "金融".
S250: Fill silence data of a preset length before the start time point and after the end time point of the keyword speech segment to obtain the keyword sample.
Optionally, when the corresponding keyword speech segment is obtained, to guarantee the independence of the keyword sample, silence data of a preset length can be filled in before and after the obtained keyword speech segment. In this embodiment the silence data may be data of value "0" with a preset speech frame length, so that an independent keyword sample is obtained that can easily be distinguished from other speech samples later.
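A minimal sketch of the interception and padding steps follows, again only as an illustration: it assumes the audio has already been decoded into a one-dimensional array of PCM samples at a known sample rate, and the 100 ms padding length stands in for the unspecified "preset length" of silence.

```python
import numpy as np

def cut_keyword_sample(audio, sample_rate, start_s, end_s, pad_s=0.1):
    """Cut audio[start_s:end_s] and pad zero-valued (silent) samples on both sides."""
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    segment = audio[start:end]
    pad = np.zeros(int(round(pad_s * sample_rate)), dtype=audio.dtype)
    return np.concatenate([pad, segment, pad])

# Illustrative usage with the 1.24 s to 1.66 s span of "金融" on 16 kHz audio.
sr = 16000
audio = np.zeros(sr * 3, dtype=np.int16)   # placeholder waveform
keyword_sample = cut_keyword_sample(audio, sr, 1.24, 1.66)
print(len(keyword_sample) / sr)            # 0.42 s of keyword plus 0.2 s of silence
```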
Taking the aishell speech recognition sample library as an example, which contains 178 hours of speech samples from 400 speakers across multiple domains, a total of 610 target speech samples containing the keyword "金融" can be found. Performing keyword interception on these 610 target speech samples with the keyword sample determining method of this embodiment yields 610 keyword samples for the keyword "金融", and thus a diversified keyword sample set, which creates favorable conditions for the subsequent training of the keyword recognition model.
In the technical solution provided by this embodiment, the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are determined, and the keyword speech segment lying between them is cut out of the audio data to obtain a keyword sample. This guarantees the diversity of the determined keyword samples without specially and repeatedly recording the keyword speech of multiple users in multiple scenes, which reduces the acquisition cost of keyword samples and improves the comprehensiveness and accuracy of keyword sample determination.
Embodiment 3
FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application. This embodiment can be applied to any situation in which the keyword contained in a user's voice command is to be recognized, and the solution of this embodiment addresses the problem of the cumbersome training process of a keyword recognition model. The speech recognition method provided in this embodiment may be executed by the speech recognition apparatus provided in the embodiments of this application; the apparatus may be implemented in software and/or hardware and integrated in the device that executes the method, which may be any intelligent terminal device such as a laptop, tablet, or desktop computer.
Referring to FIG. 3A, this embodiment may include the following steps:
S310: Acquire a voice command of a user.
In one embodiment, when the user needs to perform an operation, the user utters speech carrying the keyword corresponding to that operation, and the device generates a corresponding voice command carrying the keyword when it receives the speech. In this embodiment, matching relationships between multiple keywords and different operations are preset according to the application scenario: for example, in a short-video application, predefined keywords can be matched to different video effects, while in a live-streaming application predefined keywords can be set to send corresponding gifts in the live room.
S320: Recognize the keyword in the voice command through a keyword recognition model.
Here, the keyword recognition model is trained in advance with keyword samples determined by the keyword sample determining method provided in the embodiments of this application. Illustratively, this embodiment acquires the keyword specified in advance by the user, queries every speech sample contained in the existing speech recognition sample library, and judges whether the annotation data composing each speech sample includes the specified keyword; the speech samples whose annotation data includes the keyword are taken as target speech samples. The start time point and end time point of the keyword's phonemes in the audio data of each target speech sample are then determined from the word's phonemes, and the audio data segment between them is cut out as the keyword speech segment, so that a large number of keyword samples are obtained. In this embodiment, after keyword samples for multiple keywords have been obtained, a corresponding keyword sample library is generated; it contains, for each of the user-specified keywords, keyword samples in different scenes and from different users that contain only the keyword speech.
In one embodiment, as shown in FIG. 3B, after the keyword sample library containing keyword samples for multiple keywords in different scenes is obtained, a preset keyword recognition model can be trained with the large number of keyword samples it contains. The keyword samples corresponding to a keyword are input into the preset keyword recognition model to obtain the keyword recognition results for those samples, and the classification loss of the recognition is evaluated. When the classification loss exceeds a preset loss threshold, the keyword recognition model is corrected according to that loss, further keyword samples for the same keyword are acquired and input into the corrected model for keyword recognition, and this repeats until the classification loss no longer exceeds the preset loss threshold. The keyword samples for the next keyword in the keyword sample library are then used for training in the same way, until the keyword samples of every keyword in the library have been used, yielding the final keyword recognition model, which can then accurately recognize the keywords in arbitrary speech.
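One possible reading of this training loop is sketched below in PyTorch-style Python, purely as an illustration: the model architecture, feature extraction, data loaders, and the 0.05 loss threshold are all assumptions not fixed by this application.

```python
import torch
import torch.nn as nn

def train_keyword_model(model, loaders_by_keyword, loss_threshold=0.05, lr=1e-3, max_rounds=100):
    """Train on one keyword at a time until its mean classification loss drops below the threshold."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for keyword, loader in loaders_by_keyword.items():
        for _ in range(max_rounds):
            total, count = 0.0, 0
            for features, labels in loader:          # keyword samples for this keyword
                optimizer.zero_grad()
                loss = criterion(model(features), labels)
                loss.backward()
                optimizer.step()
                total += loss.item()
                count += 1
            if count and total / count < loss_threshold:
                break                                # move on to the next keyword
    return model
```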
Optionally, when the user's voice command is acquired in this embodiment, it can be input into the pre-trained keyword recognition model, which parses the voice command and accurately recognizes the keyword it carries, so that the corresponding operation can subsequently be performed according to that keyword.
S330: Trigger an operation corresponding to the keyword according to the keyword.
In one embodiment, after the keyword carried in the user's voice command is recognized by the keyword recognition model, the keyword is analyzed to determine the operation matching it, and the execution of that operation is triggered, realizing the corresponding voice interaction control.
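The matching relationship between keywords and operations can be thought of as a simple dispatch table, as in the following Python sketch; the keyword strings and actions are illustrative placeholders only (for example, a video effect in a short-video application or a gift in a live-streaming room).

```python
KEYWORD_ACTIONS = {
    "金融": lambda: print("show the finance-news video effect"),
    "礼物": lambda: print("send the matching gift in the live room"),
}

def handle_voice_command(recognized_keyword):
    """Trigger the operation preset for the recognized keyword, if any."""
    action = KEYWORD_ACTIONS.get(recognized_keyword)
    if action is not None:
        action()
```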
In the technical solution provided by this embodiment, a preset keyword recognition model is trained with keyword samples determined by the keyword sample determining method described above, so that the model can accurately recognize the keyword carried in a voice command and trigger the corresponding operation according to the recognized keyword. This simplifies the cumbersome collection of keyword samples during model training and reduces their acquisition cost, and recognizing the keywords carried in the user's speech with the keyword recognition model trained on those samples improves the accuracy of speech recognition.
Embodiment 4
FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application. As shown in FIG. 4, the apparatus may include:
a keyword acquisition module 410, configured to acquire a keyword;
a target speech acquisition module 420, configured to acquire a target speech sample including the keyword from an existing speech recognition sample library;
a keyword sample determining module 430, configured to determine a keyword speech segment in the target speech sample to obtain a keyword sample.
In the technical solution provided by this embodiment, target speech samples containing the keyword are acquired from an existing speech recognition sample library, and the keyword speech segments in those target speech samples are cut out to obtain keyword samples. Because the existing sample library contains a large number of speech samples from many types of users and many scene types, the acquired target speech samples containing the keyword also span multiple speech scene types, and so do the extracted keyword speech segments. Diversified keyword samples are thus obtained without specially and repeatedly recording the keyword speech of multiple users in multiple scenes, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
The keyword sample determining apparatus provided in this embodiment is applicable to the keyword sample determining method provided in any of the above embodiments of this application and has the corresponding functions and effects.
Embodiment 5
FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application. As shown in FIG. 5, the apparatus may include:
a voice command acquisition module 510, configured to acquire a voice command of a user;
a keyword recognition module 520, configured to recognize the keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus provided in the above embodiments;
an operation trigger module 530, configured to trigger an operation corresponding to the keyword according to the keyword.
In the technical solution provided by this embodiment, a preset keyword recognition model is trained with keyword samples determined by the keyword sample determining apparatus described above, so that the model can accurately recognize the keyword carried in a voice command and trigger the corresponding operation according to the recognized keyword. This simplifies the cumbersome collection of keyword samples during model training and reduces their acquisition cost, and recognizing the keywords carried in the user's speech with the keyword recognition model trained on those samples improves the accuracy of speech recognition.
The speech recognition apparatus provided in this embodiment is applicable to the speech recognition method provided in any of the above embodiments of this application and has the corresponding functions and effects.
Embodiment 6
FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application. As shown in FIG. 6, the device includes a processor 60, a storage apparatus 61, and a communication apparatus 62. The number of processors 60 in the device may be one or more, and one processor 60 is taken as an example in FIG. 6; the processor 60, the storage apparatus 61, and the communication apparatus 62 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 6.
The device provided in this embodiment can be used to execute the keyword sample determining method or the speech recognition method provided in any of the above embodiments and has the corresponding functions and effects.
Embodiment 7
Embodiment 7 of this application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the keyword sample determining method in any of the above embodiments. The method may include:
acquiring a keyword;
acquiring a target speech sample including the keyword from an existing speech recognition sample library;
determining a keyword speech segment in the target speech sample to obtain a keyword sample.
Alternatively, the program implements the speech recognition method in any of the above embodiments, which may include:
acquiring a voice command of a user;
recognizing a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining method provided in any of the above embodiments;
triggering an operation corresponding to the keyword according to the keyword.
In the storage medium containing computer-executable instructions provided by the embodiments of this application, the computer-executable instructions are not limited to the method operations described above and can also execute related operations in the keyword sample determining method or the speech recognition method provided in any embodiment of this application.
This application can be implemented by means of software and the necessary general-purpose hardware, or by hardware. This application can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes at least one instruction for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of this application.
In the above embodiments of the keyword sample determining apparatus and the speech recognition apparatus, the units and modules included are only divided according to functional logic, but the division is not limited to this as long as the corresponding functions can be realized; in addition, the names of the functional units are only intended to distinguish them from one another and are not used to limit the protection scope of this application.

Claims (10)

  1. A keyword sample determining method, comprising:
    acquiring a keyword;
    acquiring a target speech sample including the keyword from an existing speech recognition sample library;
    determining a keyword speech segment in the target speech sample to obtain a keyword sample.
  2. The method according to claim 1, wherein acquiring the target speech sample including the keyword from the existing speech recognition sample library comprises:
    in the existing speech recognition sample library, searching for a speech sample whose annotation data includes the keyword, and taking the found speech sample as the target speech sample.
  3. The method according to claim 1, wherein determining the keyword speech segment in the target speech sample comprises:
    determining a start time point and an end time point of phonemes of the keyword within phonemes of audio data of the target speech sample;
    intercepting audio data between the start time point and the end time point according to the start time point and the end time point to obtain the keyword speech segment.
  4. The method according to any one of claims 1 to 3, wherein obtaining the keyword sample comprises:
    filling silence data of a preset length before a start time point and after an end time point of the keyword speech segment to obtain the keyword sample.
  5. A speech recognition method, comprising:
    acquiring a voice command of a user;
    recognizing a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining method according to any one of claims 1 to 4;
    triggering an operation corresponding to the keyword according to the keyword.
  6. A keyword sample determining apparatus, comprising:
    a keyword acquisition module, configured to acquire a keyword;
    a target speech acquisition module, configured to acquire a target speech sample including the keyword from an existing speech recognition sample library;
    a keyword sample determining module, configured to determine a keyword speech segment in the target speech sample to obtain a keyword sample.
  7. The apparatus according to claim 6, wherein the target speech acquisition module is configured to:
    in the existing speech recognition sample library, search for a speech sample whose annotation data includes the keyword, and take the found speech sample as the target speech sample.
  8. A speech recognition apparatus, comprising:
    a voice command acquisition module, configured to acquire a voice command of a user;
    a keyword recognition module, configured to recognize a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus according to claim 6 or 7;
    an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.
  9. A device, comprising:
    one or more processors; and
    a storage apparatus, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the keyword sample determining method according to any one of claims 1 to 4, or to implement the speech recognition method according to claim 5.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the keyword sample determining method according to any one of claims 1 to 4, or implements the speech recognition method according to claim 5.
PCT/CN2020/077912 2019-03-13 2020-03-05 Keyword sample determining method, voice recognition method and apparatus, device, and medium WO2020182042A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910189413.1A CN109979440B (en) 2019-03-13 2019-03-13 Keyword sample determination method, voice recognition method, device, equipment and medium
CN201910189413.1 2019-03-13

Publications (1)

Publication Number Publication Date
WO2020182042A1 true WO2020182042A1 (en) 2020-09-17

Family

ID=67078805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077912 WO2020182042A1 (en) 2019-03-13 2020-03-05 Keyword sample determining method, voice recognition method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN109979440B (en)
WO (1) WO2020182042A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979440B (en) * 2019-03-13 2021-05-11 广州市网星信息技术有限公司 Keyword sample determination method, voice recognition method, device, equipment and medium
CN110689895B (en) * 2019-09-06 2021-04-02 北京捷通华声科技股份有限公司 Voice verification method and device, electronic equipment and readable storage medium
CN110675896B (en) * 2019-09-30 2021-10-22 北京字节跳动网络技术有限公司 Character time alignment method, device and medium for audio and electronic equipment
CN111833856B (en) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN113515454A (en) * 2021-07-01 2021-10-19 深圳创维-Rgb电子有限公司 Test case generation method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040002868A1 (en) * 2002-05-08 2004-01-01 Geppert Nicolas Andre Method and system for the processing of voice data and the classification of calls
US20040006464A1 (en) * 2002-05-08 2004-01-08 Geppert Nicolas Andre Method and system for the processing of voice data by means of voice recognition and frequency analysis
CN1889170A (en) * 2005-06-28 2007-01-03 国际商业机器公司 Method and system for generating synthesized speech base on recorded speech template
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN108009303A (en) * 2017-12-30 2018-05-08 北京百度网讯科技有限公司 Searching method, device, electronic equipment and storage medium based on speech recognition
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN109979440A (en) * 2019-03-13 2019-07-05 广州市网星信息技术有限公司 Keyword sample determines method, audio recognition method, device, equipment and medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1113330C (en) * 1997-08-15 2003-07-02 英业达股份有限公司 Phoneme regulating method for phoneme synthesis
CN104700832B (en) * 2013-12-09 2018-05-25 联发科技股份有限公司 Voiced keyword detecting system and method
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
KR101703214B1 (en) * 2014-08-06 2017-02-06 주식회사 엘지화학 Method for changing contents of character data into transmitter's voice and outputting the transmiter's voice
US9959863B2 (en) * 2014-09-08 2018-05-01 Qualcomm Incorporated Keyword detection using speaker-independent keyword models for user-designated keywords
CN104517605B (en) * 2014-12-04 2017-11-28 北京云知声信息技术有限公司 A kind of sound bite splicing system and method for phonetic synthesis
CN105100460A (en) * 2015-07-09 2015-11-25 上海斐讯数据通信技术有限公司 Method and system for controlling intelligent terminal by use of sound
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN105117384A (en) * 2015-08-19 2015-12-02 小米科技有限责任公司 Classifier training method, and type identification method and apparatus
CN107451131A (en) * 2016-05-30 2017-12-08 贵阳朗玛信息技术股份有限公司 A kind of audio recognition method and device
CN107040452B (en) * 2017-02-08 2020-08-04 浙江翼信科技有限公司 Information processing method and device and computer readable storage medium
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN109065046A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium that voice wakes up

Also Published As

Publication number Publication date
CN109979440A (en) 2019-07-05
CN109979440B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2020182042A1 (en) Keyword sample determining method, voice recognition method and apparatus, device, and medium
CN110322869B (en) Conference character-division speech synthesis method, device, computer equipment and storage medium
CN110517689B (en) Voice data processing method, device and storage medium
CN109616096B (en) Construction method, device, server and medium of multilingual speech decoding graph
US7412383B1 (en) Reducing time for annotating speech data to develop a dialog application
CN105931644A (en) Voice recognition method and mobile terminal
TW201203222A (en) Voice stream augmented note taking
KR20030078388A (en) Apparatus for providing information using voice dialogue interface and method thereof
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110047469B (en) Voice data emotion marking method and device, computer equipment and storage medium
CN111798833A (en) Voice test method, device, equipment and storage medium
CN112435653A (en) Voice recognition method and device and electronic equipment
CN111897511A (en) Voice drawing method, device, equipment and storage medium
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN108877779B (en) Method and device for detecting voice tail point
WO2020233381A1 (en) Speech recognition-based service request method and apparatus, and computer device
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN113343108A (en) Recommendation information processing method, device, equipment and storage medium
Lopez-Otero et al. Efficient query-by-example spoken document retrieval combining phone multigram representation and dynamic time warping
CN115174285A (en) Conference record generation method and device and electronic equipment
Le et al. Automatic quality estimation for speech translation using joint ASR and MT features
CN110264994B (en) Voice synthesis method, electronic equipment and intelligent home system
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
US20230106550A1 (en) Method of processing speech, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1