WO2020182042A1 - Keyword sample determination method, speech recognition method, and apparatus, device, and medium - Google Patents

Keyword sample determination method, speech recognition method, and apparatus, device, and medium

Info

Publication number
WO2020182042A1
WO2020182042A1 (PCT/CN2020/077912)
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
voice
sample
speech
recognition
Prior art date
Application number
PCT/CN2020/077912
Other languages
English (en)
Chinese (zh)
Inventor
李敬
Original Assignee
广州市网星信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市网星信息技术有限公司 filed Critical 广州市网星信息技术有限公司
Publication of WO2020182042A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/088 Word spotting
    • G10L 2015/223 Execution procedure of a spoken command

Definitions

  • The embodiments of the application relate to the field of speech recognition technology, for example, to a method for determining a keyword sample, a speech recognition method, an apparatus, a device, and a medium.
  • Keyword spotting (KWS) technology uses neural-network-based methods to recognize the keywords carried in speech; this requires collecting a large amount of audio data containing the predefined keywords and non-keywords.
  • The parameters of the constructed neural network are trained, validated, and tested on this audio data, so that the network can accurately recognize the keyword information in the user's voice.
  • In the related art, the keyword training set is obtained by manually recording the corresponding keyword voices to collect a large amount of audio data, which is costly; moreover, the recording environment of the collected audio data is required to be consistent with the environment in which the predefined keywords are actually used, which leads to certain limitations in generating samples for multiple types of keywords.
  • The embodiments of the present application provide a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium, so as to improve the comprehensiveness of keyword sample determination and the accuracy of speech recognition.
  • An embodiment of the application provides a method for determining a keyword sample, the method including:
  • acquiring a keyword, and acquiring target speech samples including the keyword from an existing speech recognition sample library;
  • determining the keyword speech segment in the target speech sample to obtain the keyword sample.
  • An embodiment of the present application provides a speech recognition method, which includes:
  • acquiring a user's voice instruction, and recognizing the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determination method;
  • triggering the operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a keyword sample determining device, which includes:
  • a keyword acquisition module, configured to acquire a keyword;
  • a target voice acquisition module, configured to acquire target speech samples including the keyword from an existing speech recognition sample library;
  • a keyword sample determining module, configured to determine the keyword speech segment in the target speech sample to obtain the keyword sample.
  • An embodiment of the present application provides a speech recognition device, which includes:
  • a voice instruction acquisition module, configured to acquire a user's voice instruction;
  • a keyword recognition module, configured to recognize keywords in the voice instruction through a keyword recognition model, the model being trained in advance with keyword samples determined by the keyword sample determining device;
  • an operation trigger module, configured to trigger the operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a device, which includes:
  • one or more processors;
  • a storage device, configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword sample determination method described in this application, or the speech recognition method described in this application.
  • The embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the keyword sample determination method described in this application, or the speech recognition method described in this application, is implemented.
  • FIG. 1A is a flowchart of a method for determining a keyword sample provided in Embodiment 1 of this application;
  • FIG. 1B is a schematic diagram of the principle of determining keyword samples in the method provided in Embodiment 1 of this application;
  • FIG. 2A is a flowchart of a method for determining a keyword sample provided in Embodiment 2 of this application;
  • FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application;
  • FIG. 2C is a schematic diagram of a waveform of audio data in a speech sample in the method provided in Embodiment 2 of this application;
  • FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application;
  • FIG. 3B is a schematic diagram of the principle of the speech recognition process in the method provided in Embodiment 3 of this application;
  • FIG. 4 is a schematic structural diagram of a keyword sample determining device provided in Embodiment 4 of this application;
  • FIG. 5 is a schematic structural diagram of a speech recognition device provided in Embodiment 5 of this application;
  • FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
  • Since voice interaction control is carried out by recognizing the keywords carried in the user's voice, keyword recognition has been widely used in the field of speech recognition, and the keywords may be any words users are interested in in daily life; however, the publicly available keyword data sets are generally limited to keywords released by a few companies or institutions for scientific research, cannot cover the keywords of interest in daily life, and it is difficult to find voice data sets for arbitrary keywords of interest.
  • By contrast, the training data set of a general speech recognition system covers much more content. Therefore, in the embodiments of this application, an existing speech recognition sample library is used to find the target speech samples containing the corresponding keyword, and the corresponding keyword speech segments are intercepted from those speech samples to obtain the keyword samples.
  • The keyword recognition model trained on these keyword samples can then identify the keywords contained in the corresponding user's voice and improve the accuracy of speech recognition.
  • FIG. 1A is a flowchart of a method for determining a keyword sample according to Embodiment 1 of the application.
  • This embodiment can be applied to any situation where a keyword sample for model training needs to be determined.
  • The solution of this embodiment is applicable to solving the problems of the high acquisition cost and the limited coverage of keyword samples.
  • The keyword sample determination method provided in this embodiment can be executed by the keyword sample determination device provided in the embodiments of this application; the device can be implemented by software and/or hardware, and is integrated in the device that executes the method.
  • the device can be any kind of smart terminal device, such as a laptop, tablet, or desktop.
  • the method may include the following steps:
  • S110 Acquire keywords. The keyword refers to any word of interest to users in daily life, set in advance by the developer according to the voice interaction requirements; by recognizing the keyword in the user's voice, the corresponding trigger operation can be executed.
  • When performing voice interaction control through keyword recognition technology, the developer specifies a keyword according to the development requirements of the voice interaction, indicating that the corresponding trigger operation is achieved through that keyword; the developer then inputs the specified keyword into the device that executes the keyword sample determination method of this embodiment, so that the device obtains the keyword predefined by the developer, the corresponding keyword samples can subsequently be generated automatically, and a recognition model for the specified keyword can then be trained.
  • S120 Obtain target voice samples including keywords from an existing voice recognition sample library.
  • The speech recognition sample library refers to a library pre-built during the development of speech recognition technology to store a large number of user voices in multiple fields, that is, a large-vocabulary collection of user speech samples in multiple scenarios provided by a Large Vocabulary Continuous Speech Recognition (LVCSR) system.
  • The library may be a speech recognition tool library, such as the multi-type speech toolkits under a speech recognition framework such as Kaldi, Sphinx, or HTK.
  • The keyword can be searched for in the existing speech recognition sample library, that is, in the multi-scenario user speech provided by the large-vocabulary continuous speech recognition system.
  • Because the existing speech recognition sample library includes a large number of user voices in multiple scenarios, the obtained target speech samples are diverse speech samples across multiple scenarios, and the number of target speech samples obtained from the existing speech recognition sample library is large enough to later build a training sample set for training the keyword recognition model.
  • Obtaining target speech samples including the keyword from an existing speech recognition sample library may include: searching the existing speech recognition sample library for speech samples whose annotation data contains the keyword, and using the found speech samples as the target speech samples.
  • The speech samples contained in the existing speech recognition sample library may consist of two parts: audio data and annotation data. The audio data represents the frequency, amplitude change, and duration of the user's voice in the speech sample, and each piece of audio data can be displayed as the recorded sound waveform of the corresponding user's voice; the annotation data records the number of the speech sample and the text of the voice content.
  • During the search, the existing speech recognition sample library is queried by traversing each speech sample it contains and analyzing the annotation data constituting each speech sample.
  • Take the Kaldi speech recognition framework as an example to illustrate the search process.
  • A large number of public speech recognition sample libraries are provided under the Kaldi framework, such as the aishell and thchs30 sample libraries in Chinese, and the wsj and librispeech sample libraries in English; each existing speech recognition sample library contains a large number of speech samples composed of audio data and annotation data.
  • An annotation entry is, for example, "BAC009S0002W0130 fiscal and financial policies follow immediately", where "BAC009S0002W0130" is the number of the speech sample to which the annotation belongs, clearly indicating the matching relationship between the annotation data and the speech sample, and "fiscal and financial policies follow immediately" is the text of the content contained in that speech sample.
  • Assuming the acquired keyword is "finance", the existing speech recognition sample library is queried by traversing the annotation data of the speech samples it contains, and the speech samples whose annotation data includes the keyword "finance" are extracted, such as the speech sample in the example above; the found speech samples are used as the target speech samples.
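  • The traversal of annotation data described above can be sketched as follows. This is an illustrative Python sketch, assuming Kaldi-style annotation lines of the form "<utterance-id> <transcript>"; the function name and the sample lines in the usage note are invented for the example.

```python
def find_target_samples(annotation_lines, keyword):
    """Return (utterance-id, transcript) pairs whose transcript contains the keyword."""
    targets = []
    for line in annotation_lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed annotation lines
        utt_id, transcript = parts
        if keyword in transcript:
            targets.append((utt_id, transcript))
    return targets
```

  • `annotation_lines` can be any iterable of lines, e.g. an open file handle over a Kaldi `text` file, so the whole sample library is traversed line by line.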
  • S130 Determine a keyword voice segment in the target voice sample to obtain a keyword sample.
  • The keyword speech segment refers to a speech segment that carries only the voice corresponding to the specified keyword, with no voice corresponding to any other content.
  • In this embodiment, the target speech sample is recognized through a speech recognition technique to obtain a recognition result representing the acoustic feature information of the target speech sample; the speech range of the keyword contained in the target speech sample is determined according to the recognition result, and the keyword speech segment within that range is intercepted from the target speech sample.
  • Since the keyword speech segment contains only the content and acoustic feature information of the keyword, and no information other than the keyword, it is used as the keyword sample in this embodiment.
  • Because a large number of target speech samples whose annotation data includes the specified keyword can be obtained across multiple scenarios, the number of keyword speech segments determined from the target speech samples is also large enough; keyword samples covering multiple scenarios can thus be obtained, so that the corresponding keyword recognition model can subsequently be trained on keyword samples from multiple scenarios.
  • The technical solution provided in this embodiment obtains target speech samples containing the keyword from an existing speech recognition sample library and intercepts the keyword speech segments from the target speech samples to obtain keyword samples.
  • Because the existing speech recognition sample library contains a large number of speech samples from multiple users and multiple scenarios, the target speech samples containing the keyword also cover multiple voice scenarios, and the extracted keyword speech segments are correspondingly diverse.
  • Diversified keyword samples can therefore be obtained without repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • FIG. 2A is a flowchart of a method for determining a keyword sample provided in Embodiment 2 of this application, and FIG. 2B is a schematic diagram of the principle of the method. This embodiment is based on the technical solution provided in the foregoing embodiment and explains the process of determining the keyword speech segment in the target speech sample.
  • this embodiment may include the following steps:
  • S220 Acquire target voice samples including keywords from an existing voice recognition sample library.
  • S230 Determine the start time point and the end time point of the keyword's phonemes in the audio data of the target speech sample.
  • A phoneme is the smallest phonetic unit, divided according to speech attributes and analyzable from the articulation of the user's voice; in this embodiment, the phonemes may be the initials and finals that make up speech.
  • A number is assigned in advance to each phoneme and stored in a phoneme table, so that the target speech sample can subsequently be recognized according to the number of each phoneme.
  • The audio data of the target speech sample is data representing characteristics of the sound signal, such as the frequency, amplitude change, and duration of the user's voice, that is, voice data lasting for a period of time; every word uttered by the user in the audio data matches a corresponding start and end time range.
  • The start time point is the time point in the audio data of the target speech sample at which the user starts to pronounce the keyword, and the end time point is the time point at which the user finishes pronouncing the keyword.
  • When a target speech sample whose annotation data includes the keyword is obtained in this embodiment, speech recognition is performed on the audio data constituting the target speech sample. Since the audio data is voice feature data lasting for a period of time and is a quasi-steady-state speech signal, the framing of the audio data is determined first.
  • The speech frame length is usually set to 20 ms-30 ms; in this embodiment, a frame length of 20 ms is used, and the phonemes contained in the audio data of each speech frame are then recognized.
  • The audio data in the target speech sample is recognized according to the preset phoneme numbers and the set frame length to obtain a corresponding phoneme recognition result; the range in which the keyword's phonemes appear in the phoneme recognition result, that is, the starting point and ending point of the keyword's phonemes, is determined, and then, according to the set frame length and the number of frames occupied by each phoneme between the starting point and the ending point, the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are determined.
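  • A minimal sketch of the framing step, assuming 16 kHz mono samples and non-overlapping 20 ms frames; both values are illustrative, and real front ends often use overlapping frames with a shorter frame shift.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=20):
    """Split a sample sequence into consecutive, non-overlapping speech frames."""
    frame_len = sample_rate * frame_ms // 1000  # 320 samples per 20 ms frame at 16 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

  • Phoneme recognition then runs per frame, producing the sequence of phoneme numbers referred to above.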
  • For example, the keyword is "finance" (金融), and the waveform corresponding to the audio data is shown in FIG. 2C; the phonemes corresponding to the keyword "finance" are j, in, r, and ong.
  • There may be a short period of silence between two characters when the user speaks, so there will be some silence between "金" (jin) and "融" (rong) in the keyword contained in the audio data.
  • Assume the preset number of silence is "1", the number of j is "17", the number of in is "23", the number of r is "18", the number of ong is "27", and the speech frame length is 20 ms.
  • In the phoneme recognition result, each number corresponds to the length of one speech frame. Suppose the number 17 of the phoneme "j" corresponding to "金" occupies 4 frames, the number 23 of "in" occupies 7 frames, one silence frame lies between the two characters, the number 18 of the phoneme "r" occupies 3 frames, and the number 27 of "ong" occupies 6 frames.
  • If the first frame of the phoneme "j" corresponding to "金" is the 63rd frame in the entire phoneme recognition result, the start time point in the audio data of the speech sample is 1.24 s and the end time point is 1.66 s.
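  • The arithmetic of the example can be checked with a small helper; the single silence frame between the two characters is an assumption made here so that the totals match the 1.24 s and 1.66 s figures in the text.

```python
FRAME_S = 0.02  # 20 ms speech frame length

def keyword_time_range(first_frame, frame_counts):
    """first_frame is 1-based; frame_counts lists the frames occupied by each
    phoneme (and any silence) of the keyword, in order."""
    start = (first_frame - 1) * FRAME_S
    end = start + sum(frame_counts) * FRAME_S
    return round(start, 2), round(end, 2)

# j=4, in=7, silence=1, r=3, ong=6 frames, starting at frame 63
print(keyword_time_range(63, [4, 7, 1, 3, 6]))  # → (1.24, 1.66)
```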
  • S240 Intercept the audio data between the start time point and the end time point to obtain a keyword speech segment.
  • The audio data segment located between the start time point and the end time point is cut out; that is, in the audio data of the target speech sample "fiscal and financial policies follow immediately" above, the segment between 1.24 s and 1.66 s is intercepted, or, equivalently, a segment with a duration of 0.42 s starting from 1.24 s is intercepted, and it is used as the keyword speech segment of this embodiment. At this time, the keyword speech segment contains only the voice information of the keyword "finance".
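  • The interception itself is a slice of the sample array; a sketch assuming 16 kHz audio (the sample rate is illustrative and not stated in the text):

```python
def cut_segment(samples, start_s, end_s, sample_rate=16000):
    """Return the audio samples lying between start_s and end_s (in seconds)."""
    # round() avoids off-by-one indices from floating-point time values
    return samples[round(start_s * sample_rate):round(end_s * sample_rate)]
```

  • For the example above, `cut_segment(samples, 1.24, 1.66)` keeps 0.42 s of audio, i.e. 6720 samples at 16 kHz.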
  • S250 Fill the silence data of a preset length before the start time point of the keyword speech segment and after the end time point to obtain a keyword sample.
  • Silence data of a preset length can be filled in before and after the obtained keyword speech segment.
  • The silence data in this embodiment may be zero-valued data of a preset speech frame length, so as to obtain an independent keyword sample that is easy to distinguish from other speech samples subsequently.
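  • Filling the silence can be sketched as zero-sample padding; the pad length in frames is illustrative, since the text only specifies "a preset length".

```python
def pad_with_silence(segment, pad_frames=5, sample_rate=16000, frame_ms=20):
    """Fill zero-valued silence of a preset number of frames before and after the segment."""
    pad = [0] * (pad_frames * sample_rate * frame_ms // 1000)
    return pad + list(segment) + pad
```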
  • Take the aishell speech recognition sample library as an example: it contains 178 hours of speech samples from 400 speakers across multiple fields, and a total of 610 target speech samples containing the keyword "finance" can be found in it.
  • Performing keyword interception on the 610 found target speech samples with the keyword sample determination method of this example yields 610 keyword samples of the keyword "finance", and thus a diversified keyword sample set, which creates favorable conditions for the subsequent training of the keyword recognition model.
  • The technical solution provided in this embodiment determines the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample, and intercepts the audio data of the target speech sample between the start time point and the end time point as the keyword speech segment to obtain keyword samples, ensuring the diversification of the keyword samples; there is no need to generate keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness and accuracy of keyword sample determination.
  • FIG. 3A is a flowchart of a voice recognition method provided in Embodiment 3 of this application.
  • This embodiment can be applied to any situation of recognizing keywords included in a user's voice instruction.
  • The solution of this embodiment is applicable to solving the problem of a cumbersome training process for the keyword recognition model.
  • The speech recognition method provided in this embodiment can be executed by the speech recognition device provided in the embodiments of this application.
  • The device can be implemented by software and/or hardware, and is integrated in the device that executes the method.
  • The device may be any kind of smart terminal device, such as a laptop, tablet, or desktop.
  • this embodiment may include the following steps:
  • S310 Acquire the user's voice instruction. When the user needs to perform an operation, the user utters a voice carrying the keyword corresponding to the operation; on receiving the voice uttered by the user, the device generates a corresponding voice instruction, and the voice instruction carries the corresponding keyword.
  • The matching relationship between keywords and operations is preset according to the application scenario. For example, in a short-video application, different predefined keywords can be bound to different video effects; in a live-streaming application, predefined keywords can be set to present the corresponding gifts in the live broadcast room.
  • S320 Recognize the keywords in the voice instruction through the keyword recognition model.
  • the keyword recognition model is trained in advance by keyword samples determined by the keyword sample determination method provided in the embodiments of the present application.
  • In this embodiment, the keyword pre-specified by the user is acquired; each speech sample included in the existing speech recognition sample library is queried to determine whether the annotation data constituting the speech sample includes the specified keyword; the speech samples whose annotation data includes the keyword are taken as target speech samples; the start time point and end time point of the keyword's phonemes in the audio data of each target speech sample are determined according to those phonemes; and the audio data segment between the start time point and the end time point is cut out as the keyword speech segment, thereby obtaining a large number of keyword samples.
  • A corresponding keyword sample library is then generated, which contains, for each keyword specified by the user, keyword samples from different users and different scenarios whose voices contain only the keyword.
  • The large number of keyword samples contained in the keyword sample library is then used to train the preset keyword recognition model: the keyword recognition results corresponding to the keyword samples are obtained, and the classification loss of the recognition is computed. If the classification loss exceeds a preset loss threshold, the keyword recognition model is corrected according to the classification loss, further keyword samples of the keyword continue to be fed into the corrected model for recognition, and this is repeated until the classification loss no longer exceeds the preset loss threshold; the keyword samples of the next keyword in the keyword sample library are then obtained and trained in the same way, until the keyword samples of every keyword contained in the keyword sample library have been trained, yielding the final keyword recognition model, which can accurately recognize the keywords in any speech.
  • When a voice instruction is acquired, it can be input into the pre-trained keyword recognition model, which parses the voice instruction to accurately recognize the keyword carried in it, so that the corresponding operation can be performed according to the keyword.
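  • The threshold-driven loop described above can be illustrated with a toy stand-in; the one-parameter "model", the squared-error loss, and the gradient-step correction below are invented for the sketch and are not the embodiment's actual network.

```python
def train_until_threshold(samples, labels, loss_threshold=0.01, lr=0.1, max_iters=1000):
    """Repeat: score the keyword samples, compute the loss, correct the model,
    until the loss no longer exceeds the preset threshold."""
    w = 0.0  # single model parameter (toy stand-in for the network weights)
    loss = float("inf")
    for _ in range(max_iters):
        loss = sum((w * x - y) ** 2 for x, y in zip(samples, labels)) / len(samples)
        if loss <= loss_threshold:
            break  # loss no longer exceeds the threshold: stop correcting
        grad = sum(2 * (w * x - y) * x for x, y in zip(samples, labels)) / len(samples)
        w -= lr * grad  # correct the model according to the loss
    return w, loss
```

  • In the embodiment, this per-keyword loop would simply be repeated for each keyword in the keyword sample library.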
  • S330 Trigger an operation corresponding to the keyword according to the keyword.
  • After the keyword carried in the voice instruction is recognized, the keyword is analyzed to determine the operation matching it, and the execution of that operation is triggered, thereby achieving the corresponding voice interaction control.
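  • The keyword-to-operation matching can be sketched as a lookup table; the keywords and operation names below are invented examples following the short-video and live-room scenarios mentioned earlier.

```python
# Hypothetical keyword-to-operation table; entries are illustrative only.
OPERATIONS = {
    "finance": lambda: "apply finance-themed video effect",
    "gift": lambda: "present the corresponding gift in the live room",
}

def trigger_operation(keyword):
    """Trigger the operation matched to the recognized keyword, if any."""
    op = OPERATIONS.get(keyword)
    return op() if op is not None else None  # unknown keywords trigger nothing
```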
  • The technical solution provided in this embodiment trains a preset keyword recognition model on the keyword samples determined by the keyword sample determination method described above, so that the keyword recognition model can accurately recognize the keywords carried in voice instructions and then trigger the execution of the corresponding operations based on the recognized keywords; this simplifies the cumbersome collection of keyword samples during model training and reduces the acquisition cost of keyword samples.
  • Recognizing the keywords carried in the corresponding user's voice with the keyword recognition model trained on these keyword samples improves the accuracy of speech recognition.
  • FIG. 4 is a schematic structural diagram of a keyword sample determining device provided in Embodiment 4 of this application. As shown in FIG. 4, the device may include:
  • the keyword acquisition module 410 is configured to acquire a keyword;
  • the target voice acquisition module 420 is configured to acquire target voice samples including keywords from an existing voice recognition sample library
  • the keyword sample determining module 430 is configured to determine the keyword voice segment in the target voice sample to obtain the keyword sample.
  • The technical solution provided in this embodiment obtains target speech samples containing the keyword from an existing speech recognition sample library and intercepts the keyword speech segments from the target speech samples to obtain keyword samples.
  • Because the existing speech recognition sample library contains a large number of speech samples from multiple users and multiple scenarios, the target speech samples containing the keyword also cover multiple voice scenarios, and the extracted keyword speech segments are correspondingly diverse.
  • Diversified keyword samples can therefore be obtained without repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • the keyword sample determining device provided in this embodiment is applicable to the keyword sample determining method provided in any embodiment of the present application, and has corresponding functions and effects.
  • FIG. 5 is a schematic structural diagram of a speech recognition device provided in Embodiment 5 of this application. As shown in FIG. 5, the device may include:
  • the voice instruction acquiring module 510 is configured to acquire the user's voice instruction
  • the keyword recognition module 520 is configured to recognize keywords in voice instructions through a keyword recognition model, which is trained in advance by keyword samples determined by the keyword sample determining device provided in the above-mentioned embodiment;
  • the operation trigger module 530 is configured to trigger the operation corresponding to the keyword according to the keyword.
  • The technical solution provided in this embodiment trains a preset keyword recognition model on the keyword samples determined by the above keyword sample determining device, so that the keyword recognition model can accurately recognize the keywords carried in voice instructions and then trigger the execution of the corresponding operations based on the recognized keywords; this simplifies the cumbersome collection of keyword samples during model training and reduces the acquisition cost of keyword samples.
  • Recognizing the keywords carried in the corresponding user's voice with the keyword recognition model trained on these keyword samples improves the accuracy of speech recognition.
  • the voice recognition device provided in this embodiment is applicable to the voice recognition method provided in any embodiment of the above application, and has corresponding functions and effects.
  • FIG. 6 is a schematic structural diagram of a device provided by Embodiment 6 of this application.
  • the device includes a processor 60, a storage device 61, and a communication device 62; the number of processors 60 in the device may be one or more.
  • one processor 60 is taken as an example in FIG. 6; the processor 60, the storage device 61, and the communication device 62 in the device may be connected by a bus or other means. In FIG. 6, connection by a bus is taken as an example.
  • the device provided in this embodiment can be used to execute the keyword sample determination method or the voice recognition method provided in any of the foregoing embodiments, and has corresponding functions and effects.
  • the seventh embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the keyword sample determination method in any of the foregoing embodiments can be implemented.
  • An embodiment of the application provides a storage medium containing computer-executable instructions.
  • the computer-executable instructions are not limited to the method operations described above, and can also execute relevant operations in the keyword sample determination method or the voice recognition method provided by any embodiment of this application.
  • This application can be implemented with the help of software and necessary general-purpose hardware, or can be implemented with hardware.
  • This application can be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes at least one instruction to enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
  • the units and modules included are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized;
  • the names of the functional units are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of this application.

Abstract

The present invention relates to a keyword sample determination method, a voice recognition method, and an apparatus, a device and a medium. The keyword sample determination method comprises: obtaining a keyword; obtaining, from an existing voice recognition sample library, a target voice sample containing the keyword; and determining a keyword voice segment in the target voice sample so as to obtain a keyword sample.
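The three steps of the abstract can be sketched as follows. This is an illustrative sketch only: it assumes each library entry carries a per-word time alignment, which the abstract does not specify, and the function name `determine_keyword_samples` and the string representation of audio are hypothetical:

```python
# Hypothetical sketch of the abstract's three steps:
# 1) obtain a keyword; 2) obtain, from an existing voice recognition sample
# library, target voice samples containing the keyword; 3) determine the
# keyword voice segment in each target sample to obtain keyword samples.
# Audio is represented here as a sliceable string for illustration.

def determine_keyword_samples(keyword, sample_library):
    keyword_samples = []
    for sample in sample_library:
        for word, start, end in sample["alignment"]:
            if word == keyword:  # step 2: this sample contains the keyword
                # step 3: the aligned span is the keyword voice segment
                keyword_samples.append((sample["audio"][start:end], start, end))
    return keyword_samples

library = [
    {"audio": "....hello....", "alignment": [("hi", 0, 4), ("hello", 4, 9)]},
    {"audio": "..........", "alignment": [("bye", 0, 10)]},
]
print(determine_keyword_samples("hello", library))  # -> [('hello', 4, 9)]
```

Because the segments are cut from an existing, already diverse sample library, no new recordings of the keyword are needed.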
PCT/CN2020/077912 2019-03-13 2020-03-05 Keyword sample determination method, voice recognition method, and apparatus, device and medium WO2020182042A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910189413.1 2019-03-13
CN201910189413.1A CN109979440B (zh) 2019-03-13 2019-03-13 关键词样本确定方法、语音识别方法、装置、设备和介质

Publications (1)

Publication Number Publication Date
WO2020182042A1 true WO2020182042A1 (fr) 2020-09-17

Family

ID=67078805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077912 WO2020182042A1 (fr) 2020-03-05 Keyword sample determination method, voice recognition method, and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN109979440B (fr)
WO (1) WO2020182042A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979440B (zh) * 2019-03-13 2021-05-11 广州市网星信息技术有限公司 关键词样本确定方法、语音识别方法、装置、设备和介质
CN110689895B (zh) * 2019-09-06 2021-04-02 北京捷通华声科技股份有限公司 语音校验方法、装置、电子设备及可读存储介质
CN110675896B (zh) * 2019-09-30 2021-10-22 北京字节跳动网络技术有限公司 用于音频的文字时间对齐方法、装置、介质及电子设备
CN111833856B (zh) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 基于深度学习的语音关键信息标定方法
CN113515454A (zh) * 2021-07-01 2021-10-19 深圳创维-Rgb电子有限公司 测试用例生成方法、装置、设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040002868A1 (en) * 2002-05-08 2004-01-01 Geppert Nicolas Andre Method and system for the processing of voice data and the classification of calls
US20040006464A1 (en) * 2002-05-08 2004-01-08 Geppert Nicolas Andre Method and system for the processing of voice data by means of voice recognition and frequency analysis
CN1889170A (zh) * 2005-06-28 2007-01-03 国际商业机器公司 基于录制的语音模板生成合成语音的方法和系统
CN105654943A (zh) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 一种语音唤醒方法、装置及系统
CN108009303A (zh) * 2017-12-30 2018-05-08 北京百度网讯科技有限公司 基于语音识别的搜索方法、装置、电子设备和存储介质
CN108182937A (zh) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 关键词识别方法、装置、设备及存储介质
CN109979440A (zh) * 2019-03-13 2019-07-05 广州市网星信息技术有限公司 关键词样本确定方法、语音识别方法、装置、设备和介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1113330C (zh) * 1997-08-15 2003-07-02 英业达股份有限公司 语音合成中的语音规整方法
CN104700832B (zh) * 2013-12-09 2018-05-25 联发科技股份有限公司 语音关键字检测系统及方法
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
KR101703214B1 (ko) * 2014-08-06 2017-02-06 주식회사 엘지화학 문자 데이터의 내용을 문자 데이터 송신자의 음성으로 출력하는 방법
US9959863B2 (en) * 2014-09-08 2018-05-01 Qualcomm Incorporated Keyword detection using speaker-independent keyword models for user-designated keywords
CN104517605B (zh) * 2014-12-04 2017-11-28 北京云知声信息技术有限公司 一种用于语音合成的语音片段拼接系统和方法
CN105100460A (zh) * 2015-07-09 2015-11-25 上海斐讯数据通信技术有限公司 一种声音操控智能终端的方法及系统
CN105096932A (zh) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 有声读物的语音合成方法和装置
CN105117384A (zh) * 2015-08-19 2015-12-02 小米科技有限责任公司 分类器训练方法、类型识别方法及装置
CN107451131A (zh) * 2016-05-30 2017-12-08 贵阳朗玛信息技术股份有限公司 一种语音识别方法及装置
CN107040452B (zh) * 2017-02-08 2020-08-04 浙江翼信科技有限公司 一种信息处理方法、装置和计算机可读存储介质
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN109065046A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 语音唤醒的方法、装置、电子设备及计算机可读存储介质


Also Published As

Publication number Publication date
CN109979440B (zh) 2021-05-11
CN109979440A (zh) 2019-07-05

Similar Documents

Publication Publication Date Title
WO2020182042A1 (fr) Keyword sample determination method, voice recognition method, and apparatus, device and medium
CN110322869B (zh) 会议分角色语音合成方法、装置、计算机设备和存储介质
CN110517689B (zh) 一种语音数据处理方法、装置及存储介质
US7860713B2 (en) Reducing time for annotating speech data to develop a dialog application
CN109616096B (zh) 多语种语音解码图的构建方法、装置、服务器和介质
CN105931644A (zh) 一种语音识别方法及移动终端
TW201203222A (en) Voice stream augmented note taking
KR20030078388A (ko) 음성대화 인터페이스를 이용한 정보제공장치 및 그 방법
CN108305618B (zh) 语音获取及搜索方法、智能笔、搜索终端及存储介质
CN110047469B (zh) 语音数据情感标注方法、装置、计算机设备及存储介质
CN111798833A (zh) 一种语音测试方法、装置、设备和存储介质
CN112435653A (zh) 语音识别方法、装置和电子设备
CN111897511A (zh) 一种语音绘图方法、装置、设备及存储介质
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN112669842A (zh) 人机对话控制方法、装置、计算机设备及存储介质
CN108877779B (zh) 用于检测语音尾点的方法和装置
WO2020233381A1 (fr) Procédé et appareil de requête de service sur la base d'une reconnaissance vocale, et dispositif informatique
CN113343108A (zh) 推荐信息处理方法、装置、设备及存储介质
Lopez-Otero et al. Efficient query-by-example spoken document retrieval combining phone multigram representation and dynamic time warping
WO2021102754A1 (fr) Dispositif et procédé de traitement de données et support de stockage
CN115174285A (zh) 会议记录生成方法、装置及电子设备
Le et al. Automatic quality estimation for speech translation using joint ASR and MT features
CN110264994B (zh) 一种语音合成方法、电子设备及智能家居系统
CN113066473A (zh) 一种语音合成方法、装置、存储介质及电子设备
EP4099320A2 (fr) Procédé et appareil de traitement de la parole, dispositif électronique, support de stockage et produit de programme

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1